Compare commits

...

98 Commits

Author SHA1 Message Date
Patryk Jędrzejczak
7164f11b99 Merge '[Backport 2025.3] Revert 24418: main.cc: fix group0 shutdown order' from Petr Gusev
This PR reverts the changes of #24418 since they can cause use-after-free.

The `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).

However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.

This PR reverts two of the three commits from #24418. The commit [e456d2d](e456d2d507) is not reverted because it only affects logging and does not impact correctness.

Fixes scylladb/scylladb#25221

Backport: this PR is a backport

Closes scylladb/scylladb#25206

* https://github.com/scylladb/scylladb:
  Revert "main.cc: fix group0 shutdown order"
  Revert "storage_service: test_group0_apply_while_node_is_being_shutdown"
2025-07-30 16:18:13 +02:00
Pavel Emelyanov
99f328b7a7 Merge '[Backport 2025.3] s3_client: Enhance s3_client error handling' from Scylladb[bot]
Enhance and fix error handling in the `chunked_download_source` to prevent errors from seeping out of the request callback. Also stop retrying on Seastar's side, since retries there would break the integrity of data that may be downloaded more than once for the same range.

Fixes: https://github.com/scylladb/scylladb/issues/25043

Should be backported to 2025.3 since we intend to release the native backup/restore feature

- (cherry picked from commit d53095d72f)

- (cherry picked from commit b7ae6507cd)

- (cherry picked from commit ba910b29ce)

- (cherry picked from commit fc2c9dd290)

Parent PR: #24883

Closes scylladb/scylladb#25137

* github.com:scylladb/scylladb:
  s3_client: Disable Seastar-level retries in HTTP client creation
  s3_test: Validate handling of non-`aws_error` exceptions
  s3_client: Improve error handling in chunked_download_source
  aws_error: Add factory method for `aws_error` from exception
2025-07-29 14:42:45 +03:00
Pavel Emelyanov
07f46a4ad5 Merge '[Backport 2025.3] storage_service: cancel all write requests after stopping transports' from Scylladb[bot]
When a node shuts down, in storage service, after storage_proxy RPCs are stopped, some write handlers within storage_proxy may still be waiting for background writes to complete. These handlers hold appropriate ERMs to block schema changes before the write finishes. After the RPCs are stopped, these writes cannot receive the replies anymore.

If, at the same time, there are RPC commands executing `barrier_and_drain`, they may get stuck waiting for these ERM holders to finish, potentially blocking node shutdown until the writes time out.

This change introduces cancellation of all outstanding write handlers from storage_service after the storage proxy RPCs were stopped.

Fixes scylladb/scylladb#23665

Backport: since this fixes a problem that frequently causes failures in CI, backport to 2025.1, 2025.2, and 2025.3.

- (cherry picked from commit bc934827bc)

- (cherry picked from commit e0dc73f52a)

Parent PR: #24714

Closes scylladb/scylladb#25170

* github.com:scylladb/scylladb:
  storage_service: Cancel all write requests on storage_proxy shutdown
  test: Add test for unfinished writes during shutdown and topology change
2025-07-29 14:42:25 +03:00
Taras Veretilnyk
a9f5e7d18f docs: fix typo in command name enbleautocompaction -> enableautocompaction
Renamed the file and updated all references from 'enbleautocompaction' to the correct 'enableautocompaction'.

Fixes scylladb/scylladb#25172

Closes scylladb/scylladb#25175

(cherry picked from commit 6b6622e07a)

Closes scylladb/scylladb#25218
2025-07-29 14:41:50 +03:00
Petr Gusev
d8f6a497a5 Revert "main.cc: fix group0 shutdown order"
This reverts commit 6b85ab79d6.
2025-07-28 17:50:38 +02:00
Petr Gusev
c98dde92db Revert "storage_service: test_group0_apply_while_node_is_being_shutdown"
This reverts commit b1050944a3.
2025-07-28 17:49:03 +02:00
Aleksandra Martyniuk
8efee38d6f tasks: do not use binary progress for task manager tasks
Currently, the progress of a parent task depends on expected_total_workload,
expected_children_number, and the children's progress. Basically, if the
total workload is known or all children have already been created, the
children's progress is summed up. Otherwise, binary progress is returned.

As a result, two tasks of the same type may report progress in different
units. If they are children of the same task, the progress gathered by
their parent becomes meaningless.

Drop expected_children_number, as we can't assume that children are able
to report their progress.

Modify the get_progress method - progress is calculated based on the
children's progress. If expected_total_workload isn't specified, the total
progress of a task may grow. If expected_total_workload isn't specified
and no children are created, empty progress (0/0) is returned.

Fixes: https://github.com/scylladb/scylladb/issues/24650.

Closes scylladb/scylladb#25113

(cherry picked from commit a7ee2bbbd8)

Closes scylladb/scylladb#25200
2025-07-28 13:11:45 +03:00
Michael Litvak
934260e9a9 storage service: drain view builder before group0
The view builder uses group0 operations to coordinate view building, so
we should drain the view builder before stopping group0.

Fixes scylladb/scylladb#25096

Closes scylladb/scylladb#25101

(cherry picked from commit 3ff388cd94)

Closes scylladb/scylladb#25198
2025-07-28 13:05:14 +03:00
Nadav Har'El
583c118ccd Merge '[Backport 2025.3] alternator: avoid oversized allocation in Query/Scan' from Scylladb[bot]
This series fixes one cause of oversized allocations - and therefore potentially stalls and increased tail latencies - in Alternator.

The first patch in the series is the main fix. The later patches are cleanups requested by reviewers; since they also touched other pre-existing code, I did those cleanups as separate patches.

Alternator's Scan and Query operations return a page of results. When the number of items is not limited by a "Limit" parameter, the default is to return a 1 MB page. If items are short, a large number of them can fit in that 1 MB. The test test_query.py::test_query_large_page_small_rows has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array "Items". Before this patch, we build the full response as a RapidJSON object before sending it. The problem is that unfortunately, RapidJSON stores arrays as contiguous allocations. This results in large contiguous allocations in workloads that scan many small items, and large contiguous allocations can also cause stalls and high tail latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e., a chunked (non-contiguous) array of items (each a JSON value). After collecting this array separately from the response object, we need to print its content without actually inserting it into the object - we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number (currently, >256) of items were scanned. When there is a smaller number of items in a page (this is typical when each item is longer), we just insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation (which is now gone), this patch also includes a new test which exercises the new code with a scan of 700 (>256) items in a page - but this new test is fast enough to be permanently in our test suite and not a manual "veryslow" test as the other test.

Fixes #23535

The stalls caused by large allocations were seen by actual users, so it makes sense to backport this patch. On the other hand, while not big, the patch is fairly intrusive (it modifies the normal Scan and Query path, and the later patches also clean up additional code), so there is some small risk involved in the backport.

- (cherry picked from commit 2385fba4b6)

- (cherry picked from commit d8fab2a01a)

- (cherry picked from commit 13ec94107a)

- (cherry picked from commit a248336e66)

Parent PR: #24480

Closes scylladb/scylladb#25194

* github.com:scylladb/scylladb:
  alternator: clean up by co-routinizing
  alternator: avoid spamming the log when failing to write response
  alternator: clean up and simplify request_return_type
  alternator: avoid oversized allocation in Query/Scan
2025-07-27 14:12:49 +03:00
Nadav Har'El
f1c5350141 alternator: clean up by co-routinizing
Reviewers of the previous patch complained about some ugly pre-existing
code in alternator/executor.cc, where returning from an asynchronous
(future) function requires lengthy, verbose casts. So this patch cleans
up a few instances of these ugly casts by using co_return instead of
return.

For example, the long and verbose

    return make_ready_future<executor::request_return_type>(
        rjson::print(std::move(response)));

can be changed to the shorter and more readable

    co_return rjson::print(std::move(response));

This patch should not have any functional implications, and also no
performance implications: I only coroutinized slow-path functions and
one function that was already "partially" coroutinized (and this was
especially ugly and deserved to be fixed).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit a248336e66)
2025-07-27 07:42:01 +00:00
Nadav Har'El
f897f38003 alternator: avoid spamming the log when failing to write response
Both make_streamed() and new make_streamed_with_extra_array() functions,
used when returning a long response in Alternator, would write an error-
level log message if it failed to write the response. This log message
is probably not helpful, and may spam the log if the application causes
repeated errors intentionally or accidentally.

So drop these log messages. The exception is still thrown as usual.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 13ec94107a)
2025-07-27 07:42:01 +00:00
Nadav Har'El
fe037663ea alternator: clean up and simplify request_return_type
The previous patch introduced a function make_streamed_with_extra_array
which was a duplicate of the existing make_streamed. Reviewers
complained about how baroque the new function is (just like the old function),
having to jump through hoops to return a copyable function working
on non-copyable objects, making strange-named copies and shared pointers
of everything.

We needed to return a copyable function (std::function) just because
Alternator used Seastar's json::json_return_type in the return type
from executor function (request_return_type). This json_return_type
contained either a sstring or an std::function, but neither was ever
really appropriate:

  1. We want to return noncopyable_function, not an std::function!
  2. We want to return an std::string (which rjson::print() returns),
     not an sstring!

So in this patch we stop using seastar::json::json_return_type
entirely in Alternator.

Alternator's request_return_type is now an std::variant of *three* types:
  1. std::string for short responses,
  2. noncopyable_function for long streamed response
  3. api_error for errors.

The ugliest parts of make_streamed() where we made copies and shared
pointers to allow for a copyable function are all gone. Even nicer, a
lot of other ugly relics of using seastar::json_return_type are gone:

1. We no longer need obscure classes and functions like make_jsonable()
   and json_string() to convert strings to response bodies - an operation
   can simply return a string directly - usually returning
   rjson::print(value) or a fixed string like "" and it just works.

2. There is no more usage of seastar::json in Alternator (except one
   minor use of seastar::json::formatter::to_json in streams.cc that
   can be removed later). Alternator uses RapidJSON for its JSON
   needs; we don't need to use random pieces from a different JSON
   library.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit d8fab2a01a)
2025-07-27 07:42:01 +00:00
Nadav Har'El
b7da50d781 alternator: avoid oversized allocation in Query/Scan
This patch fixes one cause of oversized allocations - and therefore
potentially stalls and increased tail latencies - in Alternator.

Alternator's Scan and Query operations return a page of results. When the
number of items is not limited by a "Limit" parameter, the default is
to return a 1 MB page. If items are short, a large number of them can
fit in that 1 MB. The test test_query.py::test_query_large_page_small_rows
has 30,000 items returned in a single page.

In the response JSON, all these items are returned in a single array
"Items". Before this patch, we build the full response as a RapidJSON
object before sending it. The problem is that unfortunately, RapidJSON
stores arrays as contiguous allocations. This results in large
contiguous allocations in workloads that scan many small items, and
large contiguous allocations can also cause stalls and high tail
latencies. For example, before this patch, running

    test/alternator/run --runveryslow \
        test_query.py::test_query_large_page_small_rows

reports in the log:

    oversized allocation: 573440 bytes.

After this patch, this warning no longer appears.
The patch solves the problem by collecting the scanned items not in a
RapidJSON array, but rather in a chunked_vector<rjson::value>, i.e.,
a chunked (non-contiguous) array of items (each a JSON value).
After collecting this array separately from the response object, we
need to print its content without actually inserting it into the object -
we add a new function print_with_extra_array() to do that.

The new separate-chunked-vector technique is used when a large number
(currently, >256) of items were scanned. When there is a smaller number
of items in a page (this is typical when each item is longer), we just
insert those items in the object and print it as before.

Beyond the original slow test that demonstrated the oversized allocation
(which is now gone), this patch also includes a new test which
exercises the new code with a scan of 700 (>256) items in a page -
but this new test is fast enough to be permanently in our test suite
and not a manual "veryslow" test as the other test.

Fixes #23535

(cherry picked from commit 2385fba4b6)
2025-07-27 07:42:01 +00:00
Pavel Emelyanov
7c04619ecf Merge '[Backport 2025.3] encryption_at_rest_test: Fix some spurious errors' from Scylladb[bot]
Fixes #24574

* Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert
* Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free.

- (cherry picked from commit ee98f5d361)

- (cherry picked from commit 8d37e5e24b)

Parent PR: #24633

Closes scylladb/scylladb#24772

* github.com:scylladb/scylladb:
  encryption_at_rest_test: Add exception handler to ensure proxy stop
  encryption: Ensure stopping timers in provider cache objects
2025-07-24 16:35:53 +03:00
Pavel Emelyanov
b07f4fb26b Merge '[Backport 2025.3] streaming: Avoid deadlock by running view checks in a separate scheduling group' from Scylladb[bot]
This issue happens with removenode, when RBNO is disabled, so the range
streamer is used.

The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B

Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.

Even if we didn't deadlock, and the streaming semaphore was simply exhausted
by other receiving sessions (via tracking-only permits), the query may still time out due to starvation.

To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.

Fixes #24807
Fixes #24925

- (cherry picked from commit ee2fa58bd6)

- (cherry picked from commit dff2b01237)

Parent PR: #24929

Closes scylladb/scylladb#25058

* github.com:scylladb/scylladb:
  streaming: Avoid deadlock by running view checks in a separate scheduling group
  service: migration_manager: Run group0 barrier in gossip scheduling group
2025-07-24 16:35:24 +03:00
Ran Regev
c5f4ad3665 nodetool restore: sstable list from a file
Fixes: #25045

Added the ability to supply the list of files to
restore from a given file.
Mainly required for local testing.

Signed-off-by: Ran Regev <ran.regev@scylladb.com>

Closes scylladb/scylladb#25077

(cherry picked from commit dd67d22825)

Closes scylladb/scylladb#25124
2025-07-24 16:35:04 +03:00
Ran Regev
013e0d685c docs: update nodetool restore documentation for --sstables-file-list
Fixes: #25128
A leftover from #25077

Closes scylladb/scylladb#25129

(cherry picked from commit 3d82b9485e)

Closes scylladb/scylladb#25139
2025-07-24 16:34:39 +03:00
Jakub Smolar
800f819b5b gdb: handle zero-size reads in managed_bytes
Fixes: https://github.com/scylladb/scylladb/issues/25048

Closes scylladb/scylladb#25050

(cherry picked from commit 6e0a063ce3)

Closes scylladb/scylladb#25142
2025-07-24 16:34:04 +03:00
Sergey Zolotukhin
8ac6aaadaf storage_service: Cancel all write requests on storage_proxy shutdown
During a graceful node shutdown, RPC listeners are stopped in `storage_service::drain_on_shutdown`
as one of the first steps. However, even after RPCs are shut down, some write handlers in
`storage_proxy` may still be waiting for background writes to complete. These handlers retain the ERM.
Since the RPC subsystem is no longer active, replies cannot be received, and if any RPC commands are
concurrently executing `barrier_and_drain`, they may get stuck waiting for those writes. This can block
the messaging server shutdown and delay the entire shutdown process until the write timeout occurs.

This change introduces the cancellation of all outstanding write handlers in `storage_proxy`
during shutdown to prevent unnecessary delays.

Fixes scylladb/scylladb#23665

(cherry picked from commit e0dc73f52a)
2025-07-24 13:03:32 +00:00
Sergey Zolotukhin
16a8cd9514 test: Add test for unfinished writes during shutdown and topology change
This test reproduces an issue where a topology change and an ongoing write query
during query coordinator shutdown can cause the node to get stuck.

When a node receives a write request, it creates a write handler that holds
a copy of the current table's ERM (Effective Replication Map). The ERM ensures
that no topology or schema changes occur while the request is being processed.

After the query coordinator receives the required number of replica write ACKs
to satisfy the consistency level (CL), it sends a reply to the client. However,
the write response handler remains alive until all replicas respond — the remaining
writes are handled in the background.

During shutdown, when all network connections are closed, these responses can no longer
be received. As a result, the write response handler is only destroyed once the write
timeout is reached.

This becomes problematic because the ERM held by the handler blocks topology or schema
change commands from executing. Since shutdown waits for these commands to complete,
this can lead to unnecessary delays in node shutdown and restarts, and occasional
test case failures.

Test for: scylladb/scylladb#23665

(cherry picked from commit bc934827bc)
2025-07-24 13:03:32 +00:00
Ernest Zaslavsky
e45852a595 s3_client: Disable Seastar-level retries in HTTP client creation
Prevent Seastar from retrying HTTP requests to avoid buffer double-feed
issues when an entire request is retried. This could cause data
corruption in `chunked_download_source`. The change is global for every
instance of `s3_client`, but it is still safe because:
* Seastar's `http_client` resets connections regardless of retry behavior
* `s3_client` retry logic handles all error types—exceptions, HTTP errors,
  and AWS-specific errors—via `http_retryable_client`

(cherry picked from commit fc2c9dd290)
2025-07-22 16:46:54 +00:00
Ernest Zaslavsky
fdf706a6eb s3_test: Validate handling of non-aws_error exceptions
Inject exceptions not wrapped in `aws_error` from request callback
lambda to verify they are properly caught and handled.

(cherry picked from commit ba910b29ce)
2025-07-22 16:46:53 +00:00
Ernest Zaslavsky
2bc3accf9c s3_client: Improve error handling in chunked_download_source
Create aws_error from raised exceptions when possible and respond
appropriately. Previously, non-aws_exception types leaked from the
request handler and were treated as non-retryable, causing potential
data corruption during download.

(cherry picked from commit b7ae6507cd)
2025-07-22 16:46:53 +00:00
Ernest Zaslavsky
0106d132bd aws_error: Add factory method for aws_error from exception
Move `aws_error` creation logic out of `retryable_http_client` and
into the `aws_error` class to support reuse across components.

(cherry picked from commit d53095d72f)
2025-07-22 16:46:53 +00:00
Pavel Emelyanov
53637fdf61 Merge '[Backport 2025.3] storage: add make_data_or_index_source to the storages' from Scylladb[bot]
Add `make_data_or_index_source` to the storages to utilize the new S3-based data source, which should improve restore performance

* Introduce the `encrypted_data_source` class that wraps an existing data source to read and decrypt data on the fly using block encryption. Also add unit tests to verify correct decryption behavior.
* Add `make_data_or_index_source` to the `storage` interface; implement it for `filesystem_storage`, which just creates a `data_source` from a file, and for `s3_storage`, which creates a (maybe) decrypting source from the s3 make_download_source. This change should improve performance when reading large objects from S3 and should not affect the `filesystem_storage` path at all.

Fixes: https://github.com/scylladb/scylladb/issues/22458

- (cherry picked from commit 211daeaa40)

- (cherry picked from commit 7e5e3c5569)

- (cherry picked from commit 0de61f56a2)

- (cherry picked from commit 8ac2978239)

- (cherry picked from commit dff9a229a7)

- (cherry picked from commit 8d49bb8af2)

Parent PR: #23695

Closes scylladb/scylladb#25016

* github.com:scylladb/scylladb:
  sstables: Start using `make_data_or_index_source` in `sstable`
  sstables: refactor readers and sources to use coroutines
  sstables: coroutinize futurized readers
  sstables: add `make_data_or_index_source` to the `storage`
  encryption: refactor key retrieval
  encryption: add `encrypted_data_source` class
2025-07-21 18:05:53 +03:00
Piotr Dulikowski
fdfcd67a6e Merge '[Backport 2025.3] cdc: Forbid altering columns of CDC log tables directly' from Scylladb[bot]
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

Backport: we should backport the change to all affected
branches to prevent the consequences that may affect the user.

- (cherry picked from commit 20d0050f4e)

- (cherry picked from commit 59800b1d66)

Parent PR: #25008

Closes scylladb/scylladb#25108

* github.com:scylladb/scylladb:
  cdc: Forbid altering columns of inactive CDC log table
  cdc: Forbid altering columns of CDC log tables directly
2025-07-21 16:22:31 +02:00
Dawid Mędrek
dc6cb5cfad cdc: Forbid altering columns of inactive CDC log table
When CDC becomes disabled on the base table, the CDC log table
still exists (cf. scylladb/scylladb@adda43edc7).
If it continues to exist up to the point when CDC is re-enabled
on the base table, no new log table will be created -- instead,
the old log table will be *re-attached*.

Since we want to avoid situations when the definition of the log
table has become misaligned with the definition of the base table
due to actions of the user, we forbid modifying the set of columns
or renaming them in CDC log tables, even when they're inactive.

Validation tests are provided.

(cherry picked from commit 59800b1d66)
2025-07-21 11:43:49 +00:00
Dawid Mędrek
10a9ced4d1 cdc: Forbid altering columns of CDC log tables directly
The set of columns of a CDC log table should be managed automatically
by Scylla, and the user should not have the ability to manipulate them
directly. That could lead to disastrous consequences such as a
segmentation fault.

In this commit, we're restricting those operations. We also provide two
validation tests.

One of the existing tests had to be adjusted as it modified the type
of a column in a CDC log table. Since the test simply verifies that
the user has sufficient permissions to perform `ALTER TABLE` on the log
table, the test is still valid.

Fixes scylladb/scylladb#24643

(cherry picked from commit 20d0050f4e)
2025-07-21 11:43:49 +00:00
Ernest Zaslavsky
934359ea28 s3_client: parse multipart response XML defensively
Ensure robust handling of XML responses when initiating multipart
uploads. Check for the existence of required nodes before access,
and throw an exception if the XML is empty or malformed.

Refs: https://github.com/scylladb/scylladb/issues/24676

Closes scylladb/scylladb#24990

(cherry picked from commit 342e94261f)

Closes scylladb/scylladb#25057
2025-07-21 12:03:00 +02:00
Piotr Dulikowski
74d97711fd Merge '[Backport 2025.3] cdc: throw error if column doesn't exist' from Scylladb[bot]
In the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.

When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.

However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.

If such a thing happens, we fix the code to throw an exception instead
of crashing on a null pointer dereference. Currently we don't have a
safer approach to handle this, but this might change in the future. The
other alternative is dropping that data silently, which we prefer not
to do.

Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and starts writing values to the column
  before the ALTER is complete, or
* The user drops a column, and continues writing values to the column
  while it's being dropped.

Both cases might as well fail with an error because the column is not
found in the base table.

Fixes scylladb/scylladb#24952

Backport needed - simple fix for a node crash.

- (cherry picked from commit b336f282ae)

- (cherry picked from commit 86dfa6324f)

Parent PR: #24986

Closes scylladb/scylladb#25067

* github.com:scylladb/scylladb:
  test: cdc: add test_cdc_with_alter
  cdc: throw error if column doesn't exist
2025-07-21 11:18:06 +02:00
Jenkins Promoter
fc7a6b66e2 Update ScyllaDB version to: 2025.3.0-rc2 2025-07-20 15:44:21 +03:00
Michael Litvak
594ec7d66d test: cdc: add test_cdc_with_alter
Add a test that tests adding and dropping a column to a table with CDC
enabled while writing to it.

(cherry picked from commit 86dfa6324f)
2025-07-20 09:04:00 +02:00
Michael Litvak
338ff18dfe cdc: throw error if column doesn't exist
In the CDC log transformer, when creating a CDC mutation based on some
base table mutation, for each value of a base column we set the value in
the CDC column with the same name.

When looking up the column in the CDC schema by name, we may get a null
pointer if a column by that name is not found. This shouldn't happen
normally because the base schema and CDC schema should be compatible,
and for each base column there should be a CDC column with the same
name.

However, there are scenarios where the base schema and CDC schema are
incompatible for a short period of time when they are being altered.
When a base column is being added or dropped, we could get a base
mutation with this column set, and then the CDC transformer picks up the
latest CDC schema which doesn't have this column.

If such a thing happens, we fix the code to throw an exception instead
of crashing on a null pointer dereference. Currently we don't have a
safer approach to handle this, but this might change in the future. The
other alternative is dropping that data silently, which we prefer not
to do.

Throwing an error is acceptable because this scenario most likely
indicates this behavior by the user:
* The user adds a new column, and starts writing values to the column
  before the ALTER is complete, or
* The user drops a column, and continues writing values to the column
  while it's being dropped.

Both cases might as well fail with an error because the column is not
found in the base table.

Fixes scylladb/scylladb#24952

(cherry picked from commit b336f282ae)
2025-07-18 10:36:44 +00:00
Tomasz Grabiec
888e92c969 streaming: Avoid deadlock by running view checks in a separate scheduling group
This issue happens with removenode, when RBNO is disabled, so the range
streamer is used.

The deadlock happens in a scenario like this:
1. Start 3 nodes: {A, B, C}, RF=2
2. Node A is lost
3. removenode A
4. Both B and C gain ownership of ranges.
5. Streaming sessions are started with crossed directions: B->C, C->B

Readers created by sender side exhaust streaming semaphore on B and C.
Receiver side attempts to obtain a permit indirectly by calling
check_needs_view_update_path(), which reads local tables. That read is
blocked and times out, causing streaming to fail. The streaming writer
is already using a tracking-only permit.

To avoid that, run the query under a different scheduling group, which
translates to the system semaphore instead of the maintenance
semaphore, to break the dependency. The gossip group was chosen
because it shouldn't be contended and this change should not interfere
with it much.

Fixes: #24807
(cherry picked from commit dff2b01237)
2025-07-17 17:25:44 +00:00
Tomasz Grabiec
f424c773a4 service: migration_manager: Run group0 barrier in gossip scheduling group
Fixes two issues.

One is potential priority inversion. The barrier will be executed
using scheduling group of the first fiber which triggers it, the rest
will block waiting on it. For example, CQL statements which need to
sync the schema on replica side can block on the barrier triggered by
streaming. That's undesirable. This is theoretical, not proved in the
field.

The second problem is blocking the error path. This barrier is called
from the streaming error handling path. If the streaming concurrency
semaphore is exhausted, and streaming fails due to timeout on
obtaining the permit in check_needs_view_update_path(), the error path
will block too because it will also attempt to obtain the permit as
part of the group0 barrier. Running it in the gossip scheduling group
prevents this.

Fixes #24925

(cherry picked from commit ee2fa58bd6)
2025-07-17 17:25:44 +00:00
Piotr Dulikowski
e49b312be9 auth: fix crash when migration code runs parallel with raft upgrade
The functions password_authenticator::start and
standard_role_manager::start have a similar structure: they spawn a
fiber which invokes a callback that performs some migration until that
migration succeeds. Both handlers set a shared promise called
_superuser_created_promise (those are actually two promises, one for the
password authenticator and the other for the role manager).

The handlers are similar in both cases. They check if auth is in legacy
mode, and behave differently depending on that. If in legacy mode, the
promise is set (if it was not set before), and some legacy migration
actions follow. In auth-on-raft mode, the superuser is attempted to be
created, and if it succeeds then the promise is _unconditionally_ set.

While it makes sense at a glance to set the promise unconditionally,
there is a non-obvious corner case during upgrade to topology on raft.
During the upgrade, auth switches from the legacy mode to auth on raft
mode. Thus, if the callback didn't succeed in legacy mode and then tries
to run in auth-on-raft mode and succeeds, it will unconditionally set a
promise that was already set - this is a bug and triggers an assertion
in seastar.

Fix the issue by surrounding the `shared_promise::set_value` call with
an `if` - like it is already done for the legacy case.

Fixes: scylladb/scylladb#24975

Closes scylladb/scylladb#24976

(cherry picked from commit a14b7f71fe)

Closes scylladb/scylladb#25019
2025-07-17 13:32:35 +02:00
Ernest Zaslavsky
549d139e84 sstables: Start using make_data_or_index_source in sstable
Convert all necessary methods to be awaitable. Start using `make_data_or_index_source`
when creating data_source for data and index components.

For compressed/checksummed input streams to work properly, start passing
stream creator functors to `make_(checksummed/compressed)_file_(k_l/m)_format_input_stream`.

(cherry picked from commit 8d49bb8af2)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
4a47262167 sstables: refactor readers and sources to use coroutines
Refactor readers and sources to support coroutine usage in
preparation for integration with `make_data_or_index_source`.
Move coroutine-based member initialization out of constructors
where applicable, and defer initialization until first use.

(cherry picked from commit dff9a229a7)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
81d356315b sstables: coroutinize futurized readers
Coroutinize futurized readers and sources to get ready for using `make_data_or_index_source` in `sstable`

(cherry picked from commit 8ac2978239)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
4ffd72e597 sstables: add make_data_or_index_source to the storage
Add `make_data_or_index_source` to the `storage` interface. Implement it
for `filesystem_storage`, which simply creates a `data_source` from a
file, and for `s3_storage`, which creates a (possibly decrypting) source
from the S3 client's `make_download_source`.

This change should improve performance when reading large objects
from S3 and should not affect anything for `filesystem_storage`.

(cherry picked from commit 0de61f56a2)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
8998f221ab encryption: refactor key retrieval
Get the encryption schema extension retrieval code out of
`wrap_file` method to make it reusable elsewhere

(cherry picked from commit 7e5e3c5569)
2025-07-16 12:45:58 +00:00
Ernest Zaslavsky
243ba1fb66 encryption: add encrypted_data_source class
Introduce the `encrypted_data_source` class that wraps an existing data
source to read and decrypt data on the fly using block encryption. Also add
unit tests to verify correct decryption behavior.
NOTE: The wrapped source MUST read from offset 0; `encrypted_data_source` assumes it does.

Co-authored-by: Calle Wilund <calle@scylladb.com>
(cherry picked from commit 211daeaa40)
2025-07-16 12:45:58 +00:00
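The wrapping idea described above can be sketched as follows. This is a hypothetical illustration, not the actual `encrypted_data_source` code: names, chunk handling, and the toy XOR "cipher" (standing in for real block encryption such as AES) are all assumptions.

```python
BLOCK_SIZE = 16

def toy_decrypt_block(block: bytes, key: bytes) -> bytes:
    # Stand-in for a real block cipher; XOR is its own inverse.
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(block))

def toy_encrypt_block(block: bytes, key: bytes) -> bytes:
    return toy_decrypt_block(block, key)  # XOR is symmetric

class EncryptedDataSource:
    """Wraps a ciphertext byte stream (which must start at offset 0)
    and yields plaintext, decrypting block by block on the fly."""

    def __init__(self, underlying, key: bytes):
        self._underlying = underlying  # iterator over ciphertext chunks
        self._key = key
        self._buf = b""

    def get(self) -> bytes:
        # Accumulate ciphertext until we have at least one full block.
        while len(self._buf) < BLOCK_SIZE:
            chunk = next(self._underlying, b"")
            if not chunk:  # EOF: remainder must be block-aligned
                break
            self._buf += chunk
        block, self._buf = self._buf[:BLOCK_SIZE], self._buf[BLOCK_SIZE:]
        return toy_decrypt_block(block, self._key) if block else b""

key = b"0123456789abcdef"
plaintext = b"sixteen-byte-msg" * 2
ciphertext = b"".join(
    toy_encrypt_block(plaintext[i:i + BLOCK_SIZE], key)
    for i in range(0, len(plaintext), BLOCK_SIZE)
)
# Chunk boundaries deliberately misaligned with cipher blocks.
src = EncryptedDataSource(iter([ciphertext[:20], ciphertext[20:]]), key)
recovered = b""
while (block := src.get()):
    recovered += block
```

The roundtrip works even when the underlying chunks don't align with cipher-block boundaries, which is the essential property such a wrapper needs.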
Patryk Jędrzejczak
7caacf958b test: test_zero_token_nodes_multidc: properly handle reads with CL=ONE
The test could fail with RF={DC1: 2, DC2: 0} and CL=ONE when:
- both writes succeeded with the same replica responding first,
- one of the following reads succeeded with the other replica
  responding before it applied mutations from any of the writes.

We fix the test by not expecting reads with CL=ONE to return a row.

We also harden the test by inserting different rows for every pair
(CL, coordinator), where one of the two coordinators is a normal
node from DC1, and the other one is a zero-token node from DC2.
This change makes sure that, for example, every write really
inserts a row.

Fixes scylladb/scylladb#22967

The fix addresses CI flakiness and only changes the test, so it
should be backported.

Closes scylladb/scylladb#23518

(cherry picked from commit 21edec1ace)

Closes scylladb/scylladb#24985
2025-07-15 15:47:43 +02:00
Botond Dénes
489e4fdb4e Merge '[Backport 2025.3] S3 chunked download source bug fixes' from Scylladb[bot]
- Fix missing negation in the `if` in the background downloading fiber
- Add test to catch this case
- Improve the s3 proxy to inject errors if the same resource requested more than once
- Suppress client retry since retrying the same request when each produces multiple buffers may lead to the same data appearing more than once in the buffer deque
- Inject exception from the test to simulate response callback failure in the middle

No need to backport anything since this class is not used yet

- (cherry picked from commit f1d0690194)

- (cherry picked from commit e73b83e039)

- (cherry picked from commit 6d9cec558a)

- (cherry picked from commit ec59fcd5e4)

- (cherry picked from commit c75acd274c)

- (cherry picked from commit d2d69cbc8c)

- (cherry picked from commit e50f247bf1)

- (cherry picked from commit 49e8c14a86)

- (cherry picked from commit a5246bbe53)

- (cherry picked from commit acf15eba8e)

Parent PR: #24657

Closes scylladb/scylladb#24943

* github.com:scylladb/scylladb:
  s3_test: Add s3_client test for non-retryable error handling
  s3_test: Add trace logging for default_retry_strategy
  s3_client: Fix edge case when the range is exhausted
  s3_client: Fix indentation in try..catch block
  s3_client: Stop retries in chunked download source
  s3_client: Enhance test coverage for retry logic
  s3_client: Add test for Content-Range fix
  s3_client: Fix missing negation
  s3_client: Refine logging
  s3_client: Improve logging placement for current_range output
2025-07-15 15:28:48 +03:00
Michael Litvak
26738588db tablets: stop storage group on deallocation
When a tablet transitions to a post-cleanup stage on the leaving replica
we deallocate its storage group. Before the storage can be deallocated
and destroyed, we must make sure it's cleaned up and stopped properly.

Normally this happens during the tablet cleanup stage, when
table::cleanup_table is called, so by the time we transition to the next
stage the storage group is already stopped.

However, it's possible that tablet cleanup did not run in some scenario:
1. The topology coordinator runs tablet cleanup on the leaving replica.
2. The leaving replica is restarted.
3. When the leaving replica starts, still in `cleanup` stage, it
   allocates a storage group for the tablet.
4. The topology coordinator moves to the next stage.
5. The leaving replica deallocates the storage group, but it was not
   stopped.

To address this scenario, we always stop the storage group when
deallocating it. Usually it will be already stopped and complete
immediately, and otherwise it will be stopped in the background.

Fixes scylladb/scylladb#24857
Fixes scylladb/scylladb#24828

Closes scylladb/scylladb#24896

(cherry picked from commit fa24fd7cc3)

Closes scylladb/scylladb#24909
2025-07-15 13:14:35 +03:00
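The "always stop before deallocating" pattern above can be sketched like this; a minimal illustration with hypothetical names, not the actual storage group code. The key point is that `stop()` is idempotent, so the common case (already stopped during tablet cleanup) completes immediately, while the restart corner case still gets a proper stop.

```python
class StorageGroup:
    def __init__(self):
        self.stopped = False
        self.deallocated = False

    def stop(self):
        if self.stopped:
            return  # already stopped during tablet cleanup: no-op
        # ... flush and close resources here ...
        self.stopped = True

    def deallocate(self):
        # Always stop first, even if cleanup never ran (e.g. after restart).
        self.stop()
        self.deallocated = True

g = StorageGroup()
g.deallocate()  # cleanup never ran: stop happens here instead
```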
Aleksandra Martyniuk
f69f59afbd repair: Reduce max row buf size when small table optimization is on
If small_table_optimization is on, a repair works on a whole table
simultaneously. It may be distributed across the whole cluster and
all nodes might participate in repair.

On a repair master, row buffer is copied for each repair peer.
This means that the memory scales with the number of peers.

In large clusters, repair with small_table_optimization leads to OOM.

Divide the max_row_buf_size by the number of repair peers if
small_table_optimization is on.

Use max_row_buf_size to calculate number of units taken from mem_sem.

Fixes: https://github.com/scylladb/scylladb/issues/22244.

Closes scylladb/scylladb#24868

(cherry picked from commit 17272c2f3b)

Closes scylladb/scylladb#24907
2025-07-15 13:13:49 +03:00
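The sizing logic described above amounts to simple arithmetic: with `small_table_optimization` the row buffer is replicated per repair peer, so dividing the budget by the peer count keeps the total memory roughly constant. A sketch with illustrative names, not the actual repair code:

```python
def effective_max_row_buf_size(max_row_buf_size: int,
                               n_peers: int,
                               small_table_optimization: bool) -> int:
    # With the optimization on, the row buffer is copied per peer,
    # so divide the budget to keep total memory ~max_row_buf_size.
    if small_table_optimization and n_peers > 0:
        return max(1, max_row_buf_size // n_peers)
    return max_row_buf_size

# 32 MiB budget, 8 peers: 4 MiB per-peer buffer instead of 8 * 32 MiB total.
per_peer = effective_max_row_buf_size(32 * 1024 * 1024, 8, True)
```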
Łukasz Paszkowski
e1e0c721e7 test.py: Fix test_compactionhistory_rows_merged_time_window_compaction_strategy
The test had three major problems:
1. Wrongly computed time windows. Data was not spread across two
   1-minute windows, causing the test to generate three sstables
   instead of two.
2. The timestamp was not propagated to the prepared CQL statements, so
   in fact the current time was used implicitly.
3. Because of the incorrect timestamp issue, the remaining tests
   testing purged tombstones were affected as well.

Fixes https://github.com/scylladb/scylladb/issues/24532

Closes scylladb/scylladb#24609

(cherry picked from commit a22d1034af)

Closes scylladb/scylladb#24791
2025-07-15 13:12:39 +03:00
Yaron Kaikov
05a6d4da23 dist/common/scripts/scylla_sysconfig_setup: fix SyntaxWarning: invalid escape sequence
There are invalid escape sequence warnings where raw strings should be used for the regex patterns

Fixes: https://github.com/scylladb/scylladb/issues/24915

Closes scylladb/scylladb#24916

(cherry picked from commit fdcaa9a7e7)

Closes scylladb/scylladb#24970
2025-07-15 11:01:28 +02:00
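The class of fix above is the standard one for this warning: `"\d"` in a plain string literal is an invalid escape sequence (a `SyntaxWarning` on recent Pythons), while a raw string passes the backslash through to the regex engine unmodified. The pattern below is illustrative, not the actual one from `scylla_sysconfig_setup`:

```python
import re

# Raw string: the backslash reaches re.compile intact, no SyntaxWarning.
pattern = re.compile(r"cpu(\d+)")

m = pattern.search("isolcpus=cpu3")
```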
Yaron Kaikov
1e1aeed3cd auto-backport.py: Avoid bot push to existing backport branches
Changed the backport logic so that the bot only pushes the backport branch if it does not already exist in the remote fork.
If the branch exists, the bot skips the push, allowing only users to update (force-push) the branch after the backport PR is open.

Fixes: https://github.com/scylladb/scylladb/issues/24953

Closes scylladb/scylladb#24954

(cherry picked from commit ed7c7784e4)

Closes scylladb/scylladb#24969
2025-07-15 10:25:30 +02:00
Jenkins Promoter
af10d6f03b Update pgo profiles - aarch64 2025-07-15 05:21:25 +03:00
Jenkins Promoter
0d3742227d Update pgo profiles - x86_64 2025-07-15 04:58:36 +03:00
Yaron Kaikov
c6987e3fed packaging: add ps command to dependencies
ScyllaDB container image doesn't have ps command installed, while this command is used by perftune.py script shipped within the same image. This breaks node and container tuning in Scylla Operator.

Fixes: #24827

Closes scylladb/scylladb#24830

(cherry picked from commit 66ff6ab6f9)

Closes scylladb/scylladb#24956
2025-07-14 14:19:17 +03:00
Ernest Zaslavsky
873c8503cd s3_test: Add s3_client test for non-retryable error handling
Introduce a test that injects a non-retryable error and verifies
that the chunked download source throws an exception as expected.

(cherry picked from commit acf15eba8e)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
dbf4bd162e s3_test: Add trace logging for default_retry_strategy
Introduce trace-level logging for `default_retry_strategy` in
`s3_test` to improve visibility into retry logic during test
execution.

(cherry picked from commit a5246bbe53)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
7f303bfda3 s3_client: Fix edge case when the range is exhausted
Handle case where the download loop exits after consuming all data,
but before receiving an empty buffer signaling EOF. Without this, the
next request is sent with a non-zero offset and zero length, resulting
in "Range request cannot be satisfied" errors. Now, an empty buffer is
pushed to indicate completion and exit the fiber properly.

(cherry picked from commit 49e8c14a86)
2025-07-13 13:17:14 +00:00
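The exhausted-range edge case can be sketched as follows (hypothetical shapes, not the actual `chunked_download_source` code): once the offset reaches the object length, push an empty buffer to signal EOF instead of issuing a zero-length range request, which an S3-style server rejects with "Range request cannot be satisfied".

```python
def chunked_download(object_data: bytes, chunk: int):
    offset = 0
    while True:
        if offset >= len(object_data):
            yield b""  # EOF marker: no further range requests are sent
            return
        # Simulated GET with "Range: bytes=offset-(offset+chunk-1)"
        buf = object_data[offset:offset + chunk]
        offset += len(buf)
        yield buf

chunks = list(chunked_download(b"0123456789", 4))
```

Without the explicit EOF marker, the loop would issue one more request with a non-zero offset and zero remaining length, triggering the error described above.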
Ernest Zaslavsky
22739df69f s3_client: Fix indentation in try..catch block
Correct indentation in the `try..catch` block to improve code
readability and maintain consistent formatting.

(cherry picked from commit e50f247bf1)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
54db6ca088 s3_client: Stop retries in chunked download source
Disable retries for S3 requests in the chunked download source to
prevent duplicate chunks from corrupting the buffer queue. The
response handler now throws an exception to bypass the retry
strategy, allowing the next range to be attempted cleanly.

This exception is only triggered for retryable errors; unretryable
ones immediately halt further requests.

(cherry picked from commit d2d69cbc8c)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
c841ffe398 s3_client: Enhance test coverage for retry logic
Extend the S3 proxy to support error injection when the client
makes multiple requests to the same resource, which is useful for
testing retry behavior and failure handling.

(cherry picked from commit c75acd274c)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
c748a97170 s3_client: Add test for Content-Range fix
Introduce a test that accurately verifies the Content-Range
behavior, ensuring the previous fix is properly validated.

(cherry picked from commit ec59fcd5e4)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
00f10e7f1d s3_client: Fix missing negation
Restore a missing `not` in a conditional check that caused
incorrect behavior during S3 client execution.

(cherry picked from commit 6d9cec558a)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
4cd1792528 s3_client: Refine logging
Fix typo in log message to improve clarity and accuracy during
S3 operations.

(cherry picked from commit e73b83e039)
2025-07-13 13:17:14 +00:00
Ernest Zaslavsky
115e8c85e4 s3_client: Improve logging placement for current_range output
Relocated logging to occur after determining the `current_range`,
ensuring more relevant output during S3 client operations.

(cherry picked from commit f1d0690194)
2025-07-13 13:17:14 +00:00
Gleb Natapov
087d3bb957 api: unregister raft_topology_get_cmd_status on shutdown
In c8ce9d1c60 we introduced
raft_topology_get_cmd_status REST api but the commit forgot to
unregister the handler during shutdown.

Fixes #24910

Closes scylladb/scylladb#24911

(cherry picked from commit 89f2edf308)

Closes scylladb/scylladb#24923
2025-07-13 15:15:52 +03:00
Avi Kivity
f3297824e3 Revert "config: decrease default large allocation warning threshold to 128k"
This reverts commit 04fb2c026d. 2025.3 got
the reduced threshold, but won't get many of the fixes the warning will
generate, leaving it very noisy. Better to avoid the noise for this release.

Fixes #24384.
2025-07-10 14:12:14 +03:00
Avi Kivity
4eb220d3ab service: tablet_allocator: avoid large contiguous vector in make_repair_plan()
make_repair_plan() allocates a temporary vector which can grow larger
than our 128k basic allocation unit. Use a chunked vector to avoid
stalls due to large allocations.

Fixes #24713.

Closes scylladb/scylladb#24801

(cherry picked from commit 0138afa63b)

Closes scylladb/scylladb#24902
2025-07-10 12:41:35 +03:00
Patryk Jędrzejczak
c9de7d68f2 Merge '[Backport 2025.3] Make it easier to debug stuck raft topology operation.' from Scylladb[bot]
The series adds more logging and provides a new REST API around topology command rpc execution, to allow easier debugging of stuck topology operations.

Backport since we want to have it in production as quickly as possible.

Fixes #24860

- (cherry picked from commit c8ce9d1c60)

- (cherry picked from commit 4e6369f35b)

Parent PR: #24799

Closes scylladb/scylladb#24881

* https://github.com/scylladb/scylladb:
  topology coordinator: log a start and an end of topology coordinator command execution at info level
  topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
2025-07-09 12:55:48 +02:00
Piotr Dulikowski
b535f44db2 Merge '[Backport 2025.3] batchlog_manager: abort replay of a failed batch on shutdown or node down' from Scylladb[bot]
When replaying a failed batch and sending the mutation to all replicas, make the write response handler cancellable and abort it on shutdown or if some target is marked down. Also set a reasonable timeout so it gets aborted if it's stuck for some other unexpected reason.

Previously, the write response handler was not cancellable and had no timeout. This could cause a scenario where some write operation by the batchlog manager is stuck indefinitely, and node shutdown gets stuck as well because it waits for the batchlog manager to complete, without aborting the operation.

backport to relevant versions since the issue can cause node shutdown to hang

Fixes scylladb/scylladb#24599

- (cherry picked from commit 8d48b27062)

- (cherry picked from commit fc5ba4a1ea)

- (cherry picked from commit 7150632cf2)

- (cherry picked from commit 74a3fa9671)

- (cherry picked from commit a9b476e057)

- (cherry picked from commit d7af26a437)

Parent PR: #24595

Closes scylladb/scylladb#24882

* github.com:scylladb/scylladb:
  test: test_batchlog_manager: batchlog replay includes cdc
  test: test_batchlog_manager: test batch replay when a node is down
  batchlog_manager: set timeout on writes
  batchlog_manager: abort writes on shutdown
  batchlog_manager: create cancellable write response handler
  storage_proxy: add write type parameter to mutate_internal
2025-07-08 12:35:55 +02:00
Michael Litvak
ec1dd1bf31 test: test_batchlog_manager: batchlog replay includes cdc
Add a new test that verifies that when replaying batch mutations from
the batchlog, the mutations include cdc augmentation if needed.

This is done in order to verify that it works currently as expected and
doesn't break in the future.

(cherry picked from commit d7af26a437)
2025-07-08 06:25:36 +00:00
Michael Litvak
7b30f487dd test: test_batchlog_manager: test batch replay when a node is down
Add a test of the batchlog manager replay loop applying failed batches
while some replica is down.

The test reproduces an issue where the batchlog manager tries to replay
a failed batch, doesn't get a response from some replica, and becomes
stuck.

It verifies that the batchlog manager can eventually recover from this
situation and continue applying failed batches.

(cherry picked from commit a9b476e057)
2025-07-08 06:25:36 +00:00
Michael Litvak
c3c489d3d4 batchlog_manager: set timeout on writes
Set a timeout on writes of replayed batches by the batchlog manager.

We want to avoid having infinite timeout for the writes in case it gets
stuck for some unexpected reason.

The timeout is set to be high enough to allow any reasonable write to
complete.

(cherry picked from commit 74a3fa9671)
2025-07-08 06:25:36 +00:00
Michael Litvak
6fb6bb8dc7 batchlog_manager: abort writes on shutdown
On shutdown of batchlog manager, abort all writes of replayed batches
by the batchlog manager.

To achieve this we set the appropriate write_type to BATCH, and on
shutdown cancel all write handlers with this type.

(cherry picked from commit 7150632cf2)
2025-07-08 06:25:36 +00:00
Michael Litvak
02c038efa8 batchlog_manager: create cancellable write response handler
When replaying a batch mutation from the batchlog manager and sending it
to all replicas, create the write response handler as cancellable.

To achieve this we define a new wrapper type for batchlog mutations -
batchlog_replay_mutation, and this allows us to overload
create_write_response_handler for this type. This is similar to how it's
done with hint_wrapper and read_repair_mutation.

(cherry picked from commit fc5ba4a1ea)
2025-07-08 06:25:36 +00:00
Michael Litvak
d3175671b7 storage_proxy: add write type parameter to mutate_internal
Currently mutate_internal has a boolean parameter `counter_write` that
indicates whether the write is of counter type or not.

We replace it with a more general parameter that allows indicating the
write type.

It is compatible with the previous behavior - for a counter write, the
type COUNTER is passed, and otherwise a default value will be used
as before.

(cherry picked from commit 8d48b27062)
2025-07-08 06:25:36 +00:00
Gleb Natapov
4651c44747 topology coordinator: log a start and an end of topology coordinator command execution at info level
Those calls are relatively rare and the output may help to analyze issues
in production.

(cherry picked from commit 4e6369f35b)
2025-07-08 06:24:22 +00:00
Gleb Natapov
0e67f6f6c2 topology coordinator: add REST endpoint to query the status of ongoing topology cmd rpc
The topology coordinator executes several topology cmd rpc against some nodes
during a topology change. A topology operation will not proceed unless
rpc completes (successfully or not), but sometimes it appears that it
hangs and it is hard to tell on which nodes it did not complete yet.
Introduce new REST endpoint that can help with debugging such cases.
If executed on the topology coordinator it returns currently running
topology rpc (if any) and a list of nodes that did not reply yet.

(cherry picked from commit c8ce9d1c60)
2025-07-08 06:24:21 +00:00
Avi Kivity
859d9dd3b1 Merge '[Backport 2025.3] Improve background disposal of tablet_metadata' from Scylladb[bot]
As seen in #23284, when the tablet_metadata contains many tables, even empty ones,
we're seeing a long queue of seastar tasks coming from the individual destruction of
`tablet_map_ptr = foreign_ptr<lw_shared_ptr<const tablet_map>>`.

This change improves `tablet_metadata::clear_gently` to destroy the `tablet_map_ptr` objects
on their owner shard by sorting them into per-owner-shard vectors.

Also, background call to clear_gently was added to `~token_metadata`, as it is destroyed
arbitrarily when automatic token_metadata_ptr variables go out of scope, so that the
contained tablet_metadata would be cleared gently.

Finally, a unit test was added to reproduce the `Too long queue accumulated for gossip` symptom
and verify that it is gone with this change.

Fixes #24814
Refs #23284

This change is not marked as fixing the issue since we still need to verify that there is no impact on query performance, reactor stalls, or large allocations, with a large number of tablet-based tables.

* Since the issue exists in 2025.1, requesting backport to 2025.1 and upwards

- (cherry picked from commit 3acca0aa63)

- (cherry picked from commit 493a2303da)

- (cherry picked from commit e0a19b981a)

- (cherry picked from commit 2b2cfaba6e)

- (cherry picked from commit 2c0bafb934)

- (cherry picked from commit 4a3d14a031)

- (cherry picked from commit 6e4803a750)

Parent PR: #24618

Closes scylladb/scylladb#24864

* github.com:scylladb/scylladb:
  token_metadata_impl: clear_gently: release version tracker early
  test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables
  token_metadata: clear_and_destroy_impl when destroyed
  token_metadata: keep a reference to shared_token_metadata
  token_metadata: move make_token_metadata_ptr into shared_token_metadata class
  replica: database: get and expose a mutable locator::shared_token_metadata
  locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
2025-07-07 14:02:19 +03:00
Gleb Natapov
a25bd068bf topology coordinator: do not set request_type field for truncation command if topology_global_request_queue feature is not enabled yet
Old nodes do not expect global topology request names to be in the
request_type field, so set it only if the cluster is fully upgraded
already.

Closes scylladb/scylladb#24731

(cherry picked from commit ca7837550d)

Closes scylladb/scylladb#24833
2025-07-07 11:50:55 +02:00
Benny Halevy
9bc487e79e token_metadata_impl: clear_gently: release version tracker early
No need to wait for all members to be cleared gently.
We can release the version earlier since the
held version may be awaited in barriers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 6e4803a750)
2025-07-07 09:42:29 +03:00
Benny Halevy
41dc86ffa8 test: cluster: test_tablets_merge: add test_tablet_split_merge_with_many_tables
Reproduces #23284

Currently skipped in release mode since it requires
the `short_tablet_stats_refresh_interval` interval.
Ref #24641

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 4a3d14a031)
2025-07-07 09:42:26 +03:00
Benny Halevy
f78a352a29 token_metadata: clear_and_destroy_impl when destroyed
We have a lot of places in the code where
a token_metadata_ptr is kept in an automatic
variable and destroyed when it leaves the scope.
Since it's a reference-counted lw_shared_ptr,
the token_metadata object is rarely destroyed in
those cases, but when it is, it doesn't go through
clear_gently, and in particular its tablet_metadata
is not cleared gently, leading to inefficient destruction
of potentially many foreign_ptr:s.

This patch calls clear_and_destroy_impl that gently
clears and destroys the impl object in the background
using the shared_token_metadata.

Fixes #13381

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2c0bafb934)
2025-07-07 09:38:17 +03:00
Benny Halevy
b647dbd547 token_metadata: keep a reference to shared_token_metadata
To be used by a following patch to gently clear and destroy
the token_metadata_impl in the background.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2b2cfaba6e)
2025-07-07 09:34:10 +03:00
Benny Halevy
0e7d3b4eb9 token_metadata: move make_token_metadata_ptr into shared_token_metadata class
So we can use the local shared_token_metadata instance
for safe background destroy of token_metadata_impl:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit e0a19b981a)
2025-07-07 09:30:01 +03:00
Benny Halevy
c8043e05c1 replica: database: get and expose a mutable locator::shared_token_metadata
Prepare for the next patch, which will use this shared_token_metadata
to make mutable_token_metadata_ptr:s.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 493a2303da)
2025-07-07 09:27:06 +03:00
Benny Halevy
54fb9ed03b locator: tablets: tablet_metadata: clear_gently: optimize foreign ptr destruction
Sort all tablet_map_ptr:s by shard_id
and then destroy them on each shard to prevent
long cross-shard task queues for foreign_ptr destructions.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 3acca0aa63)
2025-07-07 09:27:01 +03:00
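The per-shard disposal above can be sketched conceptually (illustrative names; `run_on_shard` stands in for seastar's `smp::submit_to`): instead of destroying each foreign pointer individually, each creating its own cross-shard task, group them by owner shard and dispose of each group with a single hop.

```python
from collections import defaultdict

def group_by_owner_shard(ptrs):
    # ptrs: iterable of (owner_shard, ptr) pairs
    groups = defaultdict(list)
    for owner_shard, ptr in ptrs:
        groups[owner_shard].append(ptr)
    return groups

def dispose_gently(ptrs, run_on_shard):
    # One cross-shard task per shard, instead of one per pointer.
    for shard, group in group_by_owner_shard(ptrs).items():
        run_on_shard(shard, lambda g=group: g.clear())

hops = []
ptrs = [(0, "a"), (1, "b"), (0, "c"), (1, "d"), (2, "e")]
dispose_gently(ptrs, lambda shard, fn: (hops.append(shard), fn()))
```

Five pointers spread over three shards cost three hops here, rather than five queued tasks, which is what shortens the "Too long queue accumulated for gossip" backlog described above.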
Avi Kivity
f60c54df77 storage_proxy: avoid large allocation when storing batch in system.batchlog
Currently, when computing the mutation to be stored in system.batchlog,
we go through data_value. In turn this goes through `bytes` type
(#24810), so it causes a large contiguous allocation if the batch is
large.

Fix by going through the more primitive, but less contiguous,
atomic_cell API.

Fixes #24809.

Closes scylladb/scylladb#24811

(cherry picked from commit 60f407bff4)

Closes scylladb/scylladb#24846
2025-07-05 00:37:09 +03:00
Patryk Jędrzejczak
f1ec51133e docs: handling-node-failures: fix typo
Replacing "from" is incorrect. The typo comes from recently
merged #24583.

Fixes #24732

Requires backport to 2025.2 since #24583 has been backported to 2025.2.

Closes scylladb/scylladb#24733

(cherry picked from commit fa982f5579)

Closes scylladb/scylladb#24832
2025-07-04 19:35:00 +02:00
Jenkins Promoter
648fe6a4e8 Update ScyllaDB version to: 2025.3.0-rc1 2025-07-03 11:35:01 +03:00
Michał Chojnowski
1bd536a228 utils/alien_worker: fix a data race in submit()
We move a `seastar::promise` on the external worker thread,
after the matching `seastar::future` was returned to the shard.

That's illegal. If the `promise` move occurs concurrently with some
operation (move, await) on the `future`, it becomes a data race
which could cause various kinds of corruption.

This patch fixes that by keeping the promise at a stable address
on the shard (inside a coroutine frame) and only passing through
the worker.

Fixes #24751

Closes scylladb/scylladb#24752

(cherry picked from commit a29724479a)

Closes scylladb/scylladb#24780
2025-07-03 10:45:51 +03:00
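The shape of the fix can be illustrated with Python threads (a loose analogy; the real code uses seastar promise/future semantics): the completion object lives at a stable location owned by the submitting side, and the worker thread only signals it, never moves or takes ownership of it.

```python
import threading

def submit(work):
    # The completion object stays at a stable address owned by the
    # caller; the worker only signals it, avoiding any concurrent
    # move of the promise-like object.
    done = threading.Event()
    result = {}

    def run():
        result["value"] = work()
        done.set()  # signal only; no transfer of the Event itself

    threading.Thread(target=run).start()
    done.wait()
    return result["value"]
```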
Avi Kivity
d5b11098e8 repair: row_level: unstall to_repair_rows_on_wire() destroying its input
to_repair_rows_on_wire() moves the contents of its input std::list
and is careful to yield after each element, but the final destruction
of the input list still deals with all of the list elements without
yielding. This is expensive as not all contents of repair_row are moved
(_dk_with_hash is of type lw_shared_ptr<const decorated_key_with_hash>).

To fix, destroy each row element as we move along. This is safe as we
own the input and don't reference row_list other than for the iteration.

Fixes #24725.

Closes scylladb/scylladb#24726

(cherry picked from commit 6aa71205d8)

Closes scylladb/scylladb#24771
2025-07-03 10:44:58 +03:00
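The fix above can be sketched as follows (hypothetical names; a `deque` stands in for `std::list`): consume and drop each element as it is converted, so the final destruction of the input is O(1) instead of destroying every element in one non-yielding run.

```python
from collections import deque

def to_rows_on_wire(row_list: deque):
    out = []
    while row_list:
        row = row_list.popleft()  # element destroyed as we move along
        out.append(row.upper())   # stand-in for the per-row conversion
        # in the real (seastar) code, a yield point goes here
    return out

rows = deque(["a", "b", "c"])
result = to_rows_on_wire(rows)
```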
Tomasz Grabiec
775916132e Merge '[Backport 2025.3] repair: postpone repair until topology is not busy ' from Scylladb[bot]
Currently, repair_service::repair_tablets starts repair if there
are no ongoing tablet operations. The check does not consider global
topology operations, like tablet resize finalization.

Hence, if:
- topology is in the tablet_resize_finalization state;
- repair starts (as there is no tablet transitions) and holds the erm;
- resize finalization finishes;

then the repair sees a topology state different from the actual one -
it does not see that the storage groups were already split.
Repair code does not handle this case and it results in
on_internal_error.

Start repair when topology is not busy. The check isn't atomic,
as it's done on shard 0. Thus, we compare the topology versions
to ensure that the busyness check is valid.

Fixes: https://github.com/scylladb/scylladb/issues/24195.

Needs backport to all branches since they are affected

- (cherry picked from commit df152d9824)

- (cherry picked from commit 83c9af9670)

Parent PR: #24202

Closes scylladb/scylladb#24783

* github.com:scylladb/scylladb:
  test: add test for repair and resize finalization
  repair: postpone repair until topology is not busy
2025-07-02 13:17:08 +02:00
Calle Wilund
46e3794bde encryption_at_rest_test: Add exception handler to ensure proxy stop
If a boost test somehow throws even in a test macro such as
BOOST_REQUIRE_THROW, we could end up not stopping the net proxy
used, causing a use-after-free.

(cherry picked from commit 8d37e5e24b)
2025-07-02 10:13:08 +00:00
Calle Wilund
b7a82898f0 encryption: Ensure stopping timers in provider cache objects
utils::loading_cache has a timer that can, if we're unlucky, be running
while the encryption context/extensions referencing the various host
objects containing them are destroyed in the case of unit testing.

Add a stop phase in encryption context shutdown closing the caches.

(cherry picked from commit ee98f5d361)
2025-07-02 10:13:08 +00:00
Jenkins Promoter
76bf279e0e Update pgo profiles - aarch64 2025-07-02 13:06:18 +03:00
Jenkins Promoter
61364624e3 Update pgo profiles - x86_64 2025-07-02 12:34:58 +03:00
Botond Dénes
6e6c00dcfe docs: cql/types.rst: remove reference to frozen-only UDTs
ScyllaDB supports non-frozen UDTs since 3.2, no need to keep referencing
this limitation in the current docs. Replace the description of the
limitation with a general description of frozen semantics for UDTs.

Fixes: #22929

Closes scylladb/scylladb#24763

(cherry picked from commit 37ef9efb4e)

Closes scylladb/scylladb#24784
2025-07-02 12:11:25 +03:00
Aleksandra Martyniuk
c26eb8ef14 test: add test for repair and resize finalization
Add a test that checks that repair does not start if there is an
ongoing resize finalization.

(cherry picked from commit 83c9af9670)
2025-07-01 20:26:53 +00:00
Aleksandra Martyniuk
8a1d09862e repair: postpone repair until topology is not busy
Currently, repair_service::repair_tablets starts repair if there
are no ongoing tablet operations.
topology operations, like tablet resize finalization. This may cause
a data race and unexpected behavior.

Start repair when topology is not busy.

(cherry picked from commit df152d9824)
2025-07-01 20:26:53 +00:00
Yaron Kaikov
e64bb3819c Update ScyllaDB version to: 2025.3.0-rc0 2025-07-01 10:34:39 +03:00
109 changed files with 2357 additions and 698 deletions

View File

@@ -112,10 +112,15 @@ def backport(repo, pr, version, commits, backport_base_branch, is_collaborator):
is_draft = True
repo_local.git.add(A=True)
repo_local.git.cherry_pick('--continue')
repo_local.git.push(fork_repo, new_branch_name, force=True)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft, is_collaborator)
# Check if the branch already exists in the remote fork
remote_refs = repo_local.git.ls_remote('--heads', fork_repo, new_branch_name)
if not remote_refs:
# Branch does not exist, create it with a regular push
repo_local.git.push(fork_repo, new_branch_name)
create_pull_request(repo, new_branch_name, backport_base_branch, pr, backport_pr_title, commits,
is_draft, is_collaborator)
else:
logging.info(f"Remote branch {new_branch_name} already exists in fork. Skipping push.")
except GitCommandError as e:
logging.warning(f"GitCommandError: {e}")

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2025.3.0-dev
VERSION=2025.3.0-rc2
if test -f version
then

View File

@@ -38,7 +38,6 @@
#include <optional>
#include "utils/assert.hh"
#include "utils/overloaded_functor.hh"
#include <seastar/json/json_elements.hh>
#include "collection_mutation.hh"
#include "schema/schema.hh"
#include "db/tags/extension.hh"
@@ -121,47 +120,50 @@ static lw_shared_ptr<stats> get_stats_from_schema(service::storage_proxy& sp, co
}
}
make_jsonable::make_jsonable(rjson::value&& value)
: _value(std::move(value))
{}
std::string make_jsonable::to_json() const {
return rjson::print(_value);
}
json::json_return_type make_streamed(rjson::value&& value) {
// CMH. json::json_return_type uses std::function, not noncopyable_function.
// Need to make a copyable version of value. Gah.
auto rs = make_shared<rjson::value>(std::move(value));
std::function<future<>(output_stream<char>&&)> func = [rs](output_stream<char>&& os) mutable -> future<> {
// move objects to coroutine frame.
auto los = std::move(os);
auto lrs = std::move(rs);
executor::body_writer make_streamed(rjson::value&& value) {
return [value = std::move(value)](output_stream<char>&& _out) mutable -> future<> {
auto out = std::move(_out);
std::exception_ptr ex;
try {
co_await rjson::print(*lrs, los);
co_await rjson::print(value, out);
} catch (...) {
// at this point, we cannot really do anything. HTTP headers and return code are
// already written, and quite potentially a portion of the content data.
// just log + rethrow. It is probably better the HTTP server closes connection
// abruptly or something...
ex = std::current_exception();
elogger.error("Exception during streaming HTTP response: {}", ex);
}
co_await los.close();
co_await rjson::destroy_gently(std::move(*lrs));
co_await out.close();
co_await rjson::destroy_gently(std::move(value));
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
co_return;
};
return func;
}
json_string::json_string(std::string&& value)
: _value(std::move(value))
{}
std::string json_string::to_json() const {
return _value;
// make_streamed_with_extra_array() is a variant of make_streamed() above, which
// builds a streaming response (a function writing to an output stream) from a
// JSON object (rjson::value) but adds to it at the end an additional array.
// The extra array is given a separate chunked_vector to avoid putting it
// inside the rjson::value - because RapidJSON does contiguous allocations for
// arrays which we want to avoid for potentially long arrays in Query/Scan
// responses (see #23535).
// If we ever fix RapidJSON to avoid contiguous allocations for arrays, or
// replace it entirely (#24458), we can remove this function and the function
// rjson::print_with_extra_array() which it calls.
executor::body_writer make_streamed_with_extra_array(rjson::value&& value,
std::string array_name, utils::chunked_vector<rjson::value>&& array) {
return [value = std::move(value), array_name = std::move(array_name), array = std::move(array)](output_stream<char>&& _out) mutable -> future<> {
auto out = std::move(_out);
std::exception_ptr ex;
try {
co_await rjson::print_with_extra_array(value, array_name, array, out);
} catch (...) {
ex = std::current_exception();
}
co_await out.close();
co_await rjson::destroy_gently(std::move(value));
// TODO: can/should we also destroy the array gently?
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
};
}
// This function throws api_error::validation if input value is not an object.
@@ -764,7 +766,7 @@ future<executor::request_return_type> executor::describe_table(client_state& cli
rjson::value response = rjson::empty_object();
rjson::add(response, "Table", std::move(table_description));
elogger.trace("returning {}", response);
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
// Check CQL's Role-Based Access Control (RBAC) permission_to_check (MODIFY,
@@ -881,7 +883,7 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
rjson::value response = rjson::empty_object();
rjson::add(response, "TableDescription", std::move(table_description));
elogger.trace("returning {}", response);
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
static data_type parse_key_type(std::string_view type) {
@@ -1165,7 +1167,7 @@ future<executor::request_return_type> executor::tag_resource(client_state& clien
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
});
co_return json_string("");
co_return ""; // empty response
}
future<executor::request_return_type> executor::untag_resource(client_state& client_state, service_permit permit, rjson::value request) {
@@ -1186,7 +1188,7 @@ future<executor::request_return_type> executor::untag_resource(client_state& cli
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
});
co_return json_string("");
co_return ""; // empty response
}
future<executor::request_return_type> executor::list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request) {
@@ -1212,7 +1214,7 @@ future<executor::request_return_type> executor::list_tags_of_resource(client_sta
rjson::push_back(tags, std::move(new_entry));
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
struct billing_mode_type {
@@ -1674,7 +1676,7 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
rjson::value status = rjson::empty_object();
executor::supplement_table_info(request, *schema, sp);
rjson::add(status, "TableDescription", std::move(request));
co_return make_jsonable(std::move(status));
co_return rjson::print(std::move(status));
}
future<executor::request_return_type> executor::create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request) {
@@ -1951,7 +1953,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
rjson::value status = rjson::empty_object();
supplement_table_info(request, *schema, p.local());
rjson::add(status, "TableDescription", std::move(request));
co_return make_jsonable(std::move(status));
co_return rjson::print(std::move(status));
});
}
@@ -2417,7 +2419,7 @@ static future<executor::request_return_type> rmw_operation_return(rjson::value&&
if (!attributes.IsNull()) {
rjson::add(ret, "Attributes", std::move(attributes));
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
static future<std::unique_ptr<rjson::value>> get_previous_item(
@@ -3009,7 +3011,7 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
rjson::add(ret, "ConsumedCapacity", std::move(consumed_capacity));
}
_stats.api_operations.batch_write_item_latency.mark(std::chrono::steady_clock::now() - start_time);
co_return make_jsonable(std::move(ret));
co_return rjson::print(std::move(ret));
}
static const std::string_view get_item_type_string(const rjson::value& v) {
@@ -4249,18 +4251,17 @@ future<executor::request_return_type> executor::get_item(client_state& client_st
verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "GetItem");
rcu_consumed_capacity_counter add_capacity(request, cl == db::consistency_level::LOCAL_QUORUM);
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::SELECT);
co_return co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), std::move(permit), client_state, trace_state)).then(
[per_table_stats, this, schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = std::move(attrs_to_get), start_time = std::move(start_time), add_capacity=std::move(add_capacity)] (service::storage_proxy::coordinator_query_result qr) mutable {
per_table_stats->api_operations.get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
_stats.api_operations.get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
uint64_t rcu_half_units = 0;
auto res = make_ready_future<executor::request_return_type>(make_jsonable(describe_item(schema, partition_slice, *selection, *qr.query_result, std::move(attrs_to_get), add_capacity, rcu_half_units)));
per_table_stats->rcu_half_units_total += rcu_half_units;
_stats.rcu_half_units_total += rcu_half_units;
return res;
});
service::storage_proxy::coordinator_query_result qr =
co_await _proxy.query(
schema, std::move(command), std::move(partition_ranges), cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), std::move(permit), client_state, trace_state));
per_table_stats->api_operations.get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
_stats.api_operations.get_item_latency.mark(std::chrono::steady_clock::now() - start_time);
uint64_t rcu_half_units = 0;
rjson::value res = describe_item(schema, partition_slice, *selection, *qr.query_result, std::move(attrs_to_get), add_capacity, rcu_half_units);
per_table_stats->rcu_half_units_total += rcu_half_units;
_stats.rcu_half_units_total += rcu_half_units;
co_return rjson::print(std::move(res));
}
static void check_big_object(const rjson::value& val, int& size_left);
@@ -4505,7 +4506,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
if (is_big(response)) {
co_return make_streamed(std::move(response));
} else {
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
}
@@ -4649,7 +4650,11 @@ class describe_items_visitor {
const filter& _filter;
typename columns_t::const_iterator _column_it;
rjson::value _item;
rjson::value _items;
// _items is a chunked_vector<rjson::value> instead of a RapidJson array
// (rjson::value) because unfortunately RapidJson arrays are stored
// contiguously in memory, and cause large allocations when a Query/Scan
// returns a long list of short items (issue #23535).
utils::chunked_vector<rjson::value> _items;
size_t _scanned_count;
public:
@@ -4659,7 +4664,6 @@ public:
, _filter(filter)
, _column_it(columns.begin())
, _item(rjson::empty_object())
, _items(rjson::empty_array())
, _scanned_count(0)
{
// _filter.check() may need additional attributes not listed in
@@ -4738,13 +4742,13 @@ public:
rjson::remove_member(_item, attr);
}
rjson::push_back(_items, std::move(_item));
_items.push_back(std::move(_item));
}
_item = rjson::empty_object();
++_scanned_count;
}
rjson::value get_items() && {
utils::chunked_vector<rjson::value> get_items() && {
return std::move(_items);
}
@@ -4753,13 +4757,25 @@ public:
}
};
static future<std::tuple<rjson::value, size_t>> describe_items(const cql3::selection::selection& selection, std::unique_ptr<cql3::result_set> result_set, std::optional<attrs_to_get>&& attrs_to_get, filter&& filter) {
// describe_items() returns a JSON object that includes members "Count"
// and "ScannedCount", but *not* "Items" - that is returned separately
// as a chunked_vector to avoid large contiguous allocations which
// RapidJSON does of its array. The caller should add "Items" to the
// returned JSON object if needed, or print it separately.
// The returned chunked_vector (the items) is std::optional<>, because
// the user may have requested only to count items, and not return any
// items - which is different from returning an empty list of items.
static future<std::tuple<rjson::value, std::optional<utils::chunked_vector<rjson::value>>, size_t>> describe_items(
const cql3::selection::selection& selection,
std::unique_ptr<cql3::result_set> result_set,
std::optional<attrs_to_get>&& attrs_to_get,
filter&& filter) {
describe_items_visitor visitor(selection.get_columns(), attrs_to_get, filter);
co_await result_set->visit_gently(visitor);
auto scanned_count = visitor.get_scanned_count();
rjson::value items = std::move(visitor).get_items();
utils::chunked_vector<rjson::value> items = std::move(visitor).get_items();
rjson::value items_descr = rjson::empty_object();
auto size = items.Size();
auto size = items.size();
rjson::add(items_descr, "Count", rjson::value(size));
rjson::add(items_descr, "ScannedCount", rjson::value(scanned_count));
// If attrs_to_get && attrs_to_get->empty(), this means the user asked not
@@ -4769,10 +4785,11 @@ static future<std::tuple<rjson::value, size_t>> describe_items(const cql3::selec
// In that case, we currently build a list of empty items and here drop
// it. We could just count the items and not bother with the empty items.
// (However, remember that when we do have a filter, we need the items).
std::optional<utils::chunked_vector<rjson::value>> opt_items;
if (!attrs_to_get || !attrs_to_get->empty()) {
rjson::add(items_descr, "Items", std::move(items));
opt_items = std::move(items);
}
co_return std::tuple<rjson::value, size_t>{std::move(items_descr), size};
co_return std::tuple(std::move(items_descr), std::move(opt_items), size);
}
static rjson::value encode_paging_state(const schema& schema, const service::pager::paging_state& paging_state) {
@@ -4810,6 +4827,12 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
return last_evaluated_key;
}
// RapidJSON allocates arrays contiguously in memory, so we want to avoid
// returning a large number of items as a single rapidjson array, and use
// a chunked_vector instead. The following constant is an arbitrary cutoff
// point for when to switch from a rapidjson array to a chunked_vector.
static constexpr int max_items_for_rapidjson_array = 256;
static future<executor::request_return_type> do_query(service::storage_proxy& proxy,
schema_ptr table_schema,
const rjson::value* exclusive_start_key,
@@ -4882,19 +4905,35 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
}
auto paging_state = rs->get_metadata().paging_state();
bool has_filter = filter;
auto [items, size] = co_await describe_items(*selection, std::move(rs), std::move(attrs_to_get), std::move(filter));
auto [items_descr, opt_items, size] = co_await describe_items(*selection, std::move(rs), std::move(attrs_to_get), std::move(filter));
if (paging_state) {
rjson::add(items, "LastEvaluatedKey", encode_paging_state(*table_schema, *paging_state));
rjson::add(items_descr, "LastEvaluatedKey", encode_paging_state(*table_schema, *paging_state));
}
if (has_filter){
cql_stats.filtered_rows_read_total += p->stats().rows_read_total;
// update our "filtered_rows_matched_total" for all the rows matched, despite the filter
cql_stats.filtered_rows_matched_total += size;
}
if (is_big(items)) {
co_return executor::request_return_type(make_streamed(std::move(items)));
if (opt_items) {
if (opt_items->size() >= max_items_for_rapidjson_array) {
// There are many items, better print the JSON and the array of
// items (opt_items) separately to avoid RapidJSON's contiguous
// allocation of arrays.
co_return make_streamed_with_extra_array(std::move(items_descr), "Items", std::move(*opt_items));
}
// There aren't many items in the chunked vector opt_items,
// let's just insert them into the JSON object and print the
// full JSON normally.
rjson::value items_json = rjson::empty_array();
for (auto& item : *opt_items) {
rjson::push_back(items_json, std::move(item));
}
rjson::add(items_descr, "Items", std::move(items_json));
}
co_return executor::request_return_type(make_jsonable(std::move(items)));
if (is_big(items_descr)) {
co_return make_streamed(std::move(items_descr));
}
co_return rjson::print(std::move(items_descr));
}
static dht::token token_for_segment(int segment, int total_segments) {
@@ -5489,7 +5528,7 @@ future<executor::request_return_type> executor::list_tables(client_state& client
std::string exclusive_start = exclusive_start_json ? exclusive_start_json->GetString() : "";
int limit = limit_json ? limit_json->GetInt() : 100;
if (limit < 1 || limit > 100) {
return make_ready_future<request_return_type>(api_error::validation("Limit must be greater than 0 and no greater than 100"));
co_return api_error::validation("Limit must be greater than 0 and no greater than 100");
}
auto tables = _proxy.data_dictionary().get_tables(); // hold on to temporary, table_names isn't a container, it's a view
@@ -5531,7 +5570,7 @@ future<executor::request_return_type> executor::list_tables(client_state& client
rjson::add(response, "LastEvaluatedTableName", rjson::copy(last_table_name));
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(response)));
co_return rjson::print(std::move(response));
}
future<executor::request_return_type> executor::describe_endpoints(client_state& client_state, service_permit permit, rjson::value request, std::string host_header) {
@@ -5542,8 +5581,8 @@ future<executor::request_return_type> executor::describe_endpoints(client_state&
if (!override.empty()) {
if (override == "disabled") {
_stats.unsupported_operations++;
return make_ready_future<request_return_type>(api_error::unknown_operation(
"DescribeEndpoints disabled by configuration (alternator_describe_endpoints=disabled)"));
co_return api_error::unknown_operation(
"DescribeEndpoints disabled by configuration (alternator_describe_endpoints=disabled)");
}
host_header = std::move(override);
}
@@ -5555,13 +5594,13 @@ future<executor::request_return_type> executor::describe_endpoints(client_state&
// A "Host:" header includes both host name and port, exactly what we need
// to return.
if (host_header.empty()) {
return make_ready_future<request_return_type>(api_error::validation("DescribeEndpoints needs a 'Host:' header in request"));
co_return api_error::validation("DescribeEndpoints needs a 'Host:' header in request");
}
rjson::add(response, "Endpoints", rjson::empty_array());
rjson::push_back(response["Endpoints"], rjson::empty_object());
rjson::add(response["Endpoints"][0], "Address", rjson::from_string(host_header));
rjson::add(response["Endpoints"][0], "CachePeriodInMinutes", rjson::value(1440));
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(response)));
co_return rjson::print(std::move(response));
}
static std::map<sstring, sstring> get_network_topology_options(service::storage_proxy& sp, gms::gossiper& gossiper, int rf) {
@@ -5596,7 +5635,7 @@ future<executor::request_return_type> executor::describe_continuous_backups(clie
rjson::add(desc, "PointInTimeRecoveryDescription", std::move(pitr));
rjson::value response = rjson::empty_object();
rjson::add(response, "ContinuousBackupsDescription", std::move(desc));
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
// Create the metadata for the keyspace in which we put the alternator

View File

@@ -10,8 +10,8 @@
#include <seastar/core/future.hh>
#include "seastarx.hh"
#include <seastar/json/json_elements.hh>
#include <seastar/core/sharded.hh>
#include <seastar/util/noncopyable_function.hh>
#include "service/migration_manager.hh"
#include "service/client_state.hh"
@@ -58,29 +58,6 @@ namespace alternator {
class rmw_operation;
struct make_jsonable : public json::jsonable {
rjson::value _value;
public:
explicit make_jsonable(rjson::value&& value);
std::string to_json() const override;
};
/**
* Make return type for serializing the object "streamed",
* i.e. direct to HTTP output stream. Note: only useful for
* (very) large objects as there are overhead issues with this
* as well, but for massive lists of return objects this can
* help avoid large allocations/many re-allocs
*/
json::json_return_type make_streamed(rjson::value&&);
struct json_string : public json::jsonable {
std::string _value;
public:
explicit json_string(std::string&& value);
std::string to_json() const override;
};
namespace parsed {
class path;
};
@@ -169,7 +146,19 @@ class executor : public peering_sharded_service<executor> {
public:
using client_state = service::client_state;
using request_return_type = std::variant<json::json_return_type, api_error>;
// request_return_type is the return type of the executor methods, which
// can be one of:
// 1. A string, which is the response body for the request.
// 2. A body_writer, an asynchronous function (returning future<>) that
// takes an output_stream and writes the response body into it.
// 3. An api_error, which is an error response that should be returned to
// the client.
// The body_writer is used for streaming responses, where the response body
// is written in chunks to the output_stream. This allows for efficient
// handling of large responses without needing to allocate a large buffer
// in memory.
using body_writer = noncopyable_function<future<>(output_stream<char>&&)>;
using request_return_type = std::variant<std::string, body_writer, api_error>;
stats _stats;
// The metric_groups object holds this stat object's metrics registered
// as long as the stats object is alive.
@@ -275,4 +264,13 @@ bool is_big(const rjson::value& val, int big_size = 100'000);
// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(bool enforce_authorization, const service::client_state&, const schema_ptr&, auth::permission);
/**
* Make return type for serializing the object "streamed",
* i.e. direct to HTTP output stream. Note: only useful for
* (very) large objects as there are overhead issues with this
* as well, but for massive lists of return objects this can
* help avoid large allocations/many re-allocs
*/
executor::body_writer make_streamed(rjson::value&&);
}

View File

@@ -13,7 +13,6 @@
#include <seastar/http/function_handlers.hh>
#include <seastar/http/short_streams.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/json/json_elements.hh>
#include <seastar/util/defer.hh>
#include <seastar/util/short_streams.hh>
#include "seastarx.hh"
@@ -124,22 +123,22 @@ public:
}
auto res = resf.get();
std::visit(overloaded_functor {
[&] (const json::json_return_type& json_return_value) {
slogger.trace("api_handler success case");
if (json_return_value._body_writer) {
// Unfortunately, write_body() forces us to choose
// from a fixed and irrelevant list of "mime-types"
// at this point. But we'll override it with the
// one (application/x-amz-json-1.0) below.
rep->write_body("json", std::move(json_return_value._body_writer));
} else {
rep->_content += json_return_value._res;
}
},
[&] (const api_error& err) {
generate_error_reply(*rep, err);
}
}, res);
[&] (std::string&& str) {
// Note that despite the move, there is a copy here -
// as str is std::string and rep->_content is sstring.
rep->_content = std::move(str);
},
[&] (executor::body_writer&& body_writer) {
// Unfortunately, write_body() forces us to choose
// from a fixed and irrelevant list of "mime-types"
// at this point. But we'll override it with the
// correct one (application/x-amz-json-1.0) below.
rep->write_body("json", std::move(body_writer));
},
[&] (const api_error& err) {
generate_error_reply(*rep, err);
}
}, std::move(res));
return make_ready_future<std::unique_ptr<reply>>(std::move(rep));
});

View File

@@ -217,7 +217,7 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
rjson::add(ret, "LastEvaluatedStreamArn", *last);
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
struct shard_id {
@@ -491,7 +491,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
if (!opts.enabled()) {
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
// TODO: label
@@ -617,7 +617,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
rjson::add(stream_desc, "Shards", std::move(shards));
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
}
@@ -770,7 +770,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
auto ret = rjson::empty_object();
rjson::add(ret, "ShardIterator", iter);
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
struct event_id {
@@ -1021,7 +1021,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
// will notice the end of shard and not return NextShardIterator.
rjson::add(ret, "NextShardIterator", next_iter);
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
// ugh. figure out if we are at end-of-shard
@@ -1047,7 +1047,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
if (is_big(ret)) {
return make_ready_future<executor::request_return_type>(make_streamed(std::move(ret)));
}
return make_ready_future<executor::request_return_type>(make_jsonable(std::move(ret)));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
});
}

View File

@@ -118,7 +118,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
// basically identical to the request's
rjson::value response = rjson::empty_object();
rjson::add(response, "TimeToLiveSpecification", std::move(*spec));
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
@@ -135,7 +135,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta
}
rjson::value response = rjson::empty_object();
rjson::add(response, "TimeToLiveDescription", std::move(desc));
co_return make_jsonable(std::move(response));
co_return rjson::print(std::move(response));
}
// expiration_service is a sharded service responsible for cleaning up expired

View File

@@ -3161,6 +3161,22 @@
]
}
]
},
{
"path":"/storage_service/raft_topology/cmd_rpc_status",
"operations":[
{
"method":"GET",
"summary":"Get information about currently running topology cmd rpc",
"type":"string",
"nickname":"raft_topology_get_cmd_status",
"produces":[
"application/json"
],
"parameters":[
]
}
]
}
],
"models":{

View File

@@ -1670,6 +1670,18 @@ rest_raft_topology_upgrade_status(sharded<service::storage_service>& ss, std::un
co_return sstring(format("{}", ustate));
}
static
future<json::json_return_type>
rest_raft_topology_get_cmd_status(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
const auto status = co_await ss.invoke_on(0, [] (auto& ss) {
return ss.get_topology_cmd_status();
});
if (status.active_dst.empty()) {
co_return sstring("none");
}
co_return sstring(fmt::format("{}[{}]: {}", status.current, status.index, fmt::join(status.active_dst, ",")));
}
static
future<json::json_return_type>
rest_move_tablet(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
@@ -1902,6 +1914,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::reload_raft_topology_state.set(r, rest_bind(rest_reload_raft_topology_state, ss, group0_client));
ss::upgrade_to_raft_topology.set(r, rest_bind(rest_upgrade_to_raft_topology, ss));
ss::raft_topology_upgrade_status.set(r, rest_bind(rest_raft_topology_upgrade_status, ss));
ss::raft_topology_get_cmd_status.set(r, rest_bind(rest_raft_topology_get_cmd_status, ss));
ss::move_tablet.set(r, rest_bind(rest_move_tablet, ctx, ss));
ss::add_tablet_replica.set(r, rest_bind(rest_add_tablet_replica, ctx, ss));
ss::del_tablet_replica.set(r, rest_bind(rest_del_tablet_replica, ctx, ss));
@@ -1983,6 +1996,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::reload_raft_topology_state.unset(r);
ss::upgrade_to_raft_topology.unset(r);
ss::raft_topology_upgrade_status.unset(r);
ss::raft_topology_get_cmd_status.unset(r);
ss::move_tablet.unset(r);
ss::add_tablet_replica.unset(r);
ss::del_tablet_replica.unset(r);

View File

@@ -227,7 +227,9 @@ future<> password_authenticator::start() {
utils::get_local_injector().inject("password_authenticator_start_pause", utils::wait_for_message(5min)).get();
if (!legacy_mode(_qp)) {
maybe_create_default_password_with_retries().get();
_superuser_created_promise.set_value();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
}
});
});

View File

@@ -321,7 +321,9 @@ future<> standard_role_manager::start() {
}
if (!legacy) {
co_await maybe_create_default_role_with_retries();
_superuser_created_promise.set_value();
if (!_superuser_created_promise.available()) {
_superuser_created_promise.set_value();
}
}
};

View File

@@ -960,8 +960,12 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to the given value for the given row.
void set_value(const clustering_key& log_ck, const column_definition& base_cdef, const managed_bytes_view& value) {
auto& log_cdef = *_log_schema.get_column_definition(log_data_column_name_bytes(base_cdef.name()));
_log_mut.set_cell(log_ck, log_cdef, atomic_cell::make_live(*base_cdef.type, _ts, value, _ttl));
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
_log_mut.set_cell(log_ck, *log_cdef_ptr, atomic_cell::make_live(*base_cdef.type, _ts, value, _ttl));
}
// Each regular and static column in the base schema has a corresponding column in the log schema
@@ -969,7 +973,13 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to `true` for the given row. If not called, the column will be `null`.
void set_deleted(const clustering_key& log_ck, const column_definition& base_cdef) {
_log_mut.set_cell(log_ck, log_data_column_deleted_name_bytes(base_cdef.name()), data_value(true), _ts, _ttl);
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_deleted_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
auto& log_cdef = *log_cdef_ptr;
_log_mut.set_cell(log_ck, *log_cdef_ptr, atomic_cell::make_live(*log_cdef.type, _ts, log_cdef.type->decompose(true), _ttl));
}
// Each regular and static non-atomic column in the base schema has a corresponding column in the log schema
@@ -978,7 +988,12 @@ public:
// Given a reference to such a column from the base schema, this function sets the corresponding column
// in the log to the given set of keys for the given row.
void set_deleted_elements(const clustering_key& log_ck, const column_definition& base_cdef, const managed_bytes& deleted_elements) {
auto& log_cdef = *_log_schema.get_column_definition(log_data_column_deleted_elements_name_bytes(base_cdef.name()));
auto log_cdef_ptr = _log_schema.get_column_definition(log_data_column_deleted_elements_name_bytes(base_cdef.name()));
if (!log_cdef_ptr) {
throw exceptions::invalid_request_exception(format("CDC log schema for {}.{} does not have base column {}",
_log_schema.ks_name(), _log_schema.cf_name(), base_cdef.name_as_text()));
}
auto& log_cdef = *log_cdef_ptr;
_log_mut.set_cell(log_ck, log_cdef, atomic_cell::make_live(*log_cdef.type, _ts, deleted_elements, _ttl));
}
@@ -1865,5 +1880,10 @@ bool cdc::cdc_service::needs_cdc_augmentation(const std::vector<mutation>& mutat
future<std::tuple<std::vector<mutation>, lw_shared_ptr<cdc::operation_result_tracker>>>
cdc::cdc_service::augment_mutation_call(lowres_clock::time_point timeout, std::vector<mutation>&& mutations, tracing::trace_state_ptr tr_state, db::consistency_level write_cl) {
if (utils::get_local_injector().enter("sleep_before_cdc_augmentation")) {
return seastar::sleep(std::chrono::milliseconds(100)).then([this, timeout, mutations = std::move(mutations), tr_state = std::move(tr_state), write_cl] () mutable {
return _impl->augment_mutation_call(timeout, std::move(mutations), std::move(tr_state), write_cl);
});
}
return _impl->augment_mutation_call(timeout, std::move(mutations), std::move(tr_state), write_cl);
}

View File

@@ -245,12 +245,18 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
qp.db().real_database().validate_keyspace_update(*ks_md_update);
service::topology_mutation_builder builder(ts);
service::topology_request_tracking_mutation_builder rtbuilder{global_request_id, qp.proxy().features().topology_requests_type_column};
rtbuilder.set("done", false)
.set("start_time", db_clock::now());
if (!qp.proxy().features().topology_global_request_queue) {
builder.set_global_topology_request(service::global_topology_request::keyspace_rf_change);
builder.set_global_topology_request_id(global_request_id);
builder.set_new_keyspace_rf_change_data(_name, ks_options);
} else {
builder.queue_global_topology_request_id(global_request_id);
rtbuilder.set("request_type", service::global_topology_request::keyspace_rf_change)
.set_new_keyspace_rf_change_data(_name, ks_options);
};
service::topology_change change{{builder.build()}};
@@ -259,13 +265,6 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
return cm.to_mutation(topo_schema);
});
service::topology_request_tracking_mutation_builder rtbuilder{global_request_id, qp.proxy().features().topology_requests_type_column};
rtbuilder.set("done", false)
.set("start_time", db_clock::now())
.set("request_type", service::global_topology_request::keyspace_rf_change);
if (qp.proxy().features().topology_global_request_queue) {
rtbuilder.set_new_keyspace_rf_change_data(_name, ks_options);
}
service::topology_change req_change{{rtbuilder.build()}};
auto topo_req_schema = qp.db().find_schema(db::system_keyspace::NAME, db::system_keyspace::TOPOLOGY_REQUESTS);

View File

@@ -8,6 +8,7 @@
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#include "cdc/log.hh"
#include "utils/assert.hh"
#include <seastar/core/coroutine.hh>
#include "cql3/query_options.hh"
@@ -27,6 +28,7 @@
#include "db/view/view.hh"
#include "cql3/query_processor.hh"
#include "cdc/cdc_extension.hh"
#include "cdc/cdc_partitioner.hh"
namespace cql3 {
@@ -290,6 +292,53 @@ std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_sche
throw exceptions::invalid_request_exception("Cannot use ALTER TABLE on Materialized View");
}
const bool is_cdc_log_table = cdc::is_log_for_some_table(db.real_database(), s->ks_name(), s->cf_name());
// Only a CDC log table will have this partitioner name. User tables should
// not be able to set this. Note that we perform a similar check when trying to
// re-enable CDC for a table, when the log table has been replaced by a user table.
// For better visualization of the above, consider this
//
// cqlsh> CREATE TABLE ks.t (p int PRIMARY KEY, v int) WITH cdc = {'enabled': true};
// cqlsh> INSERT INTO ks.t (p, v) VALUES (1, 2);
// cqlsh> ALTER TABLE ks.t WITH cdc = {'enabled': false};
// cqlsh> DESC TABLE ks.t_scylla_cdc_log WITH INTERNALS; # Save this output!
// cqlsh> DROP TABLE ks.t_scylla_cdc_log;
// cqlsh> [Recreate the log table using the received statement]
// cqlsh> ALTER TABLE ks.t WITH cdc = {'enabled': true};
//
// InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot create CDC log
// table for table ks.t because a table of name ks.t_scylla_cdc_log already exists"
//
// See commit adda43edc75b901b2329bca8f3eb74596698d05f for more information on THAT case.
// We reuse the same technique here.
const bool was_cdc_log_table = s->get_partitioner().name() == cdc::cdc_partitioner::classname;
if (_column_changes.size() != 0 && is_cdc_log_table) {
throw exceptions::invalid_request_exception(
"You cannot modify the set of columns of a CDC log table directly. "
"Modify the base table instead.");
}
if (_column_changes.size() != 0 && was_cdc_log_table) {
throw exceptions::invalid_request_exception(
"You cannot modify the set of columns of a CDC log table directly. "
"Although the base table has deactivated CDC, this table will continue being "
"a CDC log table until it is dropped. If you want to modify the columns in it, "
"you can only do that by reenabling CDC on the base table, which will reattach "
"this log table. Then you will be able to modify the columns in the base table, "
"and that will have effect on the log table too. Modifying the columns of a CDC "
"log table directly is never allowed.");
}
if (_renames.size() != 0 && is_cdc_log_table) {
throw exceptions::invalid_request_exception("Cannot rename a column of a CDC log table.");
}
if (_renames.size() != 0 && was_cdc_log_table) {
throw exceptions::invalid_request_exception(
"You cannot rename a column of a CDC log table. Although the base table "
"has deactivated CDC, this table will continue being a CDC log table until it "
"is dropped.");
}
auto cfm = schema_builder(s);
if (_properties->get_id()) {

View File

@@ -36,7 +36,7 @@
static logging::logger blogger("batchlog_manager");
const uint32_t db::batchlog_manager::replay_interval;
const std::chrono::seconds db::batchlog_manager::replay_interval;
const uint32_t db::batchlog_manager::page_size;
db::batchlog_manager::batchlog_manager(cql3::query_processor& qp, db::system_keyspace& sys_ks, batchlog_manager_config config)
@@ -116,7 +116,8 @@ future<> db::batchlog_manager::batchlog_replay_loop() {
} catch (...) {
blogger.error("Exception in batch replay: {}", std::current_exception());
}
delay = std::chrono::milliseconds(replay_interval);
delay = utils::get_local_injector().is_enabled("short_batchlog_manager_replay_interval") ?
std::chrono::seconds(1) : replay_interval;
}
}
@@ -132,6 +133,8 @@ future<> db::batchlog_manager::drain() {
_sem.broken();
}
co_await _qp.proxy().abort_batch_writes();
co_await std::move(_loop_done);
blogger.info("Drained");
}
@@ -173,6 +176,11 @@ future<> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cle
return make_ready_future<stop_iteration>(stop_iteration::no);
}
if (utils::get_local_injector().is_enabled("skip_batch_replay")) {
blogger.debug("Skipping batch replay due to skip_batch_replay injection");
return make_ready_future<stop_iteration>(stop_iteration::no);
}
// check version of serialization format
if (!row.has("version")) {
blogger.warn("Skipping logged batch because of unknown version");
@@ -242,7 +250,8 @@ future<> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cle
// send to partially or wholly fail in actually sending stuff. Since we don't
// have hints (yet), send with CL=ALL, and hope we can re-do this soon.
// See below, we use retry on write failure.
return _qp.proxy().mutate(mutations, db::consistency_level::ALL, db::no_timeout, nullptr, empty_service_permit(), db::allow_per_partition_rate_limit::no);
auto timeout = db::timeout_clock::now() + write_timeout;
return _qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
});
}).then_wrapped([this, id](future<> batch_result) {
try {

View File

@@ -43,8 +43,9 @@ public:
using post_replay_cleanup = bool_class<class post_replay_cleanup_tag>;
private:
static constexpr uint32_t replay_interval = 60 * 1000; // milliseconds
static constexpr std::chrono::seconds replay_interval = std::chrono::seconds(60);
static constexpr uint32_t page_size = 128; // same as HHOM, for now, w/out using any heuristics. TODO: set based on avg batch size.
static constexpr std::chrono::seconds write_timeout = std::chrono::seconds(300);
using clock_type = lowres_clock;

View File

@@ -1230,7 +1230,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, sstable_summary_ratio(this, "sstable_summary_ratio", value_status::Used, 0.0005, "Enforces that 1 byte of summary is written for every N (2000 by default)"
"bytes written to data file. Value must be between 0 and 1.")
, components_memory_reclaim_threshold(this, "components_memory_reclaim_threshold", liveness::LiveUpdate, value_status::Used, .2, "Ratio of available memory for all in-memory components of SSTables in a shard beyond which the memory will be reclaimed from components until it falls back under the threshold. Currently, this limit is only enforced for bloom filters.")
, large_memory_allocation_warning_threshold(this, "large_memory_allocation_warning_threshold", value_status::Used, (size_t(128) << 10) + 1, "Warn about memory allocations above this size; set to zero to disable.")
, large_memory_allocation_warning_threshold(this, "large_memory_allocation_warning_threshold", value_status::Used, size_t(1) << 20, "Warn about memory allocations above this size; set to zero to disable.")
, enable_deprecated_partitioners(this, "enable_deprecated_partitioners", value_status::Used, false, "Enable the byteordered and random partitioners. These partitioners are deprecated and will be removed in a future version.")
, enable_keyspace_column_family_metrics(this, "enable_keyspace_column_family_metrics", value_status::Used, false, "Enable per keyspace and per column family metrics reporting.")
, enable_node_aggregated_table_metrics(this, "enable_node_aggregated_table_metrics", value_status::Used, true, "Enable aggregated per node, per keyspace and per table metrics reporting, applicable if enable_keyspace_column_family_metrics is false.")

View File

@@ -86,9 +86,9 @@ if __name__ == '__main__':
ethpciid = ''
if network_mode == 'dpdk':
dpdk_status = out('/opt/scylladb/scripts/dpdk-devbind.py --status')
match = re.search('if={} drv=(\S+)'.format(ifname), dpdk_status, flags=re.MULTILINE)
match = re.search(r'if={} drv=(\S+)'.format(ifname), dpdk_status, flags=re.MULTILINE)
ethdrv = match.group(1)
match = re.search('^(\\S+:\\S+:\\S+\.\\S+) [^\n]+ if={} '.format(ifname), dpdk_status, flags=re.MULTILINE)
match = re.search(r'^(\S+:\S+:\S+\.\S+) [^\n]+ if={} '.format(ifname), dpdk_status, flags=re.MULTILINE)
ethpciid = match.group(1)
if args.mode:

View File

@@ -18,7 +18,7 @@ Breaks: scylla-enterprise-conf (<< 2025.1.0~)
Package: %{product}-server
Architecture: any
Depends: ${misc:Depends}, %{product}-conf (= ${binary:Version}), %{product}-python3 (= ${binary:Version})
Depends: ${misc:Depends}, %{product}-conf (= ${binary:Version}), %{product}-python3 (= ${binary:Version}), procps
Replaces: %{product}-tools (<<5.5), scylla-enterprise-tools (<< 2024.2.0~), scylla-enterprise-server (<< 2025.1.0~)
Breaks: %{product}-tools (<<5.5), scylla-enterprise-tools (<< 2024.2.0~), scylla-enterprise-server (<< 2025.1.0~)
Description: Scylla database server binaries

View File

@@ -88,7 +88,7 @@ bcp LICENSE-ScyllaDB-Source-Available.md /licenses/
run microdnf clean all
run microdnf --setopt=tsflags=nodocs -y update
run microdnf --setopt=tsflags=nodocs -y install hostname python3 python3-pip kmod
run microdnf --setopt=tsflags=nodocs -y install hostname kmod procps-ng python3 python3-pip
run microdnf clean all
run pip3 install --no-cache-dir --prefix /usr supervisor
run bash -ec "echo LANG=C.UTF-8 > /etc/locale.conf"

View File

@@ -76,6 +76,7 @@ Group: Applications/Databases
Summary: The Scylla database server
Requires: %{product}-conf = %{version}-%{release}
Requires: %{product}-python3 = %{version}-%{release}
Requires: procps-ng
AutoReqProv: no
Provides: %{product}-tools:%{_bindir}/nodetool
Provides: %{product}-tools:%{_sysconfigdir}/bash_completion.d/nodetool-completion

View File

@@ -172,4 +172,7 @@
/stable/upgrade/upgrade-opensource/upgrade-guide-from-4.5-to-4.6/metric-update-4.5-to-4.6.html: /stable/upgrade/index.html
# Divide API reference to smaller files
# /stable/reference/api-reference.html: /stable/reference/api/index.html
# /stable/reference/api-reference.html: /stable/reference/api/index.html
# Fixed typo in the file name
/stable/operating-scylla/nodetool-commands/enbleautocompaction.html: /stable/operating-scylla/nodetool-commands/enableautocompaction.html

View File

@@ -481,7 +481,8 @@ Creating a new user-defined type is done using a ``CREATE TYPE`` statement defin
field_definition: `identifier` `cql_type`
A UDT has a name (``udt_name``), which is used to declare columns of that type and is a set of named and typed fields. The ``udt_name`` can be any
type, including collections or other UDTs. UDTs and collections inside collections must always be frozen (no matter which version of ScyllaDB you are using).
type, including collections or other UDTs.
Similar to collections, a UDT can be frozen or non-frozen. A frozen UDT is immutable and can only be updated as a whole. Nested UDTs or UDTs used in keys must always be frozen.
For example::
@@ -506,26 +507,15 @@ For example::
CREATE TABLE superheroes (
name frozen<full_name> PRIMARY KEY,
home frozen<address>
home address
);
.. note::
- Attempting to create an already existing type will result in an error unless the ``IF NOT EXISTS`` option is used. If it is used, the statement will be a no-op if the type already exists.
- A type is intrinsically bound to the keyspace in which it is created and can only be used in that keyspace. At creation, if the type name is prefixed by a keyspace name, it is created in that keyspace. Otherwise, it is created in the current keyspace.
- As of ScyllaDB Open Source 3.2, UDTs not inside collections do not have to be frozen, but in all versions prior to ScyllaDB Open Source 3.2, and in all ScyllaDB Enterprise versions, UDTs **must** be frozen.
A non-frozen UDT example with ScyllaDB Open Source 3.2 and higher::
CREATE TYPE ut (a int, b int);
CREATE TABLE cf (a int primary key, b ut);
Same UDT in versions prior::
CREATE TYPE ut (a int, b int);
CREATE TABLE cf (a int primary key, b frozen<ut>);
UDT literals
~~~~~~~~~~~~

View File

@@ -26,6 +26,7 @@ Syntax
--table <table>
[--nowait]
[--scope <scope>]
[--sstables-file-list <file>]
<sstables>...
Example
@@ -51,6 +52,7 @@ Options
* ``--table`` - Name of the table to load SSTables into
* ``--nowait`` - Don't wait on the restore process
* ``--scope <scope>`` - Use specified load-and-stream scope
* ``--sstables-file-list <file>`` - Restore the SSTables listed in the given ``<file>``. The list should be newline-separated.
* ``<sstables>`` - Remainder of keys of the TOC (Table of Contents) components of SSTables to restore, relative to the specified prefix
The `scope` parameter describes the subset of cluster nodes where you want to load data:
@@ -60,6 +62,8 @@ The `scope` parameter describes the subset of cluster nodes where you want to lo
* `dc` - In the datacenter (DC) where the local node lives.
* `all` (default) - Everywhere across the cluster.
`--sstables-file-list <file>` and `<sstables>` can be combined; `nodetool restore` will attempt to restore the combined list. Duplicates are *not* removed.
To fully restore a cluster, you should combine the ``scope`` parameter with the correct list of
SStables to restore to each node.
On one extreme, one node is given all SStables with the scope ``all``; on the other extreme, all

View File

@@ -25,7 +25,7 @@ Nodetool
nodetool-commands/disablebinary
nodetool-commands/disablegossip
nodetool-commands/drain
nodetool-commands/enbleautocompaction
nodetool-commands/enableautocompaction
nodetool-commands/enablebackup
nodetool-commands/enablebinary
nodetool-commands/enablegossip
@@ -98,7 +98,7 @@ Operations that are not listed below are currently not available.
* :doc:`disablebinary </operating-scylla/nodetool-commands/disablebinary/>` - Disable native transport (binary protocol).
* :doc:`disablegossip </operating-scylla/nodetool-commands/disablegossip/>` - Disable gossip (effectively marking the node down).
* :doc:`drain </operating-scylla/nodetool-commands/drain/>` - Drain the node (stop accepting writes and flush all column families).
* :doc:`enableautocompaction </operating-scylla/nodetool-commands/enbleautocompaction/>` - Enable automatic compaction of a keyspace or table.
* :doc:`enableautocompaction </operating-scylla/nodetool-commands/enableautocompaction/>` - Enable automatic compaction of a keyspace or table.
* :doc:`enablebackup </operating-scylla/nodetool-commands/enablebackup/>` - Enable incremental backup.
* :doc:`enablebinary </operating-scylla/nodetool-commands/enablebinary/>` - Re-enable native transport (binary protocol).
* :doc:`enablegossip </operating-scylla/nodetool-commands/enablegossip/>` - Re-enable gossip.

View File

@@ -157,7 +157,7 @@ will leave the recovery mode and remove the obsolete internal Raft data.
After completing this step, Raft should be fully functional.
#. Replace all dead nodes from the cluster using the
#. Replace all dead nodes in the cluster using the
:doc:`node replacement procedure </operating-scylla/procedures/cluster-management/replace-dead-node/>`.
.. note::

View File

@@ -701,5 +701,97 @@ std::unique_ptr<data_sink_impl> make_encrypted_sink(data_sink sink, shared_ptr<s
return std::make_unique<encrypted_data_sink>(std::move(sink), std::move(k));
}
class encrypted_data_source : public data_source_impl, public block_encryption_base {
input_stream<char> _input;
temporary_buffer<char> _next;
size_t _current_position = 0;
size_t _skip = 0;
public:
encrypted_data_source(data_source source, shared_ptr<symmetric_key> k)
: block_encryption_base(std::move(k))
, _input(std::move(source))
{}
future<temporary_buffer<char>> get() override {
// First, get as much as we can get now (or the remainder of previous call)
auto buf1 = _next.empty()
? co_await _input.read()
: std::exchange(_next, {})
;
// eof?
if (buf1.empty()) {
co_return buf1;
}
// now we need one page more to be able to save one for next lap
auto fill_size = align_up(buf1.size(), block_size) + block_size - buf1.size();
auto buf2 = co_await _input.read_exactly(fill_size);
temporary_buffer<char> output(buf1.size() + buf2.size());
// we copy data even for the part we will cache. this to
// fix block alignment of the resulting shared buffer.
std::copy(buf1.begin(), buf1.end(), output.get_write());
std::copy(buf2.begin(), buf2.end(), output.get_write() + buf1.size());
// we need to keep one page buffered (beyond the input stream buffer - would be neat
// to share it), to be able to detect actual eof stream size. We always need
// at least _two_ pages of data to process, to be able to handle the case where
// actual size is <aligned block size> - <less than key block size>.
// I.e. stream size = 8180, encrypted data size will be 8192, and data stream
// will be 8196. So we need to make sure buf1 == [4096-8192], and buf2 == [8192-8196].
if (is_aligned(output.size(), block_size) && output.size() >= 2*block_size) {
_next = output.share(output.size() - block_size, block_size);
output.trim(output.size() - block_size);
}
const size_t key_block_size = _key->block_size();
// decrypt all blocks we have to return. might include the last, partial block
for (size_t offset = 0; offset < output.size(); offset += block_size, _current_position += block_size) {
auto iv = iv_for(_current_position);
auto rem = std::min(block_size, output.size() - offset);
_key->transform_unpadded(mode::decrypt, output.get() + offset, align_down(rem, key_block_size), output.get_write() + offset, iv.data());
}
// now, if the output buffer is not aligned, we are at eof, and
// also need to trim result.
if (!is_aligned(output.size(), key_block_size)) {
output.trim(output.size() - std::min(output.size(), key_block_size));
}
assert(is_aligned(_current_position, block_size));
// finally trim front to handle any skip remainders
output.trim_front(std::min(std::exchange(_skip, 0), output.size()));
co_return output;
}
future<temporary_buffer<char>> skip(uint64_t n) override {
if (n >= block_size) {
// since we only give back data aligned to block_size chunks,
// a client would only ever skip from a block boundary.
auto to_skip = align_down(n, block_size);
assert(is_aligned(_next.size(), block_size));
co_await _input.skip(to_skip - _next.size());
n -= to_skip;
_current_position += to_skip;
_next = {};
}
_skip = n;
co_return temporary_buffer<char>{};
}
future<> close() override {
return _input.close();
}
};
std::unique_ptr<data_source_impl> make_encrypted_source(data_source source, shared_ptr<symmetric_key> k) {
return std::make_unique<encrypted_data_source>(std::move(source), std::move(k));
}
}

View File

@@ -25,4 +25,6 @@ shared_ptr<file_impl> make_delayed_encrypted_file(file, size_t, get_key_func);
std::unique_ptr<data_sink_impl> make_encrypted_sink(data_sink, ::shared_ptr<symmetric_key>);
std::unique_ptr<data_source_impl> make_encrypted_source(data_source source, shared_ptr<symmetric_key> k);
}

View File

@@ -472,6 +472,14 @@ public:
for (auto&& [id, h] : _per_thread_kmip_host_cache[this_shard_id()]) {
co_await h->disconnect();
}
static auto stop_all = [](auto&& cache) -> future<> {
for (auto& [k, host] : cache) {
co_await host->stop();
}
};
co_await stop_all(_per_thread_kms_host_cache[this_shard_id()]);
co_await stop_all(_per_thread_gcp_host_cache[this_shard_id()]);
_per_thread_provider_cache[this_shard_id()].clear();
_per_thread_system_key_cache[this_shard_id()].clear();
_per_thread_kmip_host_cache[this_shard_id()].clear();
@@ -676,6 +684,33 @@ public:
return res;
}
std::tuple<opt_bytes, shared_ptr<encryption_schema_extension>> get_encryption_schema_extension(const sstables::sstable& sst,
sstables::component_type type) const {
const auto& sc = sst.get_shared_components();
if (!sc.scylla_metadata) {
return {};
}
const auto* ext_attr = sc.scylla_metadata->get_extension_attributes();
if (!ext_attr) {
return {};
}
bool ok = ext_attr->map.contains(encryption_attribute_ds);
if (ok && type != sstables::component_type::Data) {
ok = (ser::deserialize_from_buffer(ext_attr->map.at(encrypted_components_attribute_ds).value, std::type_identity<uint32_t>{}, 0) & (1 << static_cast<int>(type))) > 0;
}
if (!ok) {
return {};
}
auto esx = encryption_schema_extension::create(*_ctxt, ext_attr->map.at(encryption_attribute_ds).value);
opt_bytes id;
if (ext_attr->map.contains(key_id_attribute_ds)) {
id = ext_attr->map.at(key_id_attribute_ds).value;
}
return {std::move(id), std::move(esx)};
}
future<file> wrap_file(const sstables::sstable& sst, sstables::component_type type, file f, open_flags flags) override {
switch (type) {
case sstables::component_type::Scylla:
@@ -688,44 +723,21 @@ public:
if (flags == open_flags::ro) {
// open existing. check read opts.
auto& sc = sst.get_shared_components();
if (sc.scylla_metadata) {
auto* exta = sc.scylla_metadata->get_extension_attributes();
if (exta) {
auto i = exta->map.find(encryption_attribute_ds);
// note: earlier builds of encryption extension would only encrypt data component,
// so iff we are opening old sstables we need to check if this component is actually
// encrypted. We use a bitmask attribute for this.
auto [id, esx] = get_encryption_schema_extension(sst, type);
if (esx) {
if (esx->should_delay_read(id)) {
logg.debug("Encrypted sstable component {} using delayed opening {} (id: {})", sst.component_basename(type), *esx, id);
bool ok = i != exta->map.end();
if (ok && type != sstables::component_type::Data) {
ok = exta->map.count(encrypted_components_attribute_ds) &&
(ser::deserialize_from_buffer(exta->map.at(encrypted_components_attribute_ds).value, std::type_identity<uint32_t>{}, 0) & (1 << int(type)));
}
if (ok) {
auto esx = encryption_schema_extension::create(*_ctxt, i->second.value);
opt_bytes id;
if (exta->map.count(key_id_attribute_ds)) {
id = exta->map.at(key_id_attribute_ds).value;
}
if (esx->should_delay_read(id)) {
logg.debug("Encrypted sstable component {} using delayed opening {} (id: {})", sst.component_basename(type), *esx, id);
co_return make_delayed_encrypted_file(f, esx->key_block_size(), [esx, comp = sst.component_basename(type), id = std::move(id)] {
logg.trace("Delayed component {} using {} (id: {}) resolve", comp, *esx, id);
return esx->key_for_read(id);
});
}
logg.debug("Open encrypted sstable component {} using {} (id: {})", sst.component_basename(type), *esx, id);
auto k = co_await esx->key_for_read(std::move(id));
co_return make_encrypted_file(f, std::move(k));
}
co_return make_delayed_encrypted_file(f, esx->key_block_size(), [esx, comp = sst.component_basename(type), id = std::move(id)] {
logg.trace("Delayed component {} using {} (id: {}) resolve", comp, *esx, id);
return esx->key_for_read(id);
});
}
logg.debug("Open encrypted sstable component {} using {} (id: {})", sst.component_basename(type), *esx, id);
auto k = co_await esx->key_for_read(std::move(id));
co_return make_encrypted_file(f, std::move(k));
}
} else {
if (co_await wrap_writeonly(sst, type, [&f](shared_ptr<symmetric_key> k) { f = make_encrypted_file(std::move(f), std::move(k)); })) {
@@ -823,6 +835,36 @@ public:
});
co_return sink;
}
future<data_source> wrap_source(const sstables::sstable& sst,
sstables::component_type type,
sstables::data_source_creator_fn data_source_creator,
uint64_t offset,
uint64_t len) override {
switch (type) {
case sstables::component_type::Scylla:
case sstables::component_type::TemporaryTOC:
case sstables::component_type::TOC:
co_return data_source_creator(offset, len);
case sstables::component_type::CompressionInfo:
case sstables::component_type::CRC:
case sstables::component_type::Data:
case sstables::component_type::Digest:
case sstables::component_type::Filter:
case sstables::component_type::Index:
case sstables::component_type::Statistics:
case sstables::component_type::Summary:
case sstables::component_type::TemporaryStatistics:
case sstables::component_type::Unknown:
auto [id, esx] = get_encryption_schema_extension(sst, type);
if (esx) {
auto key = co_await esx->key_for_read(std::move(id));
auto block_size = key->block_size();
co_return data_source(make_encrypted_source(data_source_creator(align_down(offset, block_size), align_up(len, block_size)), std::move(key)));
}
co_return data_source_creator(offset, len);
}
}
};
std::string encryption_provider(const sstables::sstable& sst) {

View File

@@ -97,6 +97,7 @@ public:
~impl() = default;
future<> init();
future<> stop();
const host_options& options() const {
return _options;
}
@@ -827,6 +828,11 @@ future<> encryption::gcp_host::impl::init() {
_initialized = true;
}
future<> encryption::gcp_host::impl::stop() {
co_await _attr_cache.stop();
co_await _id_cache.stop();
}
std::tuple<std::string, std::string> encryption::gcp_host::impl::parse_key(std::string_view spec) {
auto i = spec.find_last_of('/');
if (i == std::string_view::npos) {
@@ -989,6 +995,10 @@ future<> encryption::gcp_host::init() {
return _impl->init();
}
future<> encryption::gcp_host::stop() {
return _impl->stop();
}
const encryption::gcp_host::host_options& encryption::gcp_host::options() const {
return _impl->options();
}

View File

@@ -65,6 +65,8 @@ public:
~gcp_host();
future<> init();
future<> stop();
const host_options& options() const;
struct option_override : public t_credentials_source<std::optional<std::string>> {

View File

@@ -724,9 +724,11 @@ future<> kmip_host::impl::connect() {
}
future<> kmip_host::impl::disconnect() {
return do_for_each(_options.hosts, [this](const sstring& host) {
co_await do_for_each(_options.hosts, [this](const sstring& host) {
return clear_connections(host);
});
co_await _attr_cache.stop();
co_await _id_cache.stop();
}
static unsigned from_str(unsigned (*f)(char*, int, int*), const sstring& s, const sstring& what) {

View File

@@ -154,6 +154,8 @@ public:
~impl() = default;
future<> init();
future<> stop();
const host_options& options() const {
return _options;
}
@@ -826,6 +828,11 @@ future<> encryption::kms_host::impl::init() {
_initialized = true;
}
future<> encryption::kms_host::impl::stop() {
co_await _attr_cache.stop();
co_await _id_cache.stop();
}
future<encryption::kms_host::impl::key_and_id_type> encryption::kms_host::impl::create_key(const attr_cache_key& k) {
auto& master_key = k.master_key;
auto& aws_assume_role_arn = k.aws_assume_role_arn;
@@ -988,6 +995,10 @@ future<> encryption::kms_host::init() {
return _impl->init();
}
future<> encryption::kms_host::stop() {
return _impl->stop();
}
const encryption::kms_host::host_options& encryption::kms_host::options() const {
return _impl->options();
}

View File

@@ -63,6 +63,8 @@ public:
~kms_host();
future<> init();
future<> stop();
const host_options& options() const;
struct option_override {

View File

@@ -335,18 +335,25 @@ void tablet_metadata::drop_tablet_map(table_id id) {
}
future<> tablet_metadata::clear_gently() {
for (auto&& [id, map] : _tablets) {
const auto shard = map.get_owner_shard();
co_await smp::submit_to(shard, [map = std::move(map)] () mutable {
auto map_ptr = map.release();
// Others copies exist, we simply drop ours, no need to clear anything.
if (map_ptr.use_count() > 1) {
return make_ready_future<>();
}
return const_cast<tablet_map&>(*map_ptr).clear_gently().finally([map_ptr = std::move(map_ptr)] { });
});
tablet_logger.debug("tablet_metadata::clear_gently {}", fmt::ptr(this));
// First, Sort the tablet maps per shard to avoid destruction of all foreign tablet map ptrs
// on this shard. We don't use sharded<> here since it will require a similar
// submit_to to each shard owner per tablet-map.
std::vector<std::vector<tablet_map_ptr>> tablet_maps_per_shard;
tablet_maps_per_shard.resize(smp::count);
for (auto& [_, map_ptr] : _tablets) {
tablet_maps_per_shard[map_ptr.get_owner_shard()].emplace_back(std::move(map_ptr));
}
_tablets.clear();
// Now destroy the foreign tablet map pointers on each shard.
co_await smp::invoke_on_all([&] -> future<> {
for (auto& map_ptr : tablet_maps_per_shard[this_shard_id()]) {
auto map = map_ptr.release();
co_await utils::clear_gently(map);
}
});
co_return;
}
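The clear_gently rewrite above first buckets the foreign tablet-map pointers by their owner shard, so each shard is visited once instead of doing one cross-shard submit_to per tablet map. A minimal sketch of that bucketing step in plain C++ (the shard count and owner-shard accessor are stand-ins for Seastar's smp::count and foreign_ptr::get_owner_shard; no Seastar involved):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct tablet_map_ptr {
    std::size_t owner_shard;  // stand-in for foreign_ptr::get_owner_shard()
    std::size_t get_owner_shard() const { return owner_shard; }
};

// Group pointers per owner shard so each shard is visited exactly once,
// instead of one cross-shard hop per tablet map.
std::vector<std::vector<tablet_map_ptr>>
bucket_by_shard(std::vector<tablet_map_ptr> maps, std::size_t shard_count) {
    std::vector<std::vector<tablet_map_ptr>> per_shard(shard_count);
    for (auto& m : maps) {
        per_shard[m.get_owner_shard()].push_back(std::move(m));
    }
    return per_shard;
}
```

In the real patch the per-shard buckets are then consumed inside a single smp::invoke_on_all, each shard releasing only its own pointers.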

View File

@@ -357,6 +357,7 @@ future<std::unique_ptr<token_metadata_impl>> token_metadata_impl::clone_only_tok
}
future<> token_metadata_impl::clear_gently() noexcept {
_version_tracker = {};
co_await utils::clear_gently(_token_to_endpoint_map);
co_await utils::clear_gently(_normal_token_owners);
co_await utils::clear_gently(_bootstrap_tokens);
@@ -834,16 +835,30 @@ token_metadata::token_metadata(std::unique_ptr<token_metadata_impl> impl)
{
}
token_metadata::token_metadata(config cfg)
: _impl(std::make_unique<token_metadata_impl>(cfg))
token_metadata::token_metadata(shared_token_metadata& stm, config cfg)
: _shared_token_metadata(&stm)
, _impl(std::make_unique<token_metadata_impl>(std::move(cfg)))
{
}
token_metadata::~token_metadata() = default;
token_metadata::~token_metadata() {
clear_and_dispose_impl();
}
token_metadata::token_metadata(token_metadata&&) noexcept = default;
token_metadata& token_metadata::token_metadata::operator=(token_metadata&&) noexcept = default;
token_metadata& token_metadata::token_metadata::operator=(token_metadata&& o) noexcept {
if (this != &o) {
clear_and_dispose_impl();
_shared_token_metadata = std::exchange(o._shared_token_metadata, nullptr);
_impl = std::exchange(o._impl, nullptr);
}
return *this;
}
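The new token_metadata move-assignment above disposes the current impl first, then steals the source's pointers with std::exchange so the moved-from object is left empty and the self-assignment case is a no-op. The same shape in miniature (the holder type and its int payload are hypothetical):

```cpp
#include <memory>
#include <utility>

struct holder {
    std::unique_ptr<int> impl;

    holder() = default;
    explicit holder(int v) : impl(std::make_unique<int>(v)) {}
    holder(holder&&) noexcept = default;

    holder& operator=(holder&& o) noexcept {
        if (this != &o) {
            impl.reset();                           // dispose our own state first
            impl = std::exchange(o.impl, nullptr);  // steal, leaving the source empty
        }
        return *this;
    }
};
```

The guard matters here precisely because the assignment has a destructive first step; a defaulted operator= could not run clear_and_dispose_impl() before taking ownership.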
void token_metadata::set_shared_token_metadata(shared_token_metadata& stm) {
_shared_token_metadata = &stm;
}
const std::vector<token>&
token_metadata::sorted_tokens() const {
@@ -1027,6 +1042,15 @@ token_metadata::clone_after_all_left() const noexcept {
co_return token_metadata(co_await _impl->clone_after_all_left());
}
void token_metadata::clear_and_dispose_impl() noexcept {
if (!_shared_token_metadata) {
return;
}
if (auto impl = std::exchange(_impl, nullptr)) {
_shared_token_metadata->clear_and_dispose(std::move(impl));
}
}
future<> token_metadata::clear_gently() noexcept {
return _impl->clear_gently();
}
@@ -1143,6 +1167,17 @@ version_tracker shared_token_metadata::new_tracker(token_metadata::version_t ver
return tracker;
}
future<> shared_token_metadata::stop() noexcept {
co_await _background_dispose_gate.close();
}
void shared_token_metadata::clear_and_dispose(std::unique_ptr<token_metadata_impl> impl) noexcept {
// Safe to drop the future since the gate is closed in stop()
if (auto gh = _background_dispose_gate.try_hold()) {
(void)impl->clear_gently().finally([i = std::move(impl), gh = std::move(gh)] {});
}
}
void shared_token_metadata::set(mutable_token_metadata_ptr tmptr) noexcept {
if (_shared->get_ring_version() >= tmptr->get_ring_version()) {
on_internal_error(tlogger, format("shared_token_metadata: must not set non-increasing ring_version: {} -> {}", _shared->get_ring_version(), tmptr->get_ring_version()));
@@ -1154,6 +1189,7 @@ void shared_token_metadata::set(mutable_token_metadata_ptr tmptr) noexcept {
_stale_versions_in_use = _versions_barrier.advance_and_await();
}
tmptr->set_shared_token_metadata(*this);
_shared = std::move(tmptr);
_shared->set_version_tracker(new_tracker(_shared->get_version()));
@@ -1216,7 +1252,7 @@ future<> shared_token_metadata::mutate_on_all_shards(sharded<shared_token_metada
std::vector<mutable_token_metadata_ptr> pending_token_metadata_ptr;
pending_token_metadata_ptr.resize(smp::count);
auto tmptr = make_token_metadata_ptr(co_await stm.local().get()->clone_async());
auto tmptr = stm.local().make_token_metadata_ptr(co_await stm.local().get()->clone_async());
auto& tm = *tmptr;
// bump the token_metadata ring_version
// to invalidate cached token/replication mappings
@@ -1227,7 +1263,7 @@ future<> shared_token_metadata::mutate_on_all_shards(sharded<shared_token_metada
// Apply the mutated token_metadata only after successfully cloning it on all shards.
pending_token_metadata_ptr[base_shard] = tmptr;
co_await smp::invoke_on_others(base_shard, [&] () -> future<> {
pending_token_metadata_ptr[this_shard_id()] = make_token_metadata_ptr(co_await tm.clone_async());
pending_token_metadata_ptr[this_shard_id()] = stm.local().make_token_metadata_ptr(co_await tm.clone_async());
});
co_await stm.invoke_on_all([&] (shared_token_metadata& stm) {

View File

@@ -47,7 +47,7 @@ class abstract_replication_strategy;
using token = dht::token;
class token_metadata;
class shared_token_metadata;
class tablet_metadata;
struct host_id_or_endpoint {
@@ -166,6 +166,7 @@ private:
};
class token_metadata final {
shared_token_metadata* _shared_token_metadata = nullptr;
std::unique_ptr<token_metadata_impl> _impl;
private:
friend class token_metadata_ring_splitter;
@@ -178,7 +179,7 @@ public:
using version_t = service::topology::version_t;
using version_tracker_t = version_tracker;
token_metadata(config cfg);
token_metadata(shared_token_metadata& stm, config cfg);
explicit token_metadata(std::unique_ptr<token_metadata_impl> impl);
token_metadata(token_metadata&&) noexcept; // Can't use "= default;" - hits some static_assert in unique_ptr
token_metadata& operator=(token_metadata&&) noexcept;
@@ -355,6 +356,11 @@ public:
friend class shared_token_metadata;
private:
void set_version_tracker(version_tracker_t tracker);
void set_shared_token_metadata(shared_token_metadata& stm);
// Clears and disposes the token metadata impl in the background, if present.
void clear_and_dispose_impl() noexcept;
};
struct topology_change_info {
@@ -371,12 +377,8 @@ struct topology_change_info {
using token_metadata_lock = semaphore_units<>;
using token_metadata_lock_func = noncopyable_function<future<token_metadata_lock>() noexcept>;
template <typename... Args>
mutable_token_metadata_ptr make_token_metadata_ptr(Args... args) {
return make_lw_shared<token_metadata>(std::forward<Args>(args)...);
}
class shared_token_metadata {
class shared_token_metadata : public peering_sharded_service<shared_token_metadata> {
named_gate _background_dispose_gate{"shared_token_metadata::background_dispose_gate"};
mutable_token_metadata_ptr _shared;
token_metadata_lock_func _lock_func;
std::chrono::steady_clock::duration _stall_detector_threshold = std::chrono::seconds(2);
@@ -408,7 +410,7 @@ public:
// used to construct the shared object as a sharded<> instance
// lock_func returns semaphore_units<>
explicit shared_token_metadata(token_metadata_lock_func lock_func, token_metadata::config cfg)
: _shared(make_token_metadata_ptr(std::move(cfg)))
: _shared(make_lw_shared<token_metadata>(*this, cfg))
, _lock_func(std::move(lock_func))
, _versions_barrier("shared_token_metadata::versions_barrier")
{
@@ -418,6 +420,17 @@ public:
shared_token_metadata(const shared_token_metadata& x) = delete;
shared_token_metadata(shared_token_metadata&& x) = default;
future<> stop() noexcept;
mutable_token_metadata_ptr make_token_metadata_ptr() {
return make_lw_shared<token_metadata>(*this, token_metadata::config{_shared->get_topology().get_config()});
}
mutable_token_metadata_ptr make_token_metadata_ptr(token_metadata&& tm) {
tm.set_shared_token_metadata(*this);
return make_lw_shared<token_metadata>(std::move(tm));
}
token_metadata_ptr get() const noexcept {
return _shared;
}
@@ -467,6 +480,8 @@ public:
// Must be called on shard 0.
static future<> mutate_on_all_shards(sharded<shared_token_metadata>& stm, seastar::noncopyable_function<future<> (token_metadata&)> func);
void clear_and_dispose(std::unique_ptr<token_metadata_impl> impl) noexcept;
private:
// for testing only, unsafe to be called without awaiting get_lock() first
void mutate_token_metadata_for_test(seastar::noncopyable_function<void (token_metadata&)> func);

View File

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:f69e30ac03713e439d4f9fe347aafe2201d8605880358d3142b6f6bc706c3014
size 5966816
oid sha256:0e0682133ded3055e64eb2f4224c3791528b8c3e3bcf492f1dafb8a43f25e50d
size 5996784

View File

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:9ec68edb2980fae1fcf63046b399f30b882fc7b77b4bc316c7055f75820d26f1
size 5975376
oid sha256:a3d21e0b2e84a8bc346bb974b3d827a012e9c3f0317d965f60eb7e254932cd0d
size 6013532

View File

@@ -1452,10 +1452,6 @@ future<std::optional<double>> repair::user_requested_repair_task_impl::expected_
co_return _ranges.size() * _cfs.size() * smp::count;
}
std::optional<double> repair::user_requested_repair_task_impl::expected_children_number() const {
return smp::count;
}
future<int> repair_start(seastar::sharded<repair_service>& repair, sharded<gms::gossip_address_map>& am,
sstring keyspace, std::unordered_map<sstring, sstring> options) {
return repair.invoke_on(0, [keyspace = std::move(keyspace), options = std::move(options), &am] (repair_service& local_repair) {
@@ -1624,10 +1620,6 @@ future<std::optional<double>> repair::data_sync_repair_task_impl::expected_total
co_return _cfs_size ? std::make_optional<double>(_ranges.size() * _cfs_size * smp::count) : std::nullopt;
}
std::optional<double> repair::data_sync_repair_task_impl::expected_children_number() const {
return smp::count;
}
future<> repair_service::bootstrap_with_repair(locator::token_metadata_ptr tmptr, std::unordered_set<dht::token> bootstrap_tokens) {
SCYLLA_ASSERT(this_shard_id() == 0);
return seastar::async([this, tmptr = std::move(tmptr), tokens = std::move(bootstrap_tokens)] () mutable {
@@ -2244,7 +2236,7 @@ future<> repair_service::replace_with_repair(std::unordered_map<sstring, locator
auto reason = streaming::stream_reason::replace;
// update a cloned version of tmptr
// no need to set the original version
auto cloned_tmptr = make_token_metadata_ptr(std::move(cloned_tm));
auto cloned_tmptr = _db.local().get_shared_token_metadata().make_token_metadata_ptr(std::move(cloned_tm));
cloned_tmptr->update_topology(tmptr->get_my_id(), myloc, locator::node::state::replacing);
co_await cloned_tmptr->update_normal_tokens(replacing_tokens, tmptr->get_my_id());
auto source_dc = utils::optional_param(myloc.dc);
@@ -2283,7 +2275,8 @@ future<> repair_service::repair_tablets(repair_uniq_id rid, sstring keyspace_nam
}
table_id tid = t->schema()->id();
// Invoke group0 read barrier before obtaining erm pointer so that it sees all prior metadata changes
auto dropped = co_await streaming::table_sync_and_check(_db.local(), _mm, tid);
auto dropped = !utils::get_local_injector().enter("repair_tablets_no_sync") &&
co_await streaming::table_sync_and_check(_db.local(), _mm, tid);
if (dropped) {
rlogger.debug("repair[{}] Table {}.{} does not exist anymore", rid.uuid(), keyspace_name, table_name);
continue;
@@ -2292,11 +2285,15 @@ future<> repair_service::repair_tablets(repair_uniq_id rid, sstring keyspace_nam
while (true) {
_repair_module->check_in_shutdown();
erm = t->get_effective_replication_map();
auto local_version = erm->get_token_metadata().get_version();
const locator::tablet_map& tmap = erm->get_token_metadata_ptr()->tablets().get_tablet_map(tid);
if (!tmap.has_transitions()) {
if (!tmap.has_transitions() && co_await container().invoke_on(0, [local_version] (repair_service& rs) {
// We need to ensure that there is no ongoing global request.
return local_version == rs._tsm.local()._topology.version && !rs._tsm.local()._topology.is_busy();
})) {
break;
}
rlogger.info("repair[{}] Table {}.{} has tablet transitions, waiting for topology to quiesce", rid.uuid(), keyspace_name, table_name);
rlogger.info("repair[{}] Topology is busy, waiting for it to quiesce", rid.uuid());
erm = nullptr;
co_await container().invoke_on(0, [] (repair_service& rs) {
return rs._tsm.local().await_not_busy();
@@ -2677,10 +2674,6 @@ future<std::optional<double>> repair::tablet_repair_task_impl::expected_total_wo
co_return sz ? std::make_optional<double>(sz) : std::nullopt;
}
std::optional<double> repair::tablet_repair_task_impl::expected_children_number() const {
return get_metas_size();
}
node_ops_cmd_category categorize_node_ops_cmd(node_ops_cmd cmd) noexcept {
switch (cmd) {
case node_ops_cmd::removenode_prepare:

View File

@@ -1448,7 +1448,9 @@ private:
size_t row_bytes = co_await get_repair_rows_size(row_list);
_metrics.tx_row_nr += row_list.size();
_metrics.tx_row_bytes += row_bytes;
for (repair_row& r : row_list) {
while (!row_list.empty()) {
repair_row r = std::move(row_list.front());
row_list.pop_front();
const auto& dk_with_hash = r.get_dk_with_hash();
// No need to search from the beginning of the rows. Looking at the end of repair_rows_on_wire is enough.
if (rows.empty()) {
@@ -2762,7 +2764,12 @@ private:
size_t get_max_row_buf_size(row_level_diff_detect_algorithm algo) {
// Max buffer size per repair round
return is_rpc_stream_supported(algo) ? repair::task_manager_module::max_repair_memory_per_range : 256 * 1024;
size_t size = is_rpc_stream_supported(algo) ? repair::task_manager_module::max_repair_memory_per_range : 256 * 1024;
if (_small_table_optimization) {
// For the small table optimization, we shrink the buffer size to limit memory consumption.
size /= _all_live_peer_nodes.size();
}
return size;
}
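With the small-table optimization the per-round buffer is divided by the number of live peers, so the total memory held across peers stays roughly constant. A condensed sketch of that arithmetic, with made-up constants standing in for repair::task_manager_module::max_repair_memory_per_range:

```cpp
#include <cstddef>

// Hypothetical constants; the real values live in repair::task_manager_module.
constexpr std::size_t max_memory_per_range = 16 * 1024 * 1024;
constexpr std::size_t legacy_buf_size = 256 * 1024;

std::size_t max_row_buf_size(bool rpc_stream, bool small_table_opt, std::size_t n_peers) {
    std::size_t size = rpc_stream ? max_memory_per_range : legacy_buf_size;
    if (small_table_opt && n_peers > 0) {
        size /= n_peers;  // split the budget across the live peer nodes
    }
    return size;
}
```

This also explains the memory-budget change further down: the semaphore request is computed from max_row_buf_size rather than the fixed per-range maximum.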
// Step A: Negotiate sync boundary to use
@@ -3096,7 +3103,7 @@ public:
auto& mem_sem = _shard_task.rs.memory_sem();
auto max = _shard_task.rs.max_repair_memory();
auto wanted = (_all_live_peer_nodes.size() + 1) * repair::task_manager_module::max_repair_memory_per_range;
auto wanted = (_all_live_peer_nodes.size() + 1) * max_row_buf_size;
wanted = std::min(max, wanted);
rlogger.trace("repair[{}]: Started to get memory budget, wanted={}, available={}, max_repair_memory={}",
_shard_task.global_repair_id.uuid(), wanted, mem_sem.current(), max);

View File

@@ -74,7 +74,6 @@ protected:
future<> run() override;
virtual future<std::optional<double>> expected_total_workload() const override;
virtual std::optional<double> expected_children_number() const override;
};
class data_sync_repair_task_impl : public repair_task_impl {
@@ -103,7 +102,6 @@ protected:
future<> run() override;
virtual future<std::optional<double>> expected_total_workload() const override;
virtual std::optional<double> expected_children_number() const override;
};
class tablet_repair_task_impl : public repair_task_impl {
@@ -145,7 +143,6 @@ protected:
future<> run() override;
virtual future<std::optional<double>> expected_total_workload() const override;
virtual std::optional<double> expected_children_number() const override;
};
class shard_repair_task_impl : public repair_task_impl {

View File

@@ -355,7 +355,7 @@ database::view_update_read_concurrency_sem() {
return *sem;
}
database::database(const db::config& cfg, database_config dbcfg, service::migration_notifier& mn, gms::feature_service& feat, const locator::shared_token_metadata& stm,
database::database(const db::config& cfg, database_config dbcfg, service::migration_notifier& mn, gms::feature_service& feat, locator::shared_token_metadata& stm,
compaction_manager& cm, sstables::storage_manager& sstm, lang::manager& langm, sstables::directory_semaphore& sst_dir_sem, sstable_compressor_factory& scf, const abort_source& abort, utils::cross_shard_barrier barrier)
: _stats(make_lw_shared<db_stats>())
, _user_types(std::make_shared<db_user_types_storage>(*this))

View File

@@ -1599,7 +1599,7 @@ private:
service::migration_notifier& _mnotifier;
gms::feature_service& _feat;
std::vector<std::any> _listeners;
const locator::shared_token_metadata& _shared_token_metadata;
locator::shared_token_metadata& _shared_token_metadata;
lang::manager& _lang_manager;
reader_concurrency_semaphore_group _reader_concurrency_semaphores_group;
@@ -1684,7 +1684,7 @@ public:
// (keyspace/table definitions, column mappings etc.)
future<> parse_system_tables(distributed<service::storage_proxy>&, sharded<db::system_keyspace>&);
database(const db::config&, database_config dbcfg, service::migration_notifier& mn, gms::feature_service& feat, const locator::shared_token_metadata& stm,
database(const db::config&, database_config dbcfg, service::migration_notifier& mn, gms::feature_service& feat, locator::shared_token_metadata& stm,
compaction_manager& cm, sstables::storage_manager& sstm, lang::manager& langm, sstables::directory_semaphore& sst_dir_sem, sstable_compressor_factory&, const abort_source& abort,
utils::cross_shard_barrier barrier = utils::cross_shard_barrier(utils::cross_shard_barrier::solo{}) /* for single-shard usage */);
database(database&&) = delete;
@@ -1719,7 +1719,7 @@ public:
return _compaction_manager;
}
const locator::shared_token_metadata& get_shared_token_metadata() const { return _shared_token_metadata; }
locator::shared_token_metadata& get_shared_token_metadata() const { return _shared_token_metadata; }
locator::token_metadata_ptr get_token_metadata_ptr() const { return _shared_token_metadata.get(); }
const locator::token_metadata& get_token_metadata() const { return *_shared_token_metadata.get(); }

View File

@@ -2737,7 +2737,17 @@ void tablet_storage_group_manager::update_effective_replication_map(const locato
_storage_groups[tid.value()] = allocate_storage_group(*new_tablet_map, tid, std::move(range));
tablet_migrating_in = true;
} else if (_storage_groups.contains(tid.value()) && locator::is_post_cleanup(this_replica, new_tablet_map->get_tablet_info(tid), transition_info)) {
// The storage group should be cleaned up and stopped at this point, usually by the tablet cleanup stage,
// unless the storage group was allocated after tablet cleanup was completed for this node. This could
// happen if the node was restarted after tablet cleanup was run but before moving to the next stage. To
// handle this case, we stop the storage group here if it's not stopped already.
auto sg = _storage_groups[tid.value()];
remove_storage_group(tid.value());
(void) with_gate(_t.async_gate(), [sg] {
return sg->stop("tablet post-cleanup").then([sg] {});
});
}
}

View File

@@ -999,6 +999,8 @@ class managed_bytes:
inf = gdb.selected_inferior()
def to_bytes(data, size):
if size == 0:
return b''
return bytes(inf.read_memory(data, size))
if self.is_inline():

View File

@@ -56,7 +56,9 @@ migration_manager::migration_manager(migration_notifier& notifier, gms::feature_
, _group0_barrier(this_shard_id() == 0 ?
std::function<future<>()>([this] () -> future<> {
// This will run raft barrier and will sync schema with the leader
(void)co_await start_group0_operation();
return with_scheduling_group(_storage_proxy.get_db().local().get_gossip_scheduling_group(), [this] {
return start_group0_operation().discard_result();
});
}) :
std::function<future<>()>([this] () -> future<> {
co_await container().invoke_on(0, [] (migration_manager& mm) -> future<> {

View File

@@ -414,12 +414,6 @@ future<group0_info> persistent_discovery::run(
}
future<> raft_group0::abort() {
if (_aborted) {
co_return;
}
_aborted = true;
group0_log.debug("Raft group0 service is aborting...");
co_await smp::invoke_on_all([this]() {
return uninit_rpc_verbs(_ms.local());
});
@@ -431,8 +425,6 @@ future<> raft_group0::abort() {
co_await std::move(_leadership_monitor);
co_await stop_group0();
group0_log.debug("Raft group0 service is aborted");
}
future<> raft_group0::start_server_for_group0(raft::group_id group0_id, service::storage_service& ss, cql3::query_processor& qp, service::migration_manager& mm, bool topology_change_enabled) {

View File

@@ -133,7 +133,6 @@ class raft_group0 {
future<> _leadership_monitor = make_ready_future<>();
abort_source _leadership_monitor_as;
utils::updateable_value_source<bool> _leadership_observable;
bool _aborted = false;
public:
// Passed to `setup_group0` when replacing a node.

View File

@@ -600,6 +600,8 @@ private:
++p->get_stats().received_mutations;
p->get_stats().forwarded_mutations += forward_host_id.size();
co_await utils::get_local_injector().inject("storage_proxy_write_response_pause", utils::wait_for_message(5min));
if (auto stale = _sp.apply_fence(fence, src_addr)) {
errors.count += (forward_host_id.size() + 1);
errors.local = std::move(*stale);
@@ -1101,26 +1103,23 @@ private:
global_request_id = guard.new_group0_state_id();
std::vector<canonical_mutation> updates;
topology_mutation_builder builder(guard.write_timestamp());
topology_request_tracking_mutation_builder trbuilder(global_request_id, _sp._features.topology_requests_type_column);
trbuilder.set_truncate_table_data(table_id)
.set("done", false)
.set("start_time", db_clock::now());
if (!_sp._features.topology_global_request_queue) {
builder.set_global_topology_request(global_topology_request::truncate_table)
.set_global_topology_request_id(global_request_id);
} else {
builder.queue_global_topology_request_id(global_request_id);
trbuilder.set("request_type", global_topology_request::truncate_table);
}
updates.emplace_back(builder.build());
updates.emplace_back(topology_request_tracking_mutation_builder(global_request_id, _sp._features.topology_requests_type_column)
.set_truncate_table_data(table_id)
.set("done", false)
.set("start_time", db_clock::now())
.set("request_type", global_topology_request::truncate_table)
.build());
slogger.info("Creating TRUNCATE global topology request for table {}.{}", ks_name, cf_name);
topology_change change{std::move(updates)};
topology_change change{{builder.build(), trbuilder.build()}};
sstring reason = "Truncating table";
group0_command g0_cmd = _group0_client.prepare_command(std::move(change), guard, reason);
try {
@@ -1615,6 +1614,10 @@ public:
return _type == db::write_type::VIEW;
}
bool is_batch() const noexcept {
return _type == db::write_type::BATCH;
}
void set_cdc_operation_result_tracker(lw_shared_ptr<cdc::operation_result_tracker> tracker) {
_cdc_operation_result_tracker = std::move(tracker);
}
@@ -2120,7 +2123,7 @@ paxos_response_handler::begin_and_repair_paxos(client_state& cs, unsigned& conte
// create_write_response_handler is overloaded for paxos::proposal and will
// create cas_mutation holder, which consequently will ensure paxos::learn is
// used.
auto f = _proxy->mutate_internal(std::move(m), db::consistency_level::ANY, false, tr_state, _permit, _timeout)
auto f = _proxy->mutate_internal(std::move(m), db::consistency_level::ANY, tr_state, _permit, _timeout)
.then(utils::result_into_future<result<>>);
// TODO: provided commits did not invalidate the prepare we just did above (which they
@@ -2472,7 +2475,7 @@ future<> paxos_response_handler::learn_decision(lw_shared_ptr<paxos::proposal> d
return v.schema()->id() == base_tbl_id;
});
if (!mutations.empty()) {
f_cdc = _proxy->mutate_internal(std::move(mutations), _cl_for_learn, false, tr_state, _permit, _timeout, std::move(tracker))
f_cdc = _proxy->mutate_internal(std::move(mutations), _cl_for_learn, tr_state, _permit, _timeout, {}, std::move(tracker))
.then(utils::result_into_future<result<>>);
}
}
@@ -2480,7 +2483,7 @@ future<> paxos_response_handler::learn_decision(lw_shared_ptr<paxos::proposal> d
// Path for the "base" mutations
std::array<std::tuple<lw_shared_ptr<paxos::proposal>, schema_ptr, shared_ptr<paxos_response_handler>, dht::token>, 1> m{std::make_tuple(std::move(decision), _schema, shared_from_this(), _key.token())};
future<> f_lwt = _proxy->mutate_internal(std::move(m), _cl_for_learn, false, tr_state, _permit, _timeout)
future<> f_lwt = _proxy->mutate_internal(std::move(m), _cl_for_learn, tr_state, _permit, _timeout)
.then(utils::result_into_future<result<>>);
co_await when_all_succeed(std::move(f_cdc), std::move(f_lwt)).discard_result();
@@ -3071,6 +3074,10 @@ struct hint_wrapper {
mutation mut;
};
struct batchlog_replay_mutation {
mutation mut;
};
struct read_repair_mutation {
std::unordered_map<locator::host_id, std::optional<mutation>> value;
locator::effective_replication_map_ptr ermp;
@@ -3084,6 +3091,12 @@ template <> struct fmt::formatter<service::hint_wrapper> : fmt::formatter<string
}
};
template <> struct fmt::formatter<service::batchlog_replay_mutation> : fmt::formatter<string_view> {
auto format(const service::batchlog_replay_mutation& h, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "batchlog_replay_mutation{{{}}}", h.mut);
}
};
template <>
struct fmt::formatter<service::read_repair_mutation> : fmt::formatter<string_view> {
auto format(const service::read_repair_mutation& m, fmt::format_context& ctx) const {
@@ -3449,6 +3462,12 @@ storage_proxy::create_write_response_handler(const hint_wrapper& h, db::consiste
std::move(permit), allow_limit, is_cancellable::yes);
}
result<storage_proxy::response_id_type>
storage_proxy::create_write_response_handler(const batchlog_replay_mutation& m, db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit, db::allow_per_partition_rate_limit allow_limit) {
return create_write_response_handler_helper(m.mut.schema(), m.mut.token(), std::make_unique<shared_mutation>(m.mut), cl, type, tr_state,
std::move(permit), allow_limit, is_cancellable::yes);
}
result<storage_proxy::response_id_type>
storage_proxy::create_write_response_handler(const read_repair_mutation& mut, db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit, db::allow_per_partition_rate_limit allow_limit) {
host_id_vector_replica_set endpoints;
@@ -3843,7 +3862,7 @@ future<result<>> storage_proxy::do_mutate(std::vector<mutation> mutations, db::c
}).begin();
return seastar::when_all_succeed(
mutate_counters(std::ranges::subrange(mutations.begin(), mid), cl, tr_state, permit, timeout),
mutate_internal(std::ranges::subrange(mid, mutations.end()), cl, false, tr_state, permit, timeout, std::move(cdc_tracker), allow_limit)
mutate_internal(std::ranges::subrange(mid, mutations.end()), cl, tr_state, permit, timeout, {}, std::move(cdc_tracker), allow_limit)
).then([] (std::tuple<result<>> res) {
// For now, only mutate_internal returns a result<>
return std::get<0>(std::move(res));
@@ -3852,8 +3871,10 @@ future<result<>> storage_proxy::do_mutate(std::vector<mutation> mutations, db::c
future<> storage_proxy::replicate_counter_from_leader(mutation m, db::consistency_level cl, tracing::trace_state_ptr tr_state,
clock_type::time_point timeout, service_permit permit) {
// we need to pass the correct db::write_type in case of a timeout so that
// the client doesn't attempt to retry the request.
// FIXME: do not send the mutation to itself, it has already been applied (it is not incorrect to do so, though)
return mutate_internal(std::array<mutation, 1>{std::move(m)}, cl, true, std::move(tr_state), std::move(permit), timeout)
return mutate_internal(std::array<mutation, 1>{std::move(m)}, cl, std::move(tr_state), std::move(permit), timeout, db::write_type::COUNTER)
.then(utils::result_into_future<result<>>);
}
@@ -3864,8 +3885,8 @@ future<> storage_proxy::replicate_counter_from_leader(mutation m, db::consistenc
*/
template<typename Range>
future<result<>>
storage_proxy::mutate_internal(Range mutations, db::consistency_level cl, bool counters, tracing::trace_state_ptr tr_state, service_permit permit,
std::optional<clock_type::time_point> timeout_opt, lw_shared_ptr<cdc::operation_result_tracker> cdc_tracker,
storage_proxy::mutate_internal(Range mutations, db::consistency_level cl, tracing::trace_state_ptr tr_state, service_permit permit,
std::optional<clock_type::time_point> timeout_opt, std::optional<db::write_type> type_opt, lw_shared_ptr<cdc::operation_result_tracker> cdc_tracker,
db::allow_per_partition_rate_limit allow_limit) {
if (std::ranges::empty(mutations)) {
return make_ready_future<result<>>(bo::success());
@@ -3874,12 +3895,10 @@ storage_proxy::mutate_internal(Range mutations, db::consistency_level cl, bool c
slogger.trace("mutate cl={}", cl);
mlogger.trace("mutations={}", mutations);
// If counters is set it means that we are replicating counter shards. There
// is no need for special handling anymore, since the leader has already
// done its job, but we need to return correct db::write_type in case of
// a timeout so that client doesn't attempt to retry the request.
auto type = counters ? db::write_type::COUNTER
: (std::next(std::begin(mutations)) == std::end(mutations) ? db::write_type::SIMPLE : db::write_type::UNLOGGED_BATCH);
// The parameter type_opt allows passing a specific type when needed for
// special handling, e.g. counters. Otherwise, a default type is used.
auto type = type_opt.value_or(std::next(std::begin(mutations)) == std::end(mutations) ? db::write_type::SIMPLE : db::write_type::UNLOGGED_BATCH);
utils::latency_counter lc;
lc.start();
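The hunk above replaces the boolean counters flag with an optional write type: callers that need special handling (counters, batchlog replay) pass one explicitly, and everyone else gets a default inferred from the mutation count via std::optional::value_or. A condensed sketch of that selection (enum and helper names are illustrative, not the real ScyllaDB declarations):

```cpp
#include <cstddef>
#include <optional>

enum class write_type { SIMPLE, UNLOGGED_BATCH, COUNTER, BATCH };

// Pick the explicit type when provided; otherwise infer from the mutation count.
write_type select_write_type(std::optional<write_type> type_opt, std::size_t n_mutations) {
    return type_opt.value_or(n_mutations == 1 ? write_type::SIMPLE
                                              : write_type::UNLOGGED_BATCH);
}
```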
@@ -4065,6 +4084,7 @@ storage_proxy::mutate_atomically_result(std::vector<mutation> mutations, db::con
};
future<> async_remove_from_batchlog() {
// delete batch
utils::get_local_injector().inject("storage_proxy_fail_remove_from_batchlog", [] { throw std::runtime_error("Error injection: failing remove from batchlog"); });
auto key = partition_key::from_exploded(*_schema, {uuid_type->decompose(_batch_uuid)});
auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
mutation m(_schema, key);
@@ -4136,13 +4156,15 @@ mutation storage_proxy::do_get_batchlog_mutation_for(schema_ptr schema, const st
for (auto& m : fm) {
ser::serialize(out, m);
}
return to_bytes(out.linearize());
return std::move(out).to_managed_bytes();
}();
mutation m(schema, key);
m.set_cell(clustering_key_prefix::make_empty(), to_bytes("version"), version, timestamp);
m.set_cell(clustering_key_prefix::make_empty(), to_bytes("written_at"), now, timestamp);
m.set_cell(clustering_key_prefix::make_empty(), to_bytes("data"), data_value(std::move(data)), timestamp);
// Avoid going through data_value and therefore `bytes`, as it can be large (#24809).
auto cdef_data = schema->get_column_definition(to_bytes("data"));
m.set_cell(clustering_key_prefix::make_empty(), *cdef_data, atomic_cell::make_live(*cdef_data->type, timestamp, std::move(data)));
return m;
}
@@ -4248,7 +4270,16 @@ future<> storage_proxy::send_hint_to_endpoint(frozen_mutation_and_schema fm_a_s,
future<> storage_proxy::send_hint_to_all_replicas(frozen_mutation_and_schema fm_a_s) {
std::array<hint_wrapper, 1> ms{hint_wrapper { fm_a_s.fm.unfreeze(fm_a_s.s) }};
return mutate_internal(std::move(ms), db::consistency_level::ALL, false, nullptr, empty_service_permit())
return mutate_internal(std::move(ms), db::consistency_level::ALL, nullptr, empty_service_permit())
.then(utils::result_into_future<result<>>);
}
future<> storage_proxy::send_batchlog_replay_to_all_replicas(std::vector<mutation> mutations, clock_type::time_point timeout) {
std::vector<batchlog_replay_mutation> ms = mutations | std::views::transform([] (auto&& m) {
return batchlog_replay_mutation(std::move(m));
}) | std::ranges::to<std::vector<batchlog_replay_mutation>>();
return mutate_internal(std::move(ms), db::consistency_level::ALL, nullptr, empty_service_permit(), timeout, db::write_type::BATCH)
.then(utils::result_into_future<result<>>);
}
@@ -4431,7 +4462,7 @@ future<result<>> storage_proxy::schedule_repair(locator::effective_replication_m
std::views::transform([ermp] (auto& v) { return read_repair_mutation{std::move(v), ermp}; }) |
// The transform above is destructive, materialize into a vector to make the range re-iterable.
std::ranges::to<std::vector<read_repair_mutation>>()
, cl, false, std::move(trace_state), std::move(permit));
, cl, std::move(trace_state), std::move(permit));
}
class abstract_read_resolver {
@@ -6953,7 +6984,7 @@ future<> storage_proxy::drain_on_shutdown() {
//NOTE: the thread is spawned here because there are delicate lifetime issues to consider
// and writing them down with plain futures is error-prone.
return async([this] {
cancel_write_handlers([] (const abstract_write_response_handler&) { return true; });
cancel_all_write_response_handlers().get();
_hints_resource_manager.stop().get();
});
}
@@ -6964,6 +6995,12 @@ future<> storage_proxy::abort_view_writes() {
});
}
future<> storage_proxy::abort_batch_writes() {
return async([this] {
cancel_write_handlers([] (const abstract_write_response_handler& handler) { return handler.is_batch(); });
});
}
future<>
storage_proxy::stop() {
return make_ready_future<>();
@@ -6977,4 +7014,13 @@ future<utils::chunked_vector<dht::token_range_endpoints>> storage_proxy::describ
return locator::describe_ring(_db.local(), _remote->gossiper(), keyspace, include_only_local_dc);
}
future<> storage_proxy::cancel_all_write_response_handlers() {
while (!_response_handlers.empty()) {
_response_handlers.begin()->second->timeout_cb();
if (!_response_handlers.empty()) {
co_await maybe_yield();
}
}
}
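`cancel_all_write_response_handlers()` above works because each `timeout_cb()` unregisters its handler from `_response_handlers`, so re-reading `begin()` on every iteration always picks a still-live handler, while `maybe_yield()` keeps the loop preemptible. A minimal plain-C++ sketch of that shape (a `std::map` standing in for the handler map, no seastar):

```cpp
#include <map>
#include <string>

// Simplified stand-in for storage_proxy::_response_handlers.
using handler_map = std::map<int, std::string>;

// Shape of cancel_all_write_response_handlers(): each iteration "fires"
// the first remaining handler, which removes itself from the map; the
// real coroutine also co_awaits maybe_yield() between iterations so a
// large cancellation does not monopolize the reactor.
int cancel_all(handler_map& handlers) {
    int fired = 0;
    while (!handlers.empty()) {
        // the real timeout_cb() completes the write with a timeout error
        // and unregisters the handler; here that is just an erase
        handlers.erase(handlers.begin());
        ++fired;
        // co_await maybe_yield();  // cooperative yield point
    }
    return fired;
}
```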
}

View File

@@ -87,6 +87,7 @@ class mutation_holder;
class client_state;
class migration_manager;
struct hint_wrapper;
struct batchlog_replay_mutation;
struct read_repair_mutation;
using replicas_per_token_range = std::unordered_map<dht::token_range, std::vector<locator::host_id>>;
@@ -340,6 +341,7 @@ private:
const host_id_vector_topology_change& pending_endpoints, host_id_vector_topology_change, tracing::trace_state_ptr tr_state, storage_proxy::write_stats& stats, service_permit permit, db::per_partition_rate_limit::info rate_limit_info, is_cancellable);
result<response_id_type> create_write_response_handler(const mutation&, db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit, db::allow_per_partition_rate_limit allow_limit);
result<response_id_type> create_write_response_handler(const hint_wrapper&, db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit, db::allow_per_partition_rate_limit allow_limit);
result<response_id_type> create_write_response_handler(const batchlog_replay_mutation&, db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit, db::allow_per_partition_rate_limit allow_limit);
result<response_id_type> create_write_response_handler(const read_repair_mutation&, db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit, db::allow_per_partition_rate_limit allow_limit);
result<response_id_type> create_write_response_handler(const std::tuple<lw_shared_ptr<paxos::proposal>, schema_ptr, shared_ptr<paxos_response_handler>, dht::token>& proposal,
db::consistency_level cl, db::write_type type, tracing::trace_state_ptr tr_state, service_permit permit, db::allow_per_partition_rate_limit allow_limit);
@@ -427,7 +429,7 @@ private:
void unthrottle();
void handle_read_error(std::variant<exceptions::coordinator_exception_container, std::exception_ptr> failure, bool range);
template<typename Range>
future<result<>> mutate_internal(Range mutations, db::consistency_level cl, bool counter_write, tracing::trace_state_ptr tr_state, service_permit permit, std::optional<clock_type::time_point> timeout_opt = { }, lw_shared_ptr<cdc::operation_result_tracker> cdc_tracker = { }, db::allow_per_partition_rate_limit allow_limit = db::allow_per_partition_rate_limit::no);
future<result<>> mutate_internal(Range mutations, db::consistency_level cl, tracing::trace_state_ptr tr_state, service_permit permit, std::optional<clock_type::time_point> timeout_opt = { }, std::optional<db::write_type> type = { }, lw_shared_ptr<cdc::operation_result_tracker> cdc_tracker = { }, db::allow_per_partition_rate_limit allow_limit = db::allow_per_partition_rate_limit::no);
future<rpc::tuple<foreign_ptr<lw_shared_ptr<reconcilable_result>>, cache_temperature>> query_nonsingular_mutations_locally(
schema_ptr s, lw_shared_ptr<query::read_command> cmd, const dht::partition_range_vector&& pr, tracing::trace_state_ptr trace_state,
clock_type::time_point timeout);
@@ -521,6 +523,8 @@ public:
bool is_me(gms::inet_address addr) const noexcept;
bool is_me(const locator::effective_replication_map& erm, locator::host_id id) const noexcept;
future<> cancel_all_write_response_handlers();
private:
bool only_me(const locator::effective_replication_map& erm, const host_id_vector_replica_set& replicas) const noexcept;
@@ -631,6 +635,8 @@ public:
future<> send_hint_to_all_replicas(frozen_mutation_and_schema fm_a_s);
future<> send_batchlog_replay_to_all_replicas(std::vector<mutation> mutations, clock_type::time_point timeout);
// Send a mutation to one specific remote target.
// Inspired by Cassandra's StorageProxy.sendToHintedEndpoints but without
// hinted handoff support, and just one target. See also
@@ -705,6 +711,7 @@ public:
void allow_replaying_hints() noexcept;
future<> drain_hints_for_left_nodes();
future<> abort_view_writes();
future<> abort_batch_writes();
future<> change_hints_host_filter(db::hints::host_filter new_filter);
const db::hints::host_filter& get_hints_host_filter() const;

View File

@@ -111,7 +111,6 @@
#include "node_ops/task_manager_module.hh"
#include "service/task_manager_module.hh"
#include "service/topology_mutation.hh"
#include "service/topology_coordinator.hh"
#include "cql3/query_processor.hh"
#include "service/qos/service_level_controller.hh"
#include "service/qos/standard_service_level_distributed_data_accessor.hh"
@@ -740,9 +739,7 @@ future<> storage_service::topology_state_load(state_change_hint hint) {
auto saved_tmpr = get_token_metadata_ptr();
{
auto tmlock = co_await get_token_metadata_lock();
auto tmptr = make_token_metadata_ptr(token_metadata::config {
get_token_metadata().get_topology().get_config()
});
auto tmptr = _shared_token_metadata.make_token_metadata_ptr();
tmptr->invalidate_cached_rings();
tmptr->set_version(_topology_state_machine._topology.version);
@@ -817,10 +814,6 @@ future<> storage_service::topology_state_load(state_change_hint hint) {
for (const auto& gen_id : _topology_state_machine._topology.committed_cdc_generations) {
rtlogger.trace("topology_state_load: process committed cdc generation {}", gen_id);
co_await utils::get_local_injector().inject("topology_state_load_before_update_cdc", [](auto& handler) -> future<> {
rtlogger.info("topology_state_load_before_update_cdc hit, wait for message");
co_await handler.wait_for_message(db::timeout_clock::now() + std::chrono::minutes(5));
});
co_await _cdc_gens.local().handle_cdc_generation(gen_id);
if (gen_id == _topology_state_machine._topology.committed_cdc_generations.back()) {
co_await _sys_ks.local().update_cdc_generation_id(gen_id);
@@ -1134,7 +1127,8 @@ future<> storage_service::raft_state_monitor_fiber(raft::server& raft, gate::hol
_tablet_allocator.local(),
get_ring_delay(),
_lifecycle_notifier,
_feature_service);
_feature_service,
_topology_cmd_rpc_tracker);
}
} catch (...) {
rtlogger.info("raft_state_monitor_fiber aborted with {}", std::current_exception());
@@ -3146,9 +3140,10 @@ future<> storage_service::replicate_to_all_cores(mutable_token_metadata_ptr tmpt
try {
auto base_shard = this_shard_id();
pending_token_metadata_ptr[base_shard] = tmptr;
auto& sharded_token_metadata = _shared_token_metadata.container();
// clone a local copy of updated token_metadata on all other shards
co_await smp::invoke_on_others(base_shard, [&, tmptr] () -> future<> {
pending_token_metadata_ptr[this_shard_id()] = make_token_metadata_ptr(co_await tmptr->clone_async());
pending_token_metadata_ptr[this_shard_id()] = sharded_token_metadata.local().make_token_metadata_ptr(co_await tmptr->clone_async());
});
// Precalculate new effective_replication_map for all keyspaces
@@ -4701,17 +4696,13 @@ future<> storage_service::drain() {
}
future<> storage_service::do_drain() {
// Need to stop transport before group0, otherwise RPCs may fail with raft_group_not_found.
co_await stop_transport();
// group0 persistence relies on local storage, so we need to stop group0 first.
// This must be kept in sync with defer_verbose_shutdown for group0 in main.cc to
// handle the case when initialization fails before reaching drain_on_shutdown for ss.
_sl_controller.local().abort_group0_operations();
// Drain view builder before group0, because the view builder uses group0 to coordinate view building.
// Drain after transport is stopped, because view_builder::drain aborts view writes for user writes as well.
co_await _view_builder.invoke_on_all(&db::view::view_builder::drain);
co_await wait_for_group0_stop();
if (_group0) {
co_await _group0->abort();
}
co_await tracing::tracing::tracing_instance().invoke_on_all(&tracing::tracing::shutdown);
@@ -4719,7 +4710,6 @@ future<> storage_service::do_drain() {
return bm.drain();
});
co_await _view_builder.invoke_on_all(&db::view::view_builder::drain);
co_await _db.invoke_on_all(&replica::database::drain);
co_await _sys_ks.invoke_on_all(&db::system_keyspace::shutdown);
co_await _repair.invoke_on_all(&repair_service::shutdown);
@@ -5747,7 +5737,7 @@ future<> storage_service::snitch_reconfigured() {
future<raft_topology_cmd_result> storage_service::raft_topology_cmd_handler(raft::term_t term, uint64_t cmd_index, const raft_topology_cmd& cmd) {
raft_topology_cmd_result result;
rtlogger.debug("topology cmd rpc {} is called", cmd.cmd);
rtlogger.info("topology cmd rpc {} is called index={}", cmd.cmd, cmd_index);
try {
auto& raft_server = _group0->group0_server();
@@ -5816,6 +5806,7 @@ future<raft_topology_cmd_result> storage_service::raft_topology_cmd_handler(raft
}
break;
case raft_topology_cmd::command::barrier_and_drain: {
co_await utils::get_local_injector().inject("pause_before_barrier_and_drain", utils::wait_for_message(std::chrono::minutes(5)));
if (_topology_state_machine._topology.tstate == topology::transition_state::write_both_read_old) {
for (auto& n : _topology_state_machine._topology.transition_nodes) {
if (!_address_map.find(locator::host_id{n.first.uuid()})) {
@@ -6077,6 +6068,9 @@ future<raft_topology_cmd_result> storage_service::raft_topology_cmd_handler(raft
} catch (...) {
rtlogger.error("raft_topology_cmd {} failed with: {}", cmd.cmd, std::current_exception());
}
rtlogger.info("topology cmd rpc {} completed with status={} index={}",
cmd.cmd, (result.status == raft_topology_cmd_result::command_status::success) ? "succeeded" : "failed", cmd_index);
co_return result;
}

View File

@@ -48,6 +48,7 @@
#include "timestamp.hh"
#include "utils/user_provided_param.hh"
#include "utils/sequenced_set.hh"
#include "service/topology_coordinator.hh"
class node_ops_cmd_request;
class node_ops_cmd_response;
@@ -282,12 +283,12 @@ private:
future<> snitch_reconfigured();
future<mutable_token_metadata_ptr> get_mutable_token_metadata_ptr() noexcept {
return _shared_token_metadata.get()->clone_async().then([] (token_metadata tm) {
return _shared_token_metadata.get()->clone_async().then([this] (token_metadata tm) {
// bump the token_metadata ring_version
// to invalidate cached token/replication mappings
// when the modified token_metadata is committed.
tm.invalidate_cached_rings();
return make_ready_future<mutable_token_metadata_ptr>(make_token_metadata_ptr(std::move(tm)));
return _shared_token_metadata.make_token_metadata_ptr(std::move(tm));
});
}
@@ -873,6 +874,11 @@ private:
std::optional<shared_future<>> _rebuild_result;
std::unordered_map<raft::server_id, std::optional<shared_future<>>> _remove_result;
tablet_op_registry _tablet_ops;
// This tracks the active topology cmd RPC. There can be only one active
// cmd running; by inspecting this structure one can check which cmd is
// currently executing and which nodes have not replied yet.
// Needed for debugging.
topology_coordinator_cmd_rpc_tracker _topology_cmd_rpc_tracker;
struct {
raft::term_t term{0};
uint64_t last_index{0};
@@ -941,6 +947,10 @@ public:
// Waits for topology state in which none of tablets has replaced_id as a replica.
// Must be called on shard 0.
future<> await_tablets_rebuilt(raft::server_id replaced_id);
topology_coordinator_cmd_rpc_tracker get_topology_cmd_status() {
return _topology_cmd_rpc_tracker;
}
private:
// Tracks progress of the upgrade to topology coordinator.
future<> _upgrade_to_topology_coordinator_fiber = make_ready_future<>();

View File

@@ -842,7 +842,7 @@ public:
db_clock::duration repair_time_diff;
};
std::vector<repair_plan> plans;
utils::chunked_vector<repair_plan> plans;
auto migration_tablet_ids = co_await mplan.get_migration_tablet_ids();
for (auto&& [table, tmap_] : _tm->tablets().all_tables()) {
auto& tmap = *tmap_;

View File

@@ -147,6 +147,8 @@ class topology_coordinator : public endpoint_lifecycle_subscriber {
group0_voter_handler _voter_handler;
topology_coordinator_cmd_rpc_tracker& _topology_cmd_rpc_tracker;
const locator::token_metadata& get_token_metadata() const noexcept {
return *_shared_tm.get();
}
@@ -389,6 +391,9 @@ class topology_coordinator : public endpoint_lifecycle_subscriber {
future<> exec_direct_command_helper(raft::server_id id, uint64_t cmd_index, const raft_topology_cmd& cmd) {
rtlogger.debug("send {} command with term {} and index {} to {}",
cmd.cmd, _term, cmd_index, id);
_topology_cmd_rpc_tracker.active_dst.emplace(id);
auto _ = seastar::defer([this, id] { _topology_cmd_rpc_tracker.active_dst.erase(id); });
auto result = _db.get_token_metadata().get_topology().is_me(to_host_id(id)) ?
co_await _raft_topology_cmd_handler(_term, cmd_index, cmd) :
co_await ser::storage_service_rpc_verbs::send_raft_topology_cmd(
@@ -403,12 +408,16 @@ class topology_coordinator : public endpoint_lifecycle_subscriber {
auto id = node.id;
release_node(std::move(node));
const auto cmd_index = ++_last_cmd_index;
_topology_cmd_rpc_tracker.current = cmd.cmd;
_topology_cmd_rpc_tracker.index = cmd_index;
co_await exec_direct_command_helper(id, cmd_index, cmd);
co_return retake_node(co_await start_operation(), id);
};
future<> exec_global_command_helper(auto nodes, const raft_topology_cmd& cmd) {
const auto cmd_index = ++_last_cmd_index;
_topology_cmd_rpc_tracker.current = cmd.cmd;
_topology_cmd_rpc_tracker.index = cmd_index;
auto f = co_await coroutine::as_future(
seastar::parallel_for_each(std::move(nodes), [this, &cmd, cmd_index] (raft::server_id id) {
return exec_direct_command_helper(id, cmd_index, cmd);
@@ -1510,7 +1519,13 @@ class topology_coordinator : public endpoint_lifecycle_subscriber {
}
rtlogger.info("Initiating tablet cleanup of {} on {}", gid, dst);
return ser::storage_service_rpc_verbs::send_tablet_cleanup(&_messaging,
dst.host, _as, raft::server_id(dst.host.uuid()), gid);
dst.host, _as, raft::server_id(dst.host.uuid()), gid)
.then([] {
return utils::get_local_injector().inject("wait_after_tablet_cleanup", [] (auto& handler) -> future<> {
rtlogger.info("Waiting after tablet cleanup");
return handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::seconds{60});
});
});
})) {
transition_to(locator::tablet_transition_stage::end_migration);
}
@@ -1730,6 +1745,11 @@ class topology_coordinator : public endpoint_lifecycle_subscriber {
}
future<> handle_tablet_resize_finalization(group0_guard g) {
co_await utils::get_local_injector().inject("handle_tablet_resize_finalization_wait", [] (auto& handler) -> future<> {
rtlogger.info("handle_tablet_resize_finalization: waiting");
co_await handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::seconds{60});
});
// Executes a global barrier to guarantee that any process (e.g. repair) holding stale version
// of token metadata will complete before we update topology.
auto guard = co_await global_tablet_token_metadata_barrier(std::move(g));
@@ -2988,7 +3008,8 @@ public:
raft_topology_cmd_handler_type raft_topology_cmd_handler,
tablet_allocator& tablet_allocator,
std::chrono::milliseconds ring_delay,
gms::feature_service& feature_service)
gms::feature_service& feature_service,
topology_coordinator_cmd_rpc_tracker& topology_cmd_rpc_tracker)
: _sys_dist_ks(sys_dist_ks), _gossiper(gossiper), _messaging(messaging)
, _shared_tm(shared_tm), _sys_ks(sys_ks), _db(db)
, _group0(group0), _topo_sm(topo_sm), _as(as)
@@ -3000,6 +3021,7 @@ public:
, _ring_delay(ring_delay)
, _group0_holder(_group0.hold_group0_gate())
, _voter_handler(group0, topo_sm._topology, gossiper, feature_service)
, _topology_cmd_rpc_tracker(topology_cmd_rpc_tracker)
, _async_gate("topology_coordinator")
{}
@@ -3614,7 +3636,8 @@ future<> run_topology_coordinator(
tablet_allocator& tablet_allocator,
std::chrono::milliseconds ring_delay,
endpoint_lifecycle_notifier& lifecycle_notifier,
gms::feature_service& feature_service) {
gms::feature_service& feature_service,
topology_coordinator_cmd_rpc_tracker& topology_cmd_rpc_tracker) {
topology_coordinator coordinator{
sys_dist_ks, gossiper, messaging, shared_tm,
@@ -3622,7 +3645,8 @@ future<> run_topology_coordinator(
std::move(raft_topology_cmd_handler),
tablet_allocator,
ring_delay,
feature_service};
feature_service,
topology_cmd_rpc_tracker};
std::exception_ptr ex;
lifecycle_notifier.register_subscriber(&coordinator);

View File

@@ -62,6 +62,12 @@ future<> wait_for_gossiper(raft::server_id id, const gms::gossiper& g, seastar::
using raft_topology_cmd_handler_type = noncopyable_function<future<raft_topology_cmd_result>(
raft::term_t, uint64_t, const raft_topology_cmd&)>;
struct topology_coordinator_cmd_rpc_tracker {
raft_topology_cmd::command current;
uint64_t index;
std::set<raft::server_id> active_dst;
};
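The struct above is updated around every topology cmd RPC: `exec_direct_command_helper()` records the destination in `active_dst` before sending and erases it via `seastar::defer` when the call completes, so a hung RPC leaves a visible trace. A minimal sketch of that guard pattern, with `raft::server_id` reduced to an `int` and a hand-rolled defer in place of seastar's:

```cpp
#include <cstdint>
#include <set>

// Simplified stand-in for topology_coordinator_cmd_rpc_tracker.
struct cmd_rpc_tracker {
    uint64_t index = 0;
    std::set<int> active_dst;
};

// Minimal defer-style guard, mirroring seastar::defer: runs its action
// when the scope exits, including on exceptions.
template <typename F>
struct scope_guard {
    F f;
    ~scope_guard() { f(); }
};
template <typename F> scope_guard(F) -> scope_guard<F>;

void exec_direct_command(cmd_rpc_tracker& t, int dst, uint64_t cmd_index) {
    t.index = cmd_index;
    t.active_dst.emplace(dst);
    scope_guard g{[&t, dst] { t.active_dst.erase(dst); }};
    // ... send the RPC and wait for the reply here; while blocked,
    // t.active_dst shows which nodes have not replied yet ...
}
```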
future<> run_topology_coordinator(
seastar::sharded<db::system_distributed_keyspace>& sys_dist_ks, gms::gossiper& gossiper,
netw::messaging_service& messaging, locator::shared_token_metadata& shared_tm,
@@ -71,6 +77,7 @@ future<> run_topology_coordinator(
tablet_allocator& tablet_allocator,
std::chrono::milliseconds ring_delay,
endpoint_lifecycle_notifier& lifecycle_notifier,
gms::feature_service& feature_service);
gms::feature_service& feature_service,
topology_coordinator_cmd_rpc_tracker& topology_cmd_rpc_tracker);
}

View File

@@ -27,6 +27,7 @@ extern logging::logger sstlog;
// data and no compression
template <ChecksumUtils ChecksumType, bool check_digest>
class checksummed_file_data_source_impl : public data_source_impl {
std::function<future<input_stream<char>>()> _stream_creator;
std::optional<input_stream<char>> _input_stream;
const checksum& _checksum;
[[no_unique_address]] digest_members<check_digest> _digests;
@@ -38,7 +39,7 @@ class checksummed_file_data_source_impl : public data_source_impl {
uint64_t _beg_pos;
uint64_t _end_pos;
public:
checksummed_file_data_source_impl(file f, uint64_t file_len,
checksummed_file_data_source_impl(stream_creator_fn stream_creator, uint64_t file_len,
const checksum& checksum, uint64_t pos, size_t len,
file_input_stream_options options,
std::optional<uint32_t> digest,
@@ -87,14 +88,20 @@ public:
}
auto start = align_down(_beg_pos, chunk_size);
auto end = std::min(_file_len, align_up(_end_pos, chunk_size));
_input_stream = make_file_input_stream(std::move(f), start, end - start, std::move(options));
_stream_creator = [stream_creator, start, length = end - start, options] mutable {
return stream_creator(start, length, std::move(options));
};
_underlying_pos = start;
}
virtual future<temporary_buffer<char>> get() override {
uint64_t chunk_size = _checksum.chunk_size;
if (_pos >= _end_pos) {
return make_ready_future<temporary_buffer<char>>();
co_return temporary_buffer<char>();
}
if (!_input_stream) {
_input_stream = co_await _stream_creator();
}
// Read the next chunk. We need to skip part of the first
// chunk, but then continue to read from beginning of chunks.
@@ -103,47 +110,46 @@ public:
if (_pos != _beg_pos && (_pos & (chunk_size - 1)) != 0) {
throw std::runtime_error(format("Checksummed reader not aligned to chunk boundary: pos={}, chunk_size={}", _pos, chunk_size));
}
return _input_stream->read_exactly(chunk_size).then([this, chunk_size](temporary_buffer<char> buf) {
uint32_t chunk_index = _pos >> _chunk_size_trailing_zeros;
if (buf.size() != chunk_size) {
auto actual_end = _underlying_pos + buf.size();
if (chunk_index + 1 < _checksum.checksums.size()) {
throw malformed_sstable_exception(seastar::format("Checksummed reader hit premature end-of-file at file offset {}: expected {} chunks of size {} but data file has {}",
actual_end, _checksum.checksums.size(), chunk_size, chunk_index + 1));
} else if (actual_end < _file_len) {
// Truncation on last chunk. Update _end_pos so that future
// calls to get() return immediately.
_end_pos = actual_end;
}
}
if (chunk_index >= _checksum.checksums.size()) {
throw malformed_sstable_exception(seastar::format("Chunk count mismatch between CRC and Data.db: expected {} but data file has more", _checksum.checksums.size()));
}
auto expected_checksum = _checksum.checksums[chunk_index];
auto actual_checksum = ChecksumType::checksum(buf.get(), buf.size());
if (expected_checksum != actual_checksum) {
_error_handler(seastar::format(
"Checksummed chunk of size {} at file offset {} failed checksum: expected={}, actual={}",
buf.size(), _underlying_pos, expected_checksum, actual_checksum));
auto buf = co_await _input_stream->read_exactly(chunk_size);
uint32_t chunk_index = _pos >> _chunk_size_trailing_zeros;
if (buf.size() != chunk_size) {
auto actual_end = _underlying_pos + buf.size();
if (chunk_index + 1 < _checksum.checksums.size()) {
throw malformed_sstable_exception(seastar::format("Checksummed reader hit premature end-of-file at file offset {}: expected {} chunks of size {} but data file has {}",
actual_end, _checksum.checksums.size(), chunk_size, chunk_index + 1));
} else if (actual_end < _file_len) {
// Truncation on last chunk. Update _end_pos so that future
// calls to get() return immediately.
_end_pos = actual_end;
}
}
if (chunk_index >= _checksum.checksums.size()) {
throw malformed_sstable_exception(seastar::format("Chunk count mismatch between CRC and Data.db: expected {} but data file has more", _checksum.checksums.size()));
}
auto expected_checksum = _checksum.checksums[chunk_index];
auto actual_checksum = ChecksumType::checksum(buf.get(), buf.size());
if (expected_checksum != actual_checksum) {
_error_handler(seastar::format(
"Checksummed chunk of size {} at file offset {} failed checksum: expected={}, actual={}",
buf.size(), _underlying_pos, expected_checksum, actual_checksum));
}
if constexpr (check_digest) {
if (_digests.can_calculate_digest) {
_digests.actual_digest = checksum_combine_or_feed<ChecksumType>(_digests.actual_digest, actual_checksum, buf.begin(), buf.size());
}
if constexpr (check_digest) {
if (_digests.can_calculate_digest) {
_digests.actual_digest = checksum_combine_or_feed<ChecksumType>(_digests.actual_digest, actual_checksum, buf.begin(), buf.size());
}
}
buf.trim_front(_pos & (chunk_size - 1));
_pos += buf.size();
_underlying_pos += chunk_size;
buf.trim_front(_pos & (chunk_size - 1));
_pos += buf.size();
_underlying_pos += chunk_size;
if constexpr (check_digest) {
if (_digests.can_calculate_digest && _pos == _file_len && _digests.expected_digest != _digests.actual_digest) {
_error_handler(seastar::format("Digest mismatch: expected={}, actual={}", _digests.expected_digest, _digests.actual_digest));
}
if constexpr (check_digest) {
if (_digests.can_calculate_digest && _pos == _file_len && _digests.expected_digest != _digests.actual_digest) {
_error_handler(seastar::format("Digest mismatch: expected={}, actual={}", _digests.expected_digest, _digests.actual_digest));
}
return buf;
});
}
co_return buf;
}
virtual future<> close() override {
@@ -166,59 +172,61 @@ public:
}
_pos += n;
if (_pos == _end_pos) {
return make_ready_future<temporary_buffer<char>>();
co_return temporary_buffer<char>();
}
auto underlying_n = align_down(_pos, chunk_size) - _underlying_pos;
_beg_pos = _pos;
_underlying_pos += underlying_n;
return _input_stream->skip(underlying_n).then([] {
return make_ready_future<temporary_buffer<char>>();
});
if (!_input_stream) {
_input_stream = co_await _stream_creator();
}
co_await _input_stream->skip(underlying_n);
co_return temporary_buffer<char>();
}
};
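The refactor above replaces the eagerly-opened file stream with a stored `_stream_creator` that `get()` and `skip()` invoke on first use, so recreating the data source (e.g. after an error) costs nothing until bytes are actually read. A minimal sketch of that lazy-open shape, with a `std::string` standing in for the input stream:

```cpp
#include <functional>
#include <optional>
#include <string>

// Stand-in for the data source: holds a stream factory instead of an
// already-open stream, and opens lazily on first read.
struct lazy_source {
    std::function<std::string()> stream_creator;  // factory, like _stream_creator
    std::optional<std::string> stream;            // like _input_stream
    int opens = 0;                                // how many times the factory ran

    const std::string& get() {
        if (!stream) {              // mirrors `if (!_input_stream)` in get()/skip()
            ++opens;
            stream = stream_creator();
        }
        return *stream;
    }
};
```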
template <ChecksumUtils ChecksumType, bool check_digest>
class checksummed_file_data_source : public data_source {
public:
checksummed_file_data_source(file f, uint64_t file_len, const checksum& checksum,
checksummed_file_data_source(stream_creator_fn stream_creator, uint64_t file_len, const checksum& checksum,
uint64_t offset, size_t len, file_input_stream_options options,
std::optional<uint32_t> digest, integrity_error_handler error_handler)
: data_source(std::make_unique<checksummed_file_data_source_impl<ChecksumType, check_digest>>(
std::move(f), file_len, checksum, offset, len, std::move(options), digest,
std::move(stream_creator), file_len, checksum, offset, len, std::move(options), digest,
error_handler))
{}
};
template <ChecksumUtils ChecksumType>
inline input_stream<char> make_checksummed_file_input_stream(
file f, uint64_t file_len, const checksum& checksum, uint64_t offset,
stream_creator_fn stream_creator, uint64_t file_len, const checksum& checksum, uint64_t offset,
size_t len, file_input_stream_options options, std::optional<uint32_t> digest,
integrity_error_handler error_handler)
{
if (digest) {
return input_stream<char>(checksummed_file_data_source<ChecksumType, true>(
std::move(f), file_len, checksum, offset, len, std::move(options), digest,
std::move(stream_creator), file_len, checksum, offset, len, std::move(options), digest,
error_handler));
}
return input_stream<char>(checksummed_file_data_source<ChecksumType, false>(
std::move(f), file_len, checksum, offset, len, std::move(options), digest, error_handler));
std::move(stream_creator), file_len, checksum, offset, len, std::move(options), digest, error_handler));
}
input_stream<char> make_checksummed_file_k_l_format_input_stream(
file f, uint64_t file_len, const checksum& checksum, uint64_t offset,
stream_creator_fn stream_creator, uint64_t file_len, const checksum& checksum, uint64_t offset,
size_t len, file_input_stream_options options, std::optional<uint32_t> digest,
integrity_error_handler error_handler)
{
return make_checksummed_file_input_stream<adler32_utils>(std::move(f), file_len,
return make_checksummed_file_input_stream<adler32_utils>(std::move(stream_creator), file_len,
checksum, offset, len, std::move(options), digest, error_handler);
}
input_stream<char> make_checksummed_file_m_format_input_stream(
file f, uint64_t file_len, const checksum& checksum, uint64_t offset,
stream_creator_fn stream_creator, uint64_t file_len, const checksum& checksum, uint64_t offset,
size_t len, file_input_stream_options options, std::optional<uint32_t> digest,
integrity_error_handler error_handler)
{
return make_checksummed_file_input_stream<crc32_utils>(std::move(f), file_len,
return make_checksummed_file_input_stream<crc32_utils>(std::move(stream_creator), file_len,
checksum, offset, len, std::move(options), digest, error_handler);
}

View File

@@ -9,24 +9,26 @@
#pragma once
#include <seastar/core/seastar.hh>
#include <seastar/core/fstream.hh>
#include <seastar/core/iostream.hh>
#include "sstables/types.hh"
namespace sstables {
using stream_creator_fn = std::function<future<input_stream<char>>(uint64_t, uint64_t, file_input_stream_options)>;
using integrity_error_handler = std::function<void(sstring)>;
void throwing_integrity_error_handler(sstring msg);
input_stream<char> make_checksummed_file_k_l_format_input_stream(file f,
input_stream<char> make_checksummed_file_k_l_format_input_stream(stream_creator_fn stream_creator,
uint64_t file_len, const sstables::checksum& checksum,
uint64_t offset, size_t len,
class file_input_stream_options options,
std::optional<uint32_t> digest,
integrity_error_handler error_handler = throwing_integrity_error_handler);
input_stream<char> make_checksummed_file_m_format_input_stream(file f,
input_stream<char> make_checksummed_file_m_format_input_stream(stream_creator_fn stream_creator,
uint64_t file_len, const sstables::checksum& checksum,
uint64_t offset, size_t len,
class file_input_stream_options options,

View File

@@ -304,6 +304,7 @@ enum class compressed_checksum_mode {
template <ChecksumUtils ChecksumType, bool check_digest, compressed_checksum_mode mode>
class compressed_file_data_source_impl : public data_source_impl {
std::function<future<input_stream<char>>()> _stream_creator;
std::optional<input_stream<char>> _input_stream;
sstables::compression* _compression_metadata;
sstables::compression::segmented_offsets::accessor _offsets;
@@ -314,7 +315,7 @@ class compressed_file_data_source_impl : public data_source_impl {
uint64_t _beg_pos;
uint64_t _end_pos;
public:
compressed_file_data_source_impl(file f, sstables::compression* cm,
compressed_file_data_source_impl(sstables::stream_creator_fn stream_creator, sstables::compression* cm,
uint64_t pos, size_t len, file_input_stream_options options,
reader_permit permit, std::optional<uint32_t> digest)
: _compression_metadata(cm)
@@ -352,15 +353,18 @@ public:
// and open a file_input_stream to read that range.
auto start = _compression_metadata->locate(_beg_pos, _offsets);
auto end = _compression_metadata->locate(_end_pos - 1, _offsets);
_input_stream = make_file_input_stream(std::move(f),
start.chunk_start,
end.chunk_start + end.chunk_len - start.chunk_start,
std::move(options));
_stream_creator = [stream_creator{std::move(stream_creator)}, start = start.chunk_start, length = end.chunk_start + end.chunk_len - start.chunk_start, options] mutable {
return stream_creator(start, length, std::move(options));
};
_underlying_pos = start.chunk_start;
}
virtual future<temporary_buffer<char>> get() override {
if (_pos >= _end_pos) {
return make_ready_future<temporary_buffer<char>>();
co_return temporary_buffer<char>();
}
if (!_input_stream) {
_input_stream = co_await _stream_creator();
}
auto addr = _compression_metadata->locate(_pos, _offsets);
// Uncompress the next chunk. We need to skip part of the first
@@ -371,58 +375,55 @@ public:
if (!addr.chunk_len) {
throw sstables::malformed_sstable_exception(format("compressed chunk_len must be greater than zero, chunk_start={}", addr.chunk_start));
}
return _input_stream->read_exactly(addr.chunk_len).then([this, addr](temporary_buffer<char> buf) {
if (buf.size() != addr.chunk_len) {
throw sstables::malformed_sstable_exception(format("compressed reader hit premature end-of-file at file offset {}, expected chunk_len={}, actual={}", _underlying_pos, addr.chunk_len, buf.size()));
auto buf = co_await _input_stream->read_exactly(addr.chunk_len);
if (buf.size() != addr.chunk_len) {
throw sstables::malformed_sstable_exception(format("compressed reader hit premature end-of-file at file offset {}, expected chunk_len={}, actual={}", _underlying_pos, addr.chunk_len, buf.size()));
}
auto res_units = co_await _permit.request_memory(_compression_metadata->uncompressed_chunk_length());
// The last 4 bytes of the chunk are the adler32/crc32 checksum
// of the rest of the (compressed) chunk.
auto compressed_len = addr.chunk_len - 4;
// FIXME: Do not always calculate checksum - Cassandra has a
// probability (defaulting to 1.0, but still...)
auto expected_checksum = read_be<uint32_t>(buf.get() + compressed_len);
auto actual_checksum = ChecksumType::checksum(buf.get(), compressed_len);
if (expected_checksum != actual_checksum) {
throw sstables::malformed_sstable_exception(format("compressed chunk of size {} at file offset {} failed checksum, expected={}, actual={}", addr.chunk_len, _underlying_pos, expected_checksum, actual_checksum));
}
if constexpr (check_digest) {
if (_digests.can_calculate_digest) {
_digests.actual_digest = checksum_combine_or_feed<ChecksumType>(_digests.actual_digest, actual_checksum, buf.get(), compressed_len);
if constexpr (mode == compressed_checksum_mode::checksum_all) {
uint32_t be_actual_checksum = cpu_to_be(actual_checksum);
_digests.actual_digest = ChecksumType::checksum(_digests.actual_digest,
reinterpret_cast<const char*>(&be_actual_checksum), sizeof(be_actual_checksum));
}
}
// We know that the uncompressed data will take exactly
// chunk_length bytes (or less, if reading the last chunk).
temporary_buffer<char> out(
_compression_metadata->uncompressed_chunk_length());
// The compressed data is the whole chunk, minus the last 4
// bytes (which contain the checksum verified above).
auto len = _compression_metadata->get_compressor().uncompress(buf.get(), compressed_len, out.get_write(), out.size());
out.trim(len);
out.trim_front(addr.offset);
_pos += out.size();
_underlying_pos += addr.chunk_len;
if constexpr (check_digest) {
if (_digests.can_calculate_digest
&& _pos == _compression_metadata->uncompressed_file_length()
&& _digests.expected_digest != _digests.actual_digest) {
throw sstables::malformed_sstable_exception(seastar::format("Digest mismatch: expected={}, actual={}", _digests.expected_digest, _digests.actual_digest));
}
}
co_return make_tracked_temporary_buffer(std::move(out), std::move(res_units));
}
virtual future<> close() override {
@@ -444,41 +445,42 @@ public:
}
_pos += n;
if (_pos == _end_pos) {
return make_ready_future<temporary_buffer<char>>();
co_return temporary_buffer<char>();
}
auto addr = _compression_metadata->locate(_pos, _offsets);
auto underlying_n = addr.chunk_start - _underlying_pos;
_underlying_pos = addr.chunk_start;
_beg_pos = _pos;
return _input_stream->skip(underlying_n).then([] {
return make_ready_future<temporary_buffer<char>>();
});
if (!_input_stream) {
_input_stream = co_await _stream_creator();
}
co_await _input_stream->skip(underlying_n);
co_return temporary_buffer<char>();
}
};
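The skip() path above relies on `_compression_metadata->locate()` to translate an uncompressed position into a compressed chunk address (chunk start, chunk length, offset within the chunk). A minimal sketch of that mapping, assuming a fixed uncompressed chunk length and a table of compressed chunk sizes — the real `sstables::compression` stores chunk offsets and differs in detail:

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Illustrative analogue of the address returned by locate().
struct chunk_addr {
    uint64_t chunk_start;  // offset of the chunk in the compressed file
    uint64_t chunk_len;    // compressed chunk size (incl. 4-byte checksum)
    uint64_t offset;       // offset of `pos` within the uncompressed chunk
};

chunk_addr locate(uint64_t pos, uint64_t uncompressed_chunk_len,
                  const std::vector<uint64_t>& compressed_lens) {
    // Which chunk holds `pos`, and where does that chunk start on disk?
    uint64_t idx = pos / uncompressed_chunk_len;
    uint64_t start = std::accumulate(compressed_lens.begin(),
                                     compressed_lens.begin() + idx,
                                     uint64_t(0));
    return {start, compressed_lens[idx], pos % uncompressed_chunk_len};
}
```

skip() then only needs `addr.chunk_start - _underlying_pos` bytes of underlying skip, which is why it can run without decompressing anything.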
template <ChecksumUtils ChecksumType, bool check_digest, compressed_checksum_mode mode>
class compressed_file_data_source : public data_source {
public:
compressed_file_data_source(file f, sstables::compression* cm,
compressed_file_data_source(sstables::stream_creator_fn stream_creator, sstables::compression* cm,
uint64_t offset, size_t len, file_input_stream_options options, reader_permit permit,
std::optional<uint32_t> digest)
: data_source(std::make_unique<compressed_file_data_source_impl<ChecksumType, check_digest, mode>>(
std::move(f), cm, offset, len, std::move(options), std::move(permit), digest))
std::move(stream_creator), cm, offset, len, std::move(options), std::move(permit), digest))
{}
};
template <ChecksumUtils ChecksumType, compressed_checksum_mode mode>
inline input_stream<char> make_compressed_file_input_stream(
file f, sstables::compression *cm, uint64_t offset, size_t len,
inline input_stream<char> make_compressed_file_input_stream(sstables::stream_creator_fn stream_creator, sstables::compression *cm, uint64_t offset, size_t len,
file_input_stream_options options, reader_permit permit,
std::optional<uint32_t> digest)
{
if (digest) [[unlikely]] {
return input_stream<char>(compressed_file_data_source<ChecksumType, true, mode>(
std::move(f), cm, offset, len, std::move(options), std::move(permit), digest));
std::move(stream_creator), cm, offset, len, std::move(options), std::move(permit), digest));
}
return input_stream<char>(compressed_file_data_source<ChecksumType, false, mode>(
std::move(f), cm, offset, len, std::move(options), std::move(permit), digest));
std::move(stream_creator), cm, offset, len, std::move(options), std::move(permit), digest));
}
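The chunk layout these readers assume — compressed payload followed by a 4-byte big-endian checksum of that payload — can be sketched without Seastar. `checksum_stub` here is a hypothetical stand-in for `ChecksumType::checksum` (the real code uses adler32 for k/l format and crc32 for m format):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for ChecksumType::checksum (FNV-1a, illustration only).
static uint32_t checksum_stub(const char* p, size_t n) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < n; ++i) { h ^= (unsigned char)p[i]; h *= 16777619u; }
    return h;
}

// Build a chunk: compressed payload + 4-byte big-endian checksum of it.
static std::vector<char> make_chunk(const std::vector<char>& payload) {
    std::vector<char> chunk(payload);
    uint32_t c = checksum_stub(payload.data(), payload.size());
    for (int i = 3; i >= 0; --i) { chunk.push_back(char(c >> (i * 8))); }
    return chunk;
}

// Mirror of the verification above: read the trailing checksum,
// recompute over the first chunk_len - 4 bytes, compare.
static bool verify_chunk(const std::vector<char>& chunk) {
    if (chunk.size() < 4) throw std::runtime_error("premature end-of-file");
    size_t compressed_len = chunk.size() - 4;
    uint32_t expected = 0;
    for (int i = 0; i < 4; ++i) {
        expected = (expected << 8) | (unsigned char)chunk[compressed_len + i];
    }
    return expected == checksum_stub(chunk.data(), compressed_len);
}
```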
// compressed_file_data_sink_impl works as a filter for a file output stream,
@@ -577,21 +579,21 @@ inline output_stream<char> make_compressed_file_output_stream(output_stream<char
return output_stream<char>(compressed_file_data_sink<ChecksumType, mode>(std::move(out), cm));
}
input_stream<char> sstables::make_compressed_file_k_l_format_input_stream(file f,
input_stream<char> sstables::make_compressed_file_k_l_format_input_stream(stream_creator_fn stream_creator,
sstables::compression* cm, uint64_t offset, size_t len,
class file_input_stream_options options, reader_permit permit,
std::optional<uint32_t> digest)
{
return make_compressed_file_input_stream<adler32_utils, compressed_checksum_mode::checksum_chunks_only>(
std::move(f), cm, offset, len, std::move(options), std::move(permit), digest);
std::move(stream_creator), cm, offset, len, std::move(options), std::move(permit), digest);
}
input_stream<char> sstables::make_compressed_file_m_format_input_stream(file f,
input_stream<char> sstables::make_compressed_file_m_format_input_stream(stream_creator_fn stream_creator,
sstables::compression *cm, uint64_t offset, size_t len,
class file_input_stream_options options, reader_permit permit,
std::optional<uint32_t> digest) {
return make_compressed_file_input_stream<crc32_utils, compressed_checksum_mode::checksum_all>(
std::move(f), cm, offset, len, std::move(options), std::move(permit), digest);
std::move(stream_creator), cm, offset, len, std::move(options), std::move(permit), digest);
}
output_stream<char> sstables::make_compressed_file_m_format_output_stream(output_stream<char> out,


@@ -361,17 +361,19 @@ public:
friend class sstable;
};
using stream_creator_fn = std::function<future<input_stream<char>>(uint64_t, uint64_t, file_input_stream_options)>;
// Note: compression_metadata is passed by reference; The caller is
// responsible for keeping the compression_metadata alive as long as there
// are open streams on it. This should happen naturally on a higher level -
// as long as we have *sstables* work in progress, we need to keep the whole
// sstable alive, and the compression metadata is only a part of it.
input_stream<char> make_compressed_file_k_l_format_input_stream(file f,
input_stream<char> make_compressed_file_k_l_format_input_stream(stream_creator_fn stream_creator,
sstables::compression* cm, uint64_t offset, size_t len,
class file_input_stream_options options, reader_permit permit,
std::optional<uint32_t> digest);
input_stream<char> make_compressed_file_m_format_input_stream(file f,
input_stream<char> make_compressed_file_m_format_input_stream(stream_creator_fn stream_creator,
sstables::compression* cm, uint64_t offset, size_t len,
class file_input_stream_options options, reader_permit permit,
std::optional<uint32_t> digest);
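The point of taking a `stream_creator_fn` instead of a `file` is that the wrapper can defer opening the underlying source until the first actual read; skip() only adjusts offsets, as seen in the data-source change above. A synchronous sketch of that pattern with plain `std::function` — `string_stream` is a hypothetical stand-in for `input_stream<char>`:

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Stand-in for input_stream<char>: reads from an in-memory string.
struct string_stream {
    std::string data;
    size_t pos = 0;
    std::string read(size_t n) {
        auto s = data.substr(pos, n);
        pos += s.size();
        return s;
    }
};

using stream_creator = std::function<string_stream(uint64_t offset, uint64_t len)>;

class lazy_stream {
    stream_creator _creator;
    std::optional<string_stream> _stream;
    uint64_t _offset;
    uint64_t _len;
public:
    lazy_stream(stream_creator c, uint64_t offset, uint64_t len)
        : _creator(std::move(c)), _offset(offset), _len(len) {}
    // No underlying stream is opened here - only the window moves.
    void skip(uint64_t n) { _offset += n; _len -= n; }
    std::string read(size_t n) {
        if (!_stream) {
            _stream = _creator(_offset, _len);  // created on first demand
        }
        return _stream->read(n);
    }
};
```

A consumer that only skips never pays the cost of opening the source at all, which matters for remote (e.g. S3-backed) streams.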


@@ -454,12 +454,12 @@ class index_reader {
bool _single_page_read;
abort_source _abort;
std::unique_ptr<index_consume_entry_context<index_consumer>> make_context(uint64_t begin, uint64_t end, index_consumer& consumer) {
future<std::unique_ptr<index_consume_entry_context<index_consumer>>> make_context(uint64_t begin, uint64_t end, index_consumer& consumer) {
auto index_file = make_tracked_index_file(*_sstable, _permit, _trace_state, _use_caching);
auto input = make_file_input_stream(index_file, begin, (_single_page_read ? end : _sstable->index_size()) - begin,
get_file_input_stream_options());
auto input = input_stream<char>(co_await _sstable->get_storage().make_data_or_index_source(
*_sstable, component_type::Index, index_file, begin, (_single_page_read ? end : _sstable->index_size()) - begin, get_file_input_stream_options()));
auto trust_pi = trust_promoted_index(_sstable->has_correct_promoted_index_entries());
return std::make_unique<index_consume_entry_context<index_consumer>>(*_sstable, _permit, consumer, trust_pi, std::move(input),
co_return std::make_unique<index_consume_entry_context<index_consumer>>(*_sstable, _permit, consumer, trust_pi, std::move(input),
begin, end - begin, _sstable->get_column_translation(), _abort, _trace_state);
}
@@ -467,12 +467,12 @@ class index_reader {
assert(!bound.context || !_single_page_read);
if (!bound.context) {
bound.consumer = std::make_unique<index_consumer>(_region, _sstable->get_schema());
bound.context = make_context(begin, end, *bound.consumer);
bound.context = co_await make_context(begin, end, *bound.consumer);
bound.consumer->prepare(quantity);
return make_ready_future<>();
co_return;
}
bound.consumer->prepare(quantity);
return bound.context->fast_forward_to(begin, end);
co_return co_await bound.context->fast_forward_to(begin, end);
}
private:


@@ -1343,12 +1343,12 @@ private:
if (_single_partition_read) {
_read_enabled = (begin != *end);
_context = data_consume_single_partition<DataConsumeRowsContext>(*_schema, _sst, _consumer, { begin, *end }, integrity_check::no);
_context = co_await data_consume_single_partition<DataConsumeRowsContext>(*_schema, _sst, _consumer, { begin, *end }, integrity_check::no);
} else {
sstable::disk_read_range drr{begin, *end};
auto last_end = _fwd_mr ? _sst->data_size() : drr.end;
_read_enabled = bool(drr);
_context = data_consume_rows<DataConsumeRowsContext>(*_schema, _sst, _consumer, std::move(drr), last_end, integrity_check::no);
_context = co_await data_consume_rows<DataConsumeRowsContext>(*_schema, _sst, _consumer, std::move(drr), last_end, integrity_check::no);
}
_monitor.on_read_started(_context->reader_position());
@@ -1545,6 +1545,7 @@ class sstable_full_scan_reader : public mp_row_consumer_reader_k_l {
Consumer _consumer;
std::unique_ptr<DataConsumeRowsContext> _context;
read_monitor& _monitor;
integrity_check _integrity_check;
public:
sstable_full_scan_reader(shared_sstable sst, schema_ptr schema,
reader_permit permit,
@@ -1553,9 +1554,8 @@ public:
integrity_check integrity)
: mp_row_consumer_reader_k_l(std::move(schema), permit, std::move(sst))
, _consumer(this, _schema, std::move(permit), _schema->full_slice(), std::move(trace_state), streamed_mutation::forwarding::no, _sst)
, _context(data_consume_rows<DataConsumeRowsContext>(*_schema, _sst, _consumer, integrity))
, _monitor(mon) {
_monitor.on_read_started(_context->reader_position());
, _monitor(mon)
, _integrity_check(integrity) {
}
public:
void on_out_of_clustering_range() override {
@@ -1571,14 +1571,18 @@ public:
on_internal_error(sstlog, "sstable_full_scan_reader: doesn't support next_partition()");
}
virtual future<> fill_buffer() override {
if (!_context) {
_context = co_await data_consume_rows<DataConsumeRowsContext>(*_schema, _sst, _consumer, _integrity_check);
_monitor.on_read_started(_context->reader_position());
}
if (_end_of_stream) {
return make_ready_future<>();
co_return;
}
if (_context->eof()) {
_end_of_stream = true;
return make_ready_future<>();
co_return;
}
return _context->consume_input();
co_return co_await _context->consume_input();
}
virtual future<> close() noexcept override {
if (!_context) {


@@ -402,7 +402,7 @@ class partition_reversing_data_source_impl final : public data_source_impl {
FINISHED
} _state = state::RANGE_END;
private:
input_stream<char> data_stream(size_t start, size_t end) {
future<input_stream<char>> data_stream(size_t start, size_t end) {
return _sst->data_stream(start, end - start, _permit, _trace_state, {});
}
future<temporary_buffer<char>> data_read(uint64_t start, uint64_t end) {
@@ -474,7 +474,7 @@ public:
virtual future<temporary_buffer<char>> get() override {
if (!_partition_header_context) {
_partition_header_context.emplace(data_stream(_partition_start, _partition_end), _partition_start, _partition_end - _partition_start, _permit);
_partition_header_context.emplace(co_await data_stream(_partition_start, _partition_end), _partition_start, _partition_end - _partition_start, _permit);
co_await _partition_header_context->consume_input();
_clustering_range_start = _partition_header_context->header_end_pos();
co_return co_await data_read(_partition_start, _clustering_range_start);
@@ -507,7 +507,7 @@ public:
}
look_in_last_block = true;
} else {
co_await emplace_row_skipping_context(data_stream(_row_start, _row_end), _row_start, _row_end);
co_await emplace_row_skipping_context(co_await data_stream(_row_start, _row_end), _row_start, _row_end);
co_await _row_skipping_context->consume_input();
if (_row_skipping_context->end_of_partition()) {
look_in_last_block = true;
@@ -526,7 +526,7 @@ public:
_row_start = _clustering_range_start;
}
uint64_t last_row_start = _row_start;
co_await emplace_row_skipping_context(data_stream(_row_start, _partition_end), _row_start, _partition_end);
co_await emplace_row_skipping_context(co_await data_stream(_row_start, _partition_end), _row_start, _partition_end);
co_await _row_skipping_context->consume_input();
while (!_row_skipping_context->end_of_partition()) {
last_row_start = _row_start;


@@ -1568,13 +1568,13 @@ private:
_context = std::move(reversed_context.the_context);
_reversed_read_sstable_position = &reversed_context.current_position_in_sstable;
} else {
_context = data_consume_single_partition<DataConsumeRowsContext>(*_schema, _sst, _consumer, { begin, *end }, _integrity);
_context = co_await data_consume_single_partition<DataConsumeRowsContext>(*_schema, _sst, _consumer, { begin, *end }, _integrity);
}
} else {
sstable::disk_read_range drr{begin, *end};
auto last_end = _fwd_mr ? _sst->data_size() : drr.end;
_read_enabled = bool(drr);
_context = data_consume_rows<DataConsumeRowsContext>(*_schema, _sst, _consumer, std::move(drr), last_end, _integrity);
_context = co_await data_consume_rows<DataConsumeRowsContext>(*_schema, _sst, _consumer, std::move(drr), last_end, _integrity);
}
_monitor.on_read_started(_context->reader_position());
@@ -1813,7 +1813,7 @@ private:
_checksum = co_await _sst->read_checksum();
co_await _sst->read_digest();
}
_context = data_consume_rows<DataConsumeRowsContext>(*_schema, _sst, _consumer, _integrity);
_context = co_await data_consume_rows<DataConsumeRowsContext>(*_schema, _sst, _consumer, _integrity);
_monitor.on_read_started(_context->reader_position());
}
public:
@@ -2105,7 +2105,7 @@ future<uint64_t> validate(
sstables::read_monitor& monitor) {
auto schema = sstable->get_schema();
validating_consumer consumer(schema, permit, sstable, std::move(error_handler));
auto context = data_consume_rows<data_consume_rows_context_m<validating_consumer>>(*schema, sstable, consumer, integrity_check::yes);
auto context = co_await data_consume_rows<data_consume_rows_context_m<validating_consumer>>(*schema, sstable, consumer, integrity_check::yes);
std::optional<sstables::index_reader> idx_reader;
idx_reader.emplace(sstable, permit, tracing::trace_state_ptr{}, sstables::use_caching::no, false);


@@ -109,16 +109,16 @@ position_in_partition_view get_slice_lower_bound(const schema& s, const query::p
// The amount of this excessive read is controlled by read ahead
// heuristics which learn from the usefulness of previous read aheads.
template <typename DataConsumeRowsContext>
inline std::unique_ptr<DataConsumeRowsContext> data_consume_rows(const schema& s, shared_sstable sst, typename DataConsumeRowsContext::consumer& consumer,
inline future<std::unique_ptr<DataConsumeRowsContext>> data_consume_rows(const schema& s, shared_sstable sst, typename DataConsumeRowsContext::consumer& consumer,
sstable::disk_read_range toread, uint64_t last_end, integrity_check integrity) {
// Although we were only asked to read until toread.end, we'll not limit
// the underlying file input stream to this end, but rather to last_end.
// This potentially enables read-ahead beyond end, until last_end, which
// can be beneficial if the user wants to fast_forward_to() on the
// returned context, and may make small skips.
auto input = sst->data_stream(toread.start, last_end - toread.start,
auto input = co_await sst->data_stream(toread.start, last_end - toread.start,
consumer.permit(), consumer.trace_state(), sst->_partition_range_history, sstable::raw_stream::no, integrity);
return std::make_unique<DataConsumeRowsContext>(s, std::move(sst), consumer, std::move(input), toread.start, toread.end - toread.start);
co_return std::make_unique<DataConsumeRowsContext>(s, std::move(sst), consumer, std::move(input), toread.start, toread.end - toread.start);
}
template <typename DataConsumeRowsContext>
@@ -149,16 +149,16 @@ inline reversed_context<DataConsumeRowsContext> data_consume_reversed_partition(
}
template <typename DataConsumeRowsContext>
inline std::unique_ptr<DataConsumeRowsContext> data_consume_single_partition(const schema& s, shared_sstable sst, typename DataConsumeRowsContext::consumer& consumer,
inline future<std::unique_ptr<DataConsumeRowsContext>> data_consume_single_partition(const schema& s, shared_sstable sst, typename DataConsumeRowsContext::consumer& consumer,
sstable::disk_read_range toread, integrity_check integrity) {
auto input = sst->data_stream(toread.start, toread.end - toread.start,
auto input = co_await sst->data_stream(toread.start, toread.end - toread.start,
consumer.permit(), consumer.trace_state(), sst->_single_partition_history, sstable::raw_stream::no, integrity);
return std::make_unique<DataConsumeRowsContext>(s, std::move(sst), consumer, std::move(input), toread.start, toread.end - toread.start);
co_return std::make_unique<DataConsumeRowsContext>(s, std::move(sst), consumer, std::move(input), toread.start, toread.end - toread.start);
}
// Like data_consume_rows() with bounds, but iterates over whole range
template <typename DataConsumeRowsContext>
inline std::unique_ptr<DataConsumeRowsContext> data_consume_rows(const schema& s, shared_sstable sst, typename DataConsumeRowsContext::consumer& consumer,
inline future<std::unique_ptr<DataConsumeRowsContext>> data_consume_rows(const schema& s, shared_sstable sst, typename DataConsumeRowsContext::consumer& consumer,
integrity_check integrity) {
auto data_size = sst->data_size();
return data_consume_rows<DataConsumeRowsContext>(s, std::move(sst), consumer, {0, data_size}, data_size, integrity);


@@ -2458,7 +2458,7 @@ component_type sstable::component_from_sstring(version_types v, const sstring &s
}
}
input_stream<char> sstable::data_stream(uint64_t pos, size_t len,
future<input_stream<char>> sstable::data_stream(uint64_t pos, size_t len,
reader_permit permit, tracing::trace_state_ptr trace_state, lw_shared_ptr<file_input_stream_history> history, raw_stream raw,
integrity_check integrity, integrity_error_handler error_handler) {
file_input_stream_options options;
@@ -2475,13 +2475,15 @@ input_stream<char> sstable::data_stream(uint64_t pos, size_t len,
if (integrity == integrity_check::yes) {
digest = get_digest();
}
auto stream_creator = [this, f](uint64_t pos, uint64_t len, file_input_stream_options options) mutable -> future<input_stream<char>> {
co_return input_stream<char>(co_await _storage->make_data_or_index_source(*this, component_type::Data, std::move(f), pos, len, std::move(options)));
};
if (_components->compression && raw == raw_stream::no) {
if (_version >= sstable_version_types::mc) {
return make_compressed_file_m_format_input_stream(f, &_components->compression,
co_return make_compressed_file_m_format_input_stream(stream_creator, &_components->compression,
pos, len, std::move(options), permit, digest);
} else {
return make_compressed_file_k_l_format_input_stream(f, &_components->compression,
co_return make_compressed_file_k_l_format_input_stream(stream_creator, &_components->compression,
pos, len, std::move(options), permit, digest);
}
}
@@ -2489,22 +2491,21 @@ input_stream<char> sstable::data_stream(uint64_t pos, size_t len,
auto checksum = get_checksum();
auto file_len = data_size();
if (_version >= sstable_version_types::mc) {
return make_checksummed_file_m_format_input_stream(f, file_len,
co_return make_checksummed_file_m_format_input_stream(stream_creator, file_len,
*checksum, pos, len, std::move(options), digest, error_handler);
} else {
return make_checksummed_file_k_l_format_input_stream(f, file_len,
co_return make_checksummed_file_k_l_format_input_stream(stream_creator, file_len,
*checksum, pos, len, std::move(options), digest, error_handler);
}
}
return make_file_input_stream(f, pos, len, std::move(options));
co_return co_await stream_creator(pos, len, std::move(options));
}
future<temporary_buffer<char>> sstable::data_read(uint64_t pos, size_t len, reader_permit permit) {
return do_with(data_stream(pos, len, std::move(permit), tracing::trace_state_ptr(), {}), [len] (auto& stream) {
return stream.read_exactly(len).finally([&stream] {
return stream.close();
});
});
auto stream = co_await data_stream(pos, len, std::move(permit), tracing::trace_state_ptr(), {});
auto buff = co_await stream.read_exactly(len);
co_await stream.close();
co_return buff;
}
template <typename ChecksumType>
@@ -2670,10 +2671,10 @@ future<validate_checksums_result> validate_checksums(shared_sstable sst, reader_
input_stream<char> data_stream;
if (sst->get_compression()) {
data_stream = sst->data_stream(0, sst->ondisk_data_size(), permit,
data_stream = co_await sst->data_stream(0, sst->ondisk_data_size(), permit,
nullptr, nullptr, sstable::raw_stream::yes);
} else {
data_stream = sst->data_stream(0, sst->data_size(), permit,
data_stream = co_await sst->data_stream(0, sst->data_size(), permit,
nullptr, nullptr, sstable::raw_stream::no,
integrity_check::yes, [&ret](sstring msg) {
sstlog.error("{}", msg);
@@ -3654,6 +3655,9 @@ future<data_sink> file_io_extension::wrap_sink(const sstable& sst, component_typ
co_return co_await make_file_data_sink(std::move(f), file_output_stream_options{});
}
future<data_source> file_io_extension::wrap_source(const sstable& sst, component_type c, sstables::data_source_creator_fn, uint64_t, uint64_t) {
SCYLLA_ASSERT(0 && "You are not supposed to get here, file_io_extension::wrap_source() is not implemented");
}
} // namespace sstables
namespace seastar {


@@ -752,7 +752,7 @@ public:
// integrity-checked stream with no compression. The parameter is ignored
// if integrity checking is disabled or the SSTable is compressed.
using raw_stream = bool_class<class raw_stream_tag>;
input_stream<char> data_stream(uint64_t pos, size_t len,
future<input_stream<char>> data_stream(uint64_t pos, size_t len,
reader_permit permit, tracing::trace_state_ptr trace_state, lw_shared_ptr<file_input_stream_history> history,
raw_stream raw = raw_stream::no, integrity_check integrity = integrity_check::no,
integrity_error_handler error_handler = throwing_integrity_error_handler);
@@ -1051,13 +1051,13 @@ public:
friend class promoted_index;
friend class sstables_manager;
template <typename DataConsumeRowsContext>
friend std::unique_ptr<DataConsumeRowsContext>
friend future<std::unique_ptr<DataConsumeRowsContext>>
data_consume_rows(const schema&, shared_sstable, typename DataConsumeRowsContext::consumer&, disk_read_range, uint64_t, integrity_check);
template <typename DataConsumeRowsContext>
friend std::unique_ptr<DataConsumeRowsContext>
friend future<std::unique_ptr<DataConsumeRowsContext>>
data_consume_single_partition(const schema&, shared_sstable, typename DataConsumeRowsContext::consumer&, disk_read_range, integrity_check);
template <typename DataConsumeRowsContext>
friend std::unique_ptr<DataConsumeRowsContext>
friend future<std::unique_ptr<DataConsumeRowsContext>>
data_consume_rows(const schema&, shared_sstable, typename DataConsumeRowsContext::consumer&, integrity_check);
friend void lw_shared_ptr_deleter<sstables::sstable>::dispose(sstable* s);
gc_clock::time_point get_gc_before_for_drop_estimation(const gc_clock::time_point& compaction_time, const tombstone_gc_state& gc_state, const schema_ptr& s) const;
@@ -1126,6 +1126,8 @@ public:
// output device. Default impl will call wrap_file and generate a wrapper object.
virtual future<data_sink> wrap_sink(const sstable&, component_type, data_sink);
virtual future<data_source>
wrap_source(const sstable&, component_type, sstables::data_source_creator_fn, uint64_t offset, uint64_t len);
// optionally return a map of attributes for a given sstable,
// suitable for "describe".
// This would preferably be interesting info on what/why the extension did


@@ -89,6 +89,7 @@ public:
virtual future<> wipe(const sstable& sst, sync_dir) noexcept override;
virtual future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity) override;
virtual future<data_sink> make_data_or_index_sink(sstable& sst, component_type type) override;
future<data_source> make_data_or_index_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const override;
virtual future<data_sink> make_component_sink(sstable& sst, component_type type, open_flags oflags, file_output_stream_options options) override;
virtual future<> destroy(const sstable& sst) override { return make_ready_future<>(); }
virtual future<atomic_delete_context> atomic_delete_prepare(const std::vector<shared_sstable>&) const override;
@@ -110,6 +111,11 @@ future<data_sink> filesystem_storage::make_data_or_index_sink(sstable& sst, comp
return make_file_data_sink(type == component_type::Data ? std::move(sst._data_file) : std::move(sst._index_file), options);
}
future<data_source> filesystem_storage::make_data_or_index_source(sstable&, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const {
SCYLLA_ASSERT(type == component_type::Data || type == component_type::Index);
co_return make_file_data_source(std::move(f), offset, len, std::move(opt));
}
future<data_sink> filesystem_storage::make_component_sink(sstable& sst, component_type type, open_flags oflags, file_output_stream_options options) {
return sst.new_sstable_component_file(sst._write_error_handler, type, oflags).then([options = std::move(options)] (file f) mutable {
return make_file_data_sink(std::move(f), std::move(options));
@@ -570,6 +576,7 @@ public:
virtual future<> wipe(const sstable& sst, sync_dir) noexcept override;
virtual future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity) override;
virtual future<data_sink> make_data_or_index_sink(sstable& sst, component_type type) override;
future<data_source> make_data_or_index_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const override;
virtual future<data_sink> make_component_sink(sstable& sst, component_type type, open_flags oflags, file_output_stream_options options) override;
virtual future<> destroy(const sstable& sst) override {
return make_ready_future<>();
@@ -639,12 +646,45 @@ static future<data_sink> maybe_wrap_sink(const sstable& sst, component_type type
co_return sink;
}
static future<data_source> maybe_wrap_source(const sstable& sst, component_type type, data_source_creator_fn source_creator, uint64_t offset, uint64_t len) {
if (type != component_type::TOC && type != component_type::TemporaryTOC) {
for (auto* ext : sst.manager().config().extensions().sstable_file_io_extensions()) {
std::exception_ptr p;
try {
co_return co_await ext->wrap_source(sst, type, std::move(source_creator), offset, len);
} catch (...) {
p = std::current_exception();
}
if (p) {
std::rethrow_exception(std::move(p));
}
}
}
co_return source_creator(offset, len);
}
future<data_sink> s3_storage::make_data_or_index_sink(sstable& sst, component_type type) {
SCYLLA_ASSERT(type == component_type::Data || type == component_type::Index);
// FIXME: if we have file size upper bound upfront, it's better to use make_upload_sink() instead
return maybe_wrap_sink(sst, type, _client->make_upload_jumbo_sink(make_s3_object_name(sst, type), std::nullopt, _as));
}
future<data_source>
s3_storage::make_data_or_index_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options options) const {
if (offset == 0) {
co_return co_await maybe_wrap_source(
sst,
type,
[this, object_name = make_s3_object_name(sst, type)](uint64_t, uint64_t) {
return _client->make_chunked_download_source(object_name, s3::full_range, _as);
},
offset,
len);
}
co_return make_file_data_source(
co_await maybe_wrap_file(sst, type, open_flags::ro, _client->make_readable_file(make_s3_object_name(sst, type), _as)), offset, len, std::move(options));
}
future<data_sink> s3_storage::make_component_sink(sstable& sst, component_type type, open_flags oflags, file_output_stream_options options) {
return maybe_wrap_sink(sst, type, _client->make_upload_sink(make_s3_object_name(sst, type), _as));
}
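The `offset == 0` branch above encodes a simple policy: a read starting at the beginning of the object uses the sequential chunked download source (the object is streamed once, avoiding repeated ranged GETs), while a mid-object read falls back to a ranged readable file. A sketch of that dispatch with illustrative names, not the actual Scylla API:

```cpp
#include <cassert>
#include <cstdint>

enum class source_kind {
    chunked_download,  // stream the whole object sequentially
    ranged_file,       // ranged reads via a readable-file abstraction
};

// A full read from offset 0 favors a single sequential download;
// anything else needs random access into the object.
inline source_kind pick_source(uint64_t offset) {
    return offset == 0 ? source_kind::chunked_download
                       : source_kind::ranged_file;
}
```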


@@ -106,6 +106,7 @@ public:
virtual future<> wipe(const sstable& sst, sync_dir) noexcept = 0;
virtual future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity) = 0;
virtual future<data_sink> make_data_or_index_sink(sstable& sst, component_type type) = 0;
virtual future<data_source> make_data_or_index_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const = 0;
virtual future<data_sink> make_component_sink(sstable& sst, component_type type, open_flags oflags, file_output_stream_options options) = 0;
virtual future<> destroy(const sstable& sst) = 0;
virtual future<atomic_delete_context> atomic_delete_prepare(const std::vector<shared_sstable>&) const = 0;
@@ -124,4 +125,5 @@ future<> init_keyspace_storage(const sstables_manager&, const data_dictionary::s
std::vector<std::filesystem::path> get_local_directories(const db::config& db, const data_dictionary::storage_options::local& so);
using data_source_creator_fn = std::function<data_source(uint64_t, uint64_t)>;
} // namespace sstables


@@ -7,6 +7,7 @@
*/
#include <seastar/core/coroutine.hh>
#include <seastar/core/with_scheduling_group.hh>
#include "consumer.hh"
#include "replica/database.hh"
@@ -35,7 +36,9 @@ mutation_reader_consumer make_streaming_consumer(sstring origin,
auto cf = db.local().find_column_family(reader.schema()).shared_from_this();
auto guard = service::topology_guard(frozen_guard);
auto use_view_update_path = co_await db::view::check_needs_view_update_path(vb.local(), db.local().get_token_metadata_ptr(), *cf, reason);
bool use_view_update_path = co_await with_scheduling_group(db.local().get_gossip_scheduling_group(), [&] {
return db::view::check_needs_view_update_path(vb.local(), db.local().get_token_metadata_ptr(), *cf, reason);
});
//FIXME: for better estimations this should be transmitted from remote
auto metadata = mutation_source_metadata{};
auto& cs = cf->get_compaction_strategy();


@@ -121,10 +121,6 @@ future<std::optional<double>> task_manager::task::impl::expected_total_workload(
return make_ready_future<std::optional<double>>(std::nullopt);
}
std::optional<double> task_manager::task::impl::expected_children_number() const {
return std::nullopt;
}
task_manager::task::progress task_manager::task::impl::get_binary_progress() const {
return tasks::task_manager::task::progress{
.completed = is_complete(),
@@ -133,20 +129,10 @@ task_manager::task::progress task_manager::task::impl::get_binary_progress() con
}
future<task_manager::task::progress> task_manager::task::impl::get_progress() const {
auto children_num = _children.size();
if (children_num == 0) {
co_return get_binary_progress();
std::optional<double> expected_workload = co_await expected_total_workload();
if (!expected_workload && _children.size() == 0) {
co_return task_manager::task::progress{};
}
std::optional<double> expected_workload = std::nullopt;
auto expected_children_num = expected_children_number();
// When get_progress is called, the task can have some of its children unregistered yet.
// Then if total workload is not known, progress obtained from children may be deceiving.
// In such a situation it's safer to return binary progress value.
if (expected_children_num.value_or(0) != children_num && !(expected_workload = co_await expected_total_workload())) {
co_return get_binary_progress();
}
auto progress = co_await _children.get_progress(_status.progress_units);
progress.total = expected_workload.value_or(progress.total);
co_return progress;


@@ -226,7 +226,6 @@ public:
future<> finish_failed(std::exception_ptr ex, std::string error) noexcept;
future<> finish_failed(std::exception_ptr ex) noexcept;
virtual future<std::optional<double>> expected_total_workload() const;
virtual std::optional<double> expected_children_number() const;
task_manager::task::progress get_binary_progress() const;
friend task;


@@ -637,3 +637,24 @@ def test_query_large_page_small_rows(test_table_sn):
ConsistentRead=True)['Items']
n = len(got_items)
assert n == N
# This test is a less extreme and faster version of the previous test
# (test_query_large_page_small_rows): We test a query returning a large but
# not huge number (700) of tiny rows. If Alternator has a special code path
# for handling a response with that many rows (namely, to work around problems
# with RapidJSON's contiguous allocation of array objects - see #23535),
# then this test exercises this case.
def test_query_many_small_rows(test_table_sn):
p = random_string()
N = 700
with test_table_sn.batch_writer() as batch:
for i in range(N):
batch.put_item({'p': p, 'c': i})
got_items = test_table_sn.query(KeyConditions={
'p': {'AttributeValueList': [p], 'ComparisonOperator': 'EQ'}},
ConsistentRead=True)['Items']
i = 0
for item in got_items:
assert item == {'p': p, 'c': i}
i += 1
assert N == i


@@ -230,7 +230,7 @@ SEASTAR_THREAD_TEST_CASE(test_permissions_of_cdc_log_table) {
log_table, stream_id, time, batch_seq_no
)).get();
e.execute_cql("SELECT * FROM " + log_table).get();
e.execute_cql("ALTER TABLE " + log_table + " ALTER \"" + ttl + "\" TYPE blob").get();
e.execute_cql("ALTER TABLE " + log_table + " WITH comment = 'some not very interesting comment'").get();
// Disallow DROP
assert_unauthorized("DROP TABLE " + log_table);


@@ -483,3 +483,130 @@ SEASTAR_TEST_CASE(test_encrypted_sink_data_large) {
return test_random_data_sink({ 4096, 4096, 4096, 4096, 8192, 1232, 32, 4096, 134 });
}
static future<> test_random_data_source(std::vector<size_t> sizes) {
testlog.info("test_random_data_source with sizes: {} ({})", sizes, std::accumulate(sizes.begin(), sizes.end(), size_t(0), std::plus{}));
auto name = "test_rand_source";
std::vector<temporary_buffer<char>> bufs, srcs;
auto [dst, k] = make_filename(name);
using namespace std::chrono_literals;
std::exception_ptr ex = nullptr;
data_sink sink(make_encrypted_sink(create_memory_sink(bufs), k));
try {
for (size_t s : sizes) {
auto buf = generate_random<char>(s);
co_await sink.put(buf.clone()); // deep copy. encrypted sink uses "owned" data
srcs.emplace_back(std::move(buf));
}
} catch (...) {
ex = std::current_exception();
}
co_await sink.close();
if (ex) {
std::rethrow_exception(ex);
}
{
auto os = co_await make_file_output_stream(co_await open_file_dma(dst, open_flags::truncate|open_flags::wo | open_flags::create));
for (auto& buf : bufs) {
co_await os.write(buf.get(), buf.size());
}
co_await os.flush();
co_await os.close();
}
auto f = co_await open_file_dma(dst, open_flags::ro);
testlog.info("file source {}", (co_await f.stat()).st_size);
auto source = make_file_data_source(std::move(f), file_input_stream_options{});
class random_chunk_source
: public data_source_impl
{
data_source _source;
temporary_buffer<char> _buf;
public:
random_chunk_source(data_source s)
: _source(std::move(s))
{}
future<temporary_buffer<char>> get() override {
if (!_buf.empty()) {
co_return std::exchange(_buf, temporary_buffer<char>{});
}
_buf = co_await _source.get();
if (_buf.empty()) {
co_return temporary_buffer<char>{};
}
auto n = tests::random::get_int(size_t(1), _buf.size());
auto res = _buf.share(0, n);
_buf.trim_front(n);
co_return res;
}
future<temporary_buffer<char>> skip(uint64_t n) override {
if (!_buf.empty()) {
auto m = std::min(n, _buf.size());
_buf.trim_front(m);
n -= m;
}
if (n) {
co_await _source.skip(n);
}
co_return temporary_buffer<char>{};
}
};
try {
auto encrypted_source = data_source(make_encrypted_source(data_source(std::make_unique<random_chunk_source>(std::move(source))), k));
temporary_buffer<char> unified_buff(std::accumulate(srcs.begin(), srcs.end(), size_t(0), [](size_t acc, const auto& buf) { return acc + buf.size(); }));
size_t pos = 0;
for (const auto& src : srcs) {
memcpy(unified_buff.get_write() + pos, src.get(), src.size());
pos += src.size();
}
pos = 0;
while (auto read_buff = co_await encrypted_source.get()) {
auto rem = unified_buff.size() - pos;
BOOST_REQUIRE_LE(read_buff.size(), rem);
size_t size_to_compare = std::min(rem, read_buff.size());
auto v1 = std::string_view(read_buff.get(), size_to_compare);
auto v2 = std::string_view(unified_buff.get() + pos, size_to_compare);
BOOST_REQUIRE_EQUAL(v1, v2);
auto skip = unified_buff.size() - pos > 4113 ? 4097 : (unified_buff.size() - pos)/2;
co_await encrypted_source.skip(skip);
pos += size_to_compare + skip;
}
co_await encrypted_source.close();
} catch (...) {
ex = std::current_exception();
}
if (ex) {
std::rethrow_exception(ex);
}
}
SEASTAR_TEST_CASE(test_encrypted_data_source_simple) {
std::vector<size_t> sizes({3200, 13086, 12065, 200, 11959, 12159, 12852});
co_await test_random_data_source(sizes);
}
SEASTAR_TEST_CASE(test_encrypted_data_source_fuzzy) {
std::mt19937_64 rand_gen(std::random_device{}());
for (auto i = 0; i < 1000; ++i) {
std::uniform_int_distribution<uint16_t> rand_dist(1, 15);
std::vector<size_t> sizes(rand_dist(rand_gen));
for (auto& s : sizes) {
std::uniform_int_distribution<uint16_t> buff_sizes(1, 147*100);
s = buff_sizes(rand_gen);
}
co_await test_random_data_source(sizes);
}
co_return;
}


@@ -1077,41 +1077,49 @@ static future<> test_broken_encrypted_commitlog(const test_provider_args& args,
*/
static future<> network_error_test_helper(const tmpdir& tmp, const std::string& host, std::function<std::tuple<scopts_map, std::string>(const fake_proxy&)> make_opts) {
fake_proxy proxy(host);
std::exception_ptr p;
try {
auto [scopts, yaml] = make_opts(proxy);
auto [scopts, yaml] = make_opts(proxy);
test_provider_args args{
.tmp = tmp,
.extra_yaml = yaml,
.n_tables = 10,
.before_create_table = [&](auto& env) {
// turn off proxy. all key resolution after this should fail
proxy.enable(false);
// wait for key cache expiry.
seastar::sleep(10ms).get();
// ensure commitlog will create a new segment on write -> eventual write failure
env.db().invoke_on_all([](replica::database& db) {
return db.commitlog()->force_new_active_segment();
}).get();
},
.on_insert_exception = [&](auto&&) {
// once we get the exception we have to enable key resolution again,
// otherwise we can't shut down cql test env.
proxy.enable(true);
},
.timeout = timeout_config{
// set really low write timeouts so we get a failure (timeout)
// when we fail to write to commitlog
100ms, 100ms, 100ms, 100ms, 100ms, 100ms, 100ms
},
};
test_provider_args args{
.tmp = tmp,
.extra_yaml = yaml,
.n_tables = 10,
.before_create_table = [&](auto& env) {
// turn off proxy. all key resolution after this should fail
proxy.enable(false);
// wait for key cache expiry.
seastar::sleep(10ms).get();
// ensure commitlog will create a new segment on write -> eventual write failure
env.db().invoke_on_all([](replica::database& db) {
return db.commitlog()->force_new_active_segment();
}).get();
},
.on_insert_exception = [&](auto&&) {
// once we get the exception we have to enable key resolution again,
// otherwise we can't shut down cql test env.
proxy.enable(true);
},
.timeout = timeout_config{
// set really low write timeouts so we get a failure (timeout)
// when we fail to write to commitlog
100ms, 100ms, 100ms, 100ms, 100ms, 100ms, 100ms
},
};
BOOST_REQUIRE_THROW(
co_await test_broken_encrypted_commitlog(args, scopts);
, exceptions::mutation_write_timeout_exception
);
BOOST_REQUIRE_THROW(
co_await test_broken_encrypted_commitlog(args, scopts);
, exceptions::mutation_write_timeout_exception
);
} catch (...) {
p = std::current_exception();
}
co_await proxy.stop();
if (p) {
std::rethrow_exception(p);
}
}
SEASTAR_TEST_CASE(test_kms_network_error, *check_run_test_decorator("ENABLE_KMS_TEST")) {


@@ -14,6 +14,7 @@
#include <functional>
#include <seastar/core/on_internal_error.hh>
#include <seastar/util/defer.hh>
#include <seastar/util/closeable.hh>
#include "locator/types.hh"
#include "test/lib/scylla_test_case.hh"
@@ -213,6 +214,7 @@ SEASTAR_THREAD_TEST_CASE(test_load_sketch) {
.local_dc_rack = locator::endpoint_dc_rack::default_location
}
});
auto stop_stm = deferred_stop(stm);
stm.mutate_token_metadata([&] (token_metadata& tm) {
tm.update_topology(host1, locator::endpoint_dc_rack::default_location, node::state::normal, node1_shard_count);


@@ -280,6 +280,7 @@ void simple_test() {
tm_cfg.topo_cfg.this_endpoint = my_address;
tm_cfg.topo_cfg.local_dc_rack = { snitch.local()->get_datacenter(), snitch.local()->get_rack() };
locator::shared_token_metadata stm([] () noexcept { return db::schema_tables::hold_merge_lock(); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
std::vector<ring_point> ring_points = {
{ 1.0, inet_address("192.100.10.1") },
@@ -363,6 +364,7 @@ void heavy_origin_test() {
locator::shared_token_metadata stm([] () noexcept { return db::schema_tables::hold_merge_lock(); },
locator::token_metadata::config{locator::topology::config{ .local_dc_rack = locator::endpoint_dc_rack::default_location }});
auto stop_stm = deferred_stop(stm);
std::vector<int> dc_racks = {2, 4, 8};
std::vector<int> dc_endpoints = {128, 256, 512};
@@ -476,6 +478,7 @@ SEASTAR_THREAD_TEST_CASE(NetworkTopologyStrategy_tablets_test) {
// Initialize the token_metadata
locator::shared_token_metadata stm([] () noexcept { return db::schema_tables::hold_merge_lock(); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
stm.mutate_token_metadata([&] (token_metadata& tm) -> future<> {
auto& topo = tm.get_topology();
for (const auto& [ring_point, endpoint, id] : ring_points) {
@@ -567,6 +570,7 @@ static void test_random_balancing(sharded<snitch_ptr>& snitch, gms::inet_address
// Initialize the token_metadata
locator::shared_token_metadata stm([] () noexcept { return db::schema_tables::hold_merge_lock(); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
stm.mutate_token_metadata([&] (token_metadata& tm) -> future<> {
auto& topo = tm.get_topology();
for (const auto& [ring_point, endpoint, id] : ring_points) {
@@ -897,6 +901,7 @@ SEASTAR_THREAD_TEST_CASE(testCalculateEndpoints) {
for (size_t run = 0; run < RUNS; ++run) {
semaphore sem(1);
shared_token_metadata stm([&sem] () noexcept { return get_units(sem, 1); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
std::unordered_set<dht::token> random_tokens;
while (random_tokens.size() < nodes.size() * VNODES) {
@@ -1043,6 +1048,7 @@ SEASTAR_THREAD_TEST_CASE(test_topology_compare_endpoints) {
semaphore sem(1);
shared_token_metadata stm([&sem] () noexcept { return get_units(sem, 1); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
stm.mutate_token_metadata([&] (token_metadata& tm) {
auto& topo = tm.get_topology();
generate_topology(topo, datacenters, nodes);
@@ -1087,6 +1093,7 @@ SEASTAR_THREAD_TEST_CASE(test_topology_sort_by_proximity) {
tm_cfg.topo_cfg.local_dc_rack = locator::endpoint_dc_rack::default_location;
semaphore sem(1);
shared_token_metadata stm([&sem] () noexcept { return get_units(sem, 1); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
stm.mutate_token_metadata([&] (token_metadata& tm) -> future<> {
generate_topology(tm.get_topology(), datacenters, nodes);
return make_ready_future();
@@ -1122,6 +1129,7 @@ SEASTAR_THREAD_TEST_CASE(test_topology_tracks_local_node) {
.local_dc_rack = ip1_dc_rack,
}
});
auto stop_stm = deferred_stop(stm);
// get_location() should work before any node is added
@@ -1249,6 +1257,7 @@ SEASTAR_THREAD_TEST_CASE(tablets_simple_rack_aware_view_pairing_test) {
// Initialize the token_metadata
locator::shared_token_metadata stm([] () noexcept { return db::schema_tables::hold_merge_lock(); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
stm.mutate_token_metadata([&] (token_metadata& tm) -> future<> {
auto& topo = tm.get_topology();
for (const auto& [ring_point, endpoint, id] : ring_points) {
@@ -1401,6 +1410,7 @@ void test_complex_rack_aware_view_pairing_test(bool more_or_less) {
// Initialize the token_metadata
locator::shared_token_metadata stm([] () noexcept { return db::schema_tables::hold_merge_lock(); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
stm.mutate_token_metadata([&] (token_metadata& tm) -> future<> {
auto& topo = tm.get_topology();
for (const auto& [ring_point, endpoint, id] : ring_points) {


@@ -15,6 +15,7 @@
#include <seastar/core/reactor.hh>
#include <seastar/core/file.hh>
#include <seastar/core/fstream.hh>
#include <seastar/core/sleep.hh>
#include <seastar/http/exception.hh>
#include <seastar/util/closeable.hh>
#include <seastar/util/short_streams.hh>
@@ -25,6 +26,8 @@
#include "test/lib/test_utils.hh"
#include "test/lib/tmpdir.hh"
#include "utils/assert.hh"
#include "utils/error_injection.hh"
#include "utils/s3/aws_error.hh"
#include "utils/s3/client.hh"
#include "utils/s3/creds.hh"
#include "utils/s3/utils/manip_s3.hh"
@@ -675,6 +678,91 @@ SEASTAR_THREAD_TEST_CASE(test_chunked_download_data_source_proxy) {
test_download_data_source(make_proxy_client, true, 3 * 1024);
}
void test_chunked_download_data_source(const client_maker_function& client_maker, size_t object_size) {
const sstring base_name(fmt::format("test_object-{}", ::getpid()));
tmpdir tmp;
const auto file_path = tmp.path() / base_name;
file f = open_file_dma(file_path.native(), open_flags::create | open_flags::wo).get();
auto output = make_file_output_stream(std::move(f)).get();
for (size_t bytes_written = 0; bytes_written < object_size;) {
auto rnd = tests::random::get_bytes(std::min(object_size - bytes_written, 1024ul));
output.write(reinterpret_cast<char*>(rnd.data()), rnd.size()).get();
bytes_written += rnd.size();
}
output.close().get();
testlog.info("Make client\n");
semaphore mem(16 << 20);
auto cln = client_maker(mem);
auto close_client = deferred_close(*cln);
const auto object_name = fmt::format("/{}/{}", tests::getenv_safe("S3_BUCKET_FOR_TEST"), base_name);
auto delete_object = deferred_delete_object(cln, object_name);
cln->upload_file(file_path, object_name).get();
testlog.info("Download object");
auto in = input_stream<char>(cln->make_chunked_download_source(object_name, s3::full_range));
auto close = seastar::deferred_close(in);
file rf = open_file_dma(file_path.native(), open_flags::ro).get();
auto file_input = make_file_input_stream(std::move(rf));
auto close_file = seastar::deferred_close(file_input);
size_t total_size = 0;
size_t trigger_counter = 0;
while (true) {
// We want the background fiber to fill the buffer queue and start waiting to drain it
seastar::sleep(100us).get();
auto buf = in.read().get();
total_size += buf.size();
if (buf.empty()) {
break;
}
++trigger_counter;
if (trigger_counter % 10 == 0) {
utils::get_local_injector().enable("break_s3_inflight_req", true);
}
auto file_buf = file_input.read_exactly(buf.size()).get();
BOOST_REQUIRE_EQUAL(memcmp(buf.begin(), file_buf.begin(), buf.size()), 0);
}
BOOST_REQUIRE_EQUAL(total_size, object_size);
#ifdef SCYLLA_ENABLE_ERROR_INJECTION
utils::get_local_injector().enable("kill_s3_inflight_req");
auto in_throw = input_stream<char>(cln->make_chunked_download_source(object_name, s3::full_range));
auto close_throw = seastar::deferred_close(in_throw);
auto reader = [&in_throw] {
while (true) {
auto buf = in_throw.read().get();
if (buf.empty()) {
break;
}
}
};
BOOST_REQUIRE_EXCEPTION(
reader(), storage_io_error, [](const storage_io_error& e) {
return e.what() == "S3 request failed. Code: 16. Reason: "sv;
});
#else
testlog.info("Skipping error injection test, as it requires SCYLLA_ENABLE_ERROR_INJECTION to be enabled");
#endif
cln->delete_object(object_name).get();
cln->close().get();
}
SEASTAR_THREAD_TEST_CASE(test_chunked_download_data_source_with_delays_minio) {
test_chunked_download_data_source(make_minio_client, 20_MiB);
}
SEASTAR_THREAD_TEST_CASE(test_chunked_download_data_source_with_delays_proxy) {
test_chunked_download_data_source(make_proxy_client, 20_MiB);
}
void test_object_copy(const client_maker_function& client_maker, size_t chunk_size, size_t chunks) {
const sstring name(fmt::format("/{}/testobject-{}", tests::getenv_safe("S3_BUCKET_FOR_TEST"), ::getpid()));
const sstring name_copy(fmt::format("/{}/testobject-{}-copy", tests::getenv_safe("S3_BUCKET_FOR_TEST"), ::getpid()));


@@ -14,7 +14,9 @@
#include <fmt/std.h>
#include <seastar/core/future.hh>
#include <seastar/util/closeable.hh>
#include "seastarx.hh"
#include "service/qos/qos_common.hh"
#include "test/lib/scylla_test_case.hh"
#include "test/lib/test_utils.hh"
@@ -107,6 +109,7 @@ SEASTAR_THREAD_TEST_CASE(subscriber_simple) {
sl_options.shares.emplace<int32_t>(1000);
scheduling_group default_scheduling_group = create_scheduling_group("sl_default_sg", 1.0).get();
locator::shared_token_metadata tm({}, {locator::topology::config{ .local_dc_rack = locator::endpoint_dc_rack::default_location }});
auto stop_tm = deferred_stop(tm);
sharded<abort_source> as;
as.start().get();
auto stop_as = defer([&as] { as.stop().get(); });
@@ -180,6 +183,7 @@ SEASTAR_THREAD_TEST_CASE(too_many_service_levels) {
sl_options.workload = service_level_options::workload_type::interactive;
scheduling_group default_scheduling_group = create_scheduling_group("sl_default_sg1", 1.0).get();
locator::shared_token_metadata tm({}, {locator::topology::config{ .local_dc_rack = locator::endpoint_dc_rack::default_location }});
auto stop_tm = deferred_stop(tm);
sharded<abort_source> as;
as.start().get();
auto stop_as = defer([&as] { as.stop().get(); });
@@ -256,6 +260,7 @@ SEASTAR_THREAD_TEST_CASE(add_remove_bad_sequence) {
sl_options.shares.emplace<int32_t>(1000);
scheduling_group default_scheduling_group = create_scheduling_group("sl_default_sg3", 1.0).get();
locator::shared_token_metadata tm({}, {locator::topology::config{ .local_dc_rack = locator::endpoint_dc_rack::default_location }});
auto stop_tm = deferred_stop(tm);
sharded<abort_source> as;
as.start().get();
auto stop_as = defer([&as] { as.stop().get(); });
@@ -282,6 +287,7 @@ SEASTAR_THREAD_TEST_CASE(verify_unset_shares_in_cache_when_service_level_created
sl_options.shares.emplace<int32_t>(1000);
scheduling_group default_scheduling_group = create_scheduling_group("sl_default_sg", 1.0).get();
locator::shared_token_metadata tm({}, {locator::topology::config{ .local_dc_rack = locator::endpoint_dc_rack::default_location }});
auto stop_tm = deferred_stop(tm);
sharded<abort_source> as;
as.start().get();


@@ -672,7 +672,10 @@ SEASTAR_TEST_CASE(test_skipping_in_compressed_stream) {
auto make_is = [&] {
f = open_file_dma(file_path, open_flags::ro).get();
return make_compressed_file_m_format_input_stream(f, &c, 0, uncompressed_size, opts, semaphore.make_permit(), std::nullopt);
auto stream_creator = [f](uint64_t pos, uint64_t len, file_input_stream_options options)->future<input_stream<char>> {
co_return input_stream<char>(make_file_data_source(std::move(f), pos, len, std::move(options)));
};
return make_compressed_file_m_format_input_stream(stream_creator, &c, 0, uncompressed_size, opts, semaphore.make_permit(), std::nullopt);
};
auto expect = [] (input_stream<char>& in, const temporary_buffer<char>& buf) {


@@ -52,9 +52,11 @@ SEASTAR_TEST_CASE(test_get_restricted_ranges) {
}
};
auto& stm = e.shared_token_metadata().local();
{
// Ring with minimum token
auto tmptr = locator::make_token_metadata_ptr(locator::token_metadata::config{e.shared_token_metadata().local().get()->get_topology().get_config()});
auto tmptr = stm.make_token_metadata_ptr();
const auto host_id = locator::host_id{utils::UUID(0, 1)};
tmptr->update_topology(host_id, locator::endpoint_dc_rack{"dc1", "rack1"}, locator::node::state::normal);
tmptr->update_normal_tokens(std::unordered_set<dht::token>({dht::minimum_token()}), host_id).get();
@@ -69,7 +71,7 @@ SEASTAR_TEST_CASE(test_get_restricted_ranges) {
}
{
auto tmptr = locator::make_token_metadata_ptr(locator::token_metadata::config{e.shared_token_metadata().local().get()->get_topology().get_config()});
auto tmptr = stm.make_token_metadata_ptr();
const auto id1 = locator::host_id{utils::UUID(0, 1)};
const auto id2 = locator::host_id{utils::UUID(0, 2)};
tmptr->update_topology(id1, locator::endpoint_dc_rack{"dc1", "rack1"}, locator::node::state::normal);


@@ -799,6 +799,7 @@ SEASTAR_TEST_CASE(test_get_shard) {
.local_dc_rack = locator::endpoint_dc_rack::default_location
}
});
auto stop_stm = deferred_stop(stm);
tablet_id tid(0);
tablet_id tid1(0);
@@ -1048,7 +1049,7 @@ SEASTAR_TEST_CASE(test_sharder) {
auto table1 = table_id(utils::UUID_gen::get_time_UUID());
token_metadata tokm(token_metadata::config{ .topo_cfg{ .this_host_id = h1, .local_dc_rack = locator::endpoint_dc_rack::default_location } });
token_metadata tokm(e.get_shared_token_metadata().local(), token_metadata::config{ .topo_cfg{ .this_host_id = h1, .local_dc_rack = locator::endpoint_dc_rack::default_location } });
tokm.get_topology().add_or_update_endpoint(h1);
std::vector<tablet_id> tablet_ids;
@@ -1263,7 +1264,14 @@ SEASTAR_TEST_CASE(test_intranode_sharding) {
auto table1 = table_id(utils::UUID_gen::get_time_UUID());
token_metadata tokm(token_metadata::config{ .topo_cfg{ .this_host_id = h1, .local_dc_rack = locator::endpoint_dc_rack::default_location } });
locator::token_metadata::config tm_cfg;
tm_cfg.topo_cfg.this_host_id = h1;
tm_cfg.topo_cfg.local_dc_rack = endpoint_dc_rack::default_location;
semaphore sem(1);
shared_token_metadata stm([&] () noexcept { return get_units(sem, 1); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
auto tmptr = stm.make_token_metadata_ptr();
auto& tokm = *tmptr;
tokm.get_topology().add_or_update_endpoint(h1);
auto leaving_replica = tablet_replica{h1, 5};
@@ -3606,6 +3614,7 @@ static void execute_tablet_for_new_rf_test(calculate_tablet_replicas_for_new_rf_
tm_cfg.topo_cfg.local_dc_rack = { snitch.local()->get_datacenter(), snitch.local()->get_rack() };
tm_cfg.topo_cfg.this_host_id = test_config.ring_points[0].id;
locator::shared_token_metadata stm([] () noexcept { return db::schema_tables::hold_merge_lock(); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
// Initialize the token_metadata
stm.mutate_token_metadata([&] (token_metadata& tm) -> future<> {


@@ -48,6 +48,6 @@ custom_args:
bloom_filter_test:
- '-c1'
s3_test:
- '-c2 -m2G --logger-log-level s3=trace --logger-log-level http=trace'
- '-c2 -m2G --logger-log-level s3=trace --logger-log-level http=trace --logger-log-level default_retry_strategy=trace'
run_in_debug:
- logalloc_standard_allocator_segment_pool_backend_test


@@ -8,6 +8,7 @@
#include <boost/test/unit_test.hpp>
#include <fmt/ranges.h>
#include <seastar/util/closeable.hh>
#include "test/lib/scylla_test_case.hh"
#include "test/lib/test_utils.hh"
#include "locator/token_metadata.hh"
@@ -31,13 +32,11 @@ namespace {
};
}
mutable_token_metadata_ptr create_token_metadata(host_id this_host_id) {
return make_lw_shared<token_metadata>(token_metadata::config {
topology::config {
.this_host_id = this_host_id,
.local_dc_rack = get_dc_rack(this_host_id)
}
});
token_metadata::config create_token_metadata_config(host_id this_host_id) {
return token_metadata::config{topology::config{
.this_host_id = this_host_id,
.local_dc_rack = get_dc_rack(this_host_id)
}};
}
template <typename Strategy>
@@ -55,7 +54,11 @@ SEASTAR_THREAD_TEST_CASE(test_pending_and_read_endpoints_for_everywhere_strategy
const auto t1 = dht::token::from_int64(10);
const auto t2 = dht::token::from_int64(20);
auto token_metadata = create_token_metadata(e1_id);
semaphore sem(1);
auto tm_cfg = create_token_metadata_config(e1_id);
shared_token_metadata stm([&] () noexcept { return get_units(sem, 1); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
auto token_metadata = stm.make_token_metadata_ptr();
token_metadata->update_topology(e1_id, get_dc_rack(e1_id), node::state::normal);
token_metadata->update_topology(e2_id, get_dc_rack(e2_id), node::state::normal);
token_metadata->update_normal_tokens({t1}, e1_id).get();
@@ -75,7 +78,11 @@ SEASTAR_THREAD_TEST_CASE(test_pending_endpoints_for_bootstrap_second_node) {
const auto e1_id = gen_id(1);
const auto e2_id = gen_id(2);
auto token_metadata = create_token_metadata(e1_id);
semaphore sem(1);
auto tm_cfg = create_token_metadata_config(e1_id);
shared_token_metadata stm([&] () noexcept { return get_units(sem, 1); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
auto token_metadata = stm.make_token_metadata_ptr();
token_metadata->update_topology(e1_id, get_dc_rack(e1_id), node::state::normal);
token_metadata->update_topology(e2_id, get_dc_rack(e2_id), node::state::normal);
token_metadata->update_normal_tokens({t1}, e1_id).get();
@@ -103,7 +110,11 @@ SEASTAR_THREAD_TEST_CASE(test_pending_endpoints_for_bootstrap_with_replicas) {
const auto e2_id = gen_id(2);
const auto e3_id = gen_id(3);
auto token_metadata = create_token_metadata(e1_id);
semaphore sem(1);
auto tm_cfg = create_token_metadata_config(e1_id);
shared_token_metadata stm([&] () noexcept { return get_units(sem, 1); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
auto token_metadata = stm.make_token_metadata_ptr();
token_metadata->update_topology(e1_id, get_dc_rack(e1_id), node::state::normal);
token_metadata->update_topology(e2_id, get_dc_rack(e2_id), node::state::normal);
token_metadata->update_topology(e3_id, get_dc_rack(e3_id), node::state::normal);
@@ -133,7 +144,11 @@ SEASTAR_THREAD_TEST_CASE(test_pending_endpoints_for_leave_with_replicas) {
const auto e2_id = gen_id(2);
const auto e3_id = gen_id(3);
auto token_metadata = create_token_metadata(e1_id);
semaphore sem(1);
auto tm_cfg = create_token_metadata_config(e1_id);
shared_token_metadata stm([&] () noexcept { return get_units(sem, 1); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
auto token_metadata = stm.make_token_metadata_ptr();
token_metadata->update_topology(e1_id, get_dc_rack(e1_id), node::state::normal);
token_metadata->update_topology(e2_id, get_dc_rack(e2_id), node::state::normal);
token_metadata->update_topology(e3_id, get_dc_rack(e3_id), node::state::normal);
@@ -165,7 +180,11 @@ SEASTAR_THREAD_TEST_CASE(test_pending_endpoints_for_replace_with_replicas) {
const auto e3_id = gen_id(3);
const auto e4_id = gen_id(4);
auto token_metadata = create_token_metadata(e1_id);
semaphore sem(1);
auto tm_cfg = create_token_metadata_config(e1_id);
shared_token_metadata stm([&] () noexcept { return get_units(sem, 1); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
auto token_metadata = stm.make_token_metadata_ptr();
token_metadata->update_topology(e1_id, get_dc_rack(e1_id), node::state::normal);
token_metadata->update_topology(e2_id, get_dc_rack(e2_id), node::state::normal);
token_metadata->update_topology(e3_id, get_dc_rack(e3_id), node::state::normal);
@@ -201,7 +220,11 @@ SEASTAR_THREAD_TEST_CASE(test_endpoints_for_reading_when_bootstrap_with_replicas
const auto e2_id = gen_id(2);
const auto e3_id = gen_id(3);
auto token_metadata = create_token_metadata(e1_id);
semaphore sem(1);
auto tm_cfg = create_token_metadata_config(e1_id);
shared_token_metadata stm([&] () noexcept { return get_units(sem, 1); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
auto token_metadata = stm.make_token_metadata_ptr();
token_metadata->update_topology(e1_id, get_dc_rack(e1_id), node::state::normal);
token_metadata->update_topology(e2_id, get_dc_rack(e2_id), node::state::normal);
token_metadata->update_topology(e3_id, get_dc_rack(e3_id), node::state::normal);
@@ -254,7 +277,11 @@ SEASTAR_THREAD_TEST_CASE(test_replace_node_with_same_endpoint) {
const auto e1_id1 = gen_id(1);
const auto e1_id2 = gen_id(2);
auto token_metadata = create_token_metadata(e1_id2);
semaphore sem(1);
auto tm_cfg = create_token_metadata_config(e1_id2);
shared_token_metadata stm([&] () noexcept { return get_units(sem, 1); }, tm_cfg);
auto stop_stm = deferred_stop(stm);
auto token_metadata = stm.make_token_metadata_ptr();
token_metadata->update_topology(e1_id1, get_dc_rack(e1_id1), node::state::being_replaced);
token_metadata->update_normal_tokens({t1}, e1_id1).get();


@@ -60,7 +60,7 @@ async def test_simple_backup(manager: ManagerClient, s3_server):
'experimental_features': ['keyspace-storage-options'],
'task_ttl_in_seconds': 300
}
cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace']
cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace:api=info']
server = await manager.server_add(config=cfg, cmdline=cmd)
ks, cf = await prepare_snapshot_for_backup(manager, server)
@@ -101,7 +101,7 @@ async def test_backup_move(manager: ManagerClient, s3_server, move_files):
'experimental_features': ['keyspace-storage-options'],
'task_ttl_in_seconds': 300
}
cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace']
cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace:api=info']
server = await manager.server_add(config=cfg, cmdline=cmd)
ks, cf = await prepare_snapshot_for_backup(manager, server)
@@ -135,7 +135,7 @@ async def test_backup_to_non_existent_bucket(manager: ManagerClient, s3_server):
'experimental_features': ['keyspace-storage-options'],
'task_ttl_in_seconds': 300
}
cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace']
cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace:api=info']
server = await manager.server_add(config=cfg, cmdline=cmd)
ks, cf = await prepare_snapshot_for_backup(manager, server)
@@ -187,7 +187,7 @@ async def do_test_backup_abort(manager: ManagerClient, s3_server,
'experimental_features': ['keyspace-storage-options'],
'task_ttl_in_seconds': 300
}
cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace']
cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace:api=info']
server = await manager.server_add(config=cfg, cmdline=cmd)
ks, cf = await prepare_snapshot_for_backup(manager, server)
@@ -240,7 +240,7 @@ async def test_backup_to_non_existent_snapshot(manager: ManagerClient, s3_server
'experimental_features': ['keyspace-storage-options'],
'task_ttl_in_seconds': 300
}
cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace']
cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace:api=info']
server = await manager.server_add(config=cfg, cmdline=cmd)
ks, cf = await prepare_snapshot_for_backup(manager, server)
@@ -276,7 +276,7 @@ async def test_backup_is_abortable_in_s3_client(manager: ManagerClient, s3_serve
await do_test_backup_abort(manager, s3_server, breakpoint_name="backup_task_pre_upload", min_files=0, max_files=1)
async def do_test_simple_backup_and_restore(manager: ManagerClient, s3_server, do_abort = False):
async def do_test_simple_backup_and_restore(manager: ManagerClient, s3_server, tmpdir, do_encrypt = False, do_abort = False):
'''check that restoring from backed up snapshot for a keyspace:table works'''
objconf = MinioServer.create_conf(s3_server.address, s3_server.port, s3_server.region)
@@ -285,7 +285,14 @@ async def do_test_simple_backup_and_restore(manager: ManagerClient, s3_server, d
'experimental_features': ['keyspace-storage-options'],
'task_ttl_in_seconds': 300
}
-cmd = ['--logger-log-level', 'sstables_loader=debug:sstable_directory=trace:snapshots=trace:s3=trace:sstable=debug:http=debug']
+if do_encrypt:
+    d = tmpdir / "system_keys"
+    d.mkdir()
+    cfg = cfg | {
+        'system_key_directory': str(d),
+        'user_info_encryption': { 'enabled': True, 'key_provider': 'LocalFileSystemKeyProviderFactory' }
+    }
+cmd = ['--logger-log-level', 'sstables_loader=debug:sstable_directory=trace:snapshots=trace:s3=trace:sstable=debug:http=debug:encryption=debug:api=info']
server = await manager.server_add(config=cfg, cmdline=cmd)
cql = manager.get_cql()
@@ -383,9 +390,15 @@ async def do_test_simple_backup_and_restore(manager: ManagerClient, s3_server, d
assert objects == post_objects
@pytest.mark.asyncio
-async def test_simple_backup_and_restore(manager: ManagerClient, s3_server):
+async def test_simple_backup_and_restore(manager: ManagerClient, s3_server, tmp_path):
 '''check that restoring from backed up snapshot for a keyspace:table works'''
-await do_test_simple_backup_and_restore(manager, s3_server, False)
+await do_test_simple_backup_and_restore(manager, s3_server, tmp_path, False, False)
+@pytest.mark.asyncio
+async def test_abort_simple_backup_and_restore(manager: ManagerClient, s3_server, tmp_path):
+'''check that restoring from backed up snapshot for a keyspace:table works'''
+await do_test_simple_backup_and_restore(manager, s3_server, tmp_path, False, True)
async def do_abort_restore(manager: ManagerClient, s3_server):
@@ -531,10 +544,9 @@ async def test_abort_restore_with_rpc_error(manager: ManagerClient, s3_server):
@pytest.mark.asyncio
-async def test_abort_simple_backup_and_restore(manager: ManagerClient, s3_server):
+async def test_simple_backup_and_restore_with_encryption(manager: ManagerClient, s3_server, tmp_path):
'''check that restoring from backed up snapshot for a keyspace:table works'''
-await do_test_simple_backup_and_restore(manager, s3_server, True)
+await do_test_simple_backup_and_restore(manager, s3_server, tmp_path, True, False)
# Helper class to parametrize the test below
class topo:
@@ -552,7 +564,7 @@ async def create_cluster(topology, rf_rack_valid_keyspaces, manager, logger, s3_
objconf = MinioServer.create_conf(s3_server.address, s3_server.port, s3_server.region)
cfg['object_storage_endpoints'] = objconf
-cmd = [ '--logger-log-level', 'sstables_loader=debug:sstable_directory=trace:snapshots=trace:s3=trace:sstable=debug:http=debug' ]
+cmd = [ '--logger-log-level', 'sstables_loader=debug:sstable_directory=trace:snapshots=trace:s3=trace:sstable=debug:http=debug:api=info' ]
servers = []
host_ids = {}
@@ -715,7 +727,7 @@ async def test_restore_with_non_existing_sstable(manager: ManagerClient, s3_serv
'experimental_features': ['keyspace-storage-options'],
'task_ttl_in_seconds': 300
}
-cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace']
+cmd = ['--logger-log-level', 'snapshots=trace:task_manager=trace:api=info']
server = await manager.server_add(config=cfg, cmdline=cmd)
cql = manager.get_cql()
print('Create keyspace')


@@ -0,0 +1,204 @@
#
# Copyright (C) 2025-present ScyllaDB
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#
import asyncio
import pytest
import logging
import time
from test.pylib.manager_client import ManagerClient
from test.pylib.util import wait_for
from test.cluster.util import new_test_keyspace, reconnect_driver, wait_for_cql_and_get_hosts
from test.cluster.conftest import skip_mode
logger = logging.getLogger(__name__)
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_batchlog_replay_while_a_node_is_down(manager: ManagerClient) -> None:
""" Test that batchlog replay handles the case when a node is down while replaying a batch.
Reproduces issue #24599.
1. Create a cluster with 3 nodes.
2. Write a batch and inject an error to fail it before it's removed from the batchlog, so it
needs to be replayed.
3. Stop server 1.
4. Server 0 tries to replay the batch. It sends the mutation to all replicas, but one of them is down,
so it should fail.
5. Bring server 1 back up.
6. Verify that the batch is replayed and removed from the batchlog eventually.
"""
cmdline=['--logger-log-level', 'batchlog_manager=trace']
config = {'error_injections_at_startup': ['short_batchlog_manager_replay_interval'], 'write_request_timeout_in_ms': 2000}
servers = await manager.servers_add(3, config=config, cmdline=cmdline, auto_rack_dc="dc1")
cql, hosts = await manager.get_ready_cql(servers)
async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}") as ks:
await cql.run_async(f"CREATE TABLE {ks}.tab (key int, c int, v int, PRIMARY KEY (key, c))")
await asyncio.gather(*[manager.api.enable_injection(s.ip_addr, "storage_proxy_fail_remove_from_batchlog", one_shot=False) for s in servers])
# make sure the batch is replayed only after the server is stopped
await asyncio.gather(*[manager.api.enable_injection(s.ip_addr, "skip_batch_replay", one_shot=False) for s in servers])
s0_log = await manager.server_open_log(servers[0].server_id)
try:
await cql.run_async(f"BEGIN BATCH INSERT INTO {ks}.tab (key, c, v) VALUES (0,0,0); INSERT INTO {ks}.tab (key, c, v) VALUES (1,1,1); APPLY BATCH")
except Exception as e:
# injected error is expected
logger.error(f"Error executing batch: {e}")
await asyncio.gather(*[manager.api.disable_injection(s.ip_addr, "storage_proxy_fail_remove_from_batchlog") for s in servers])
await manager.server_stop(servers[1].server_id)
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog", host=hosts[0]))[0].count
assert batchlog_row_count > 0
await asyncio.gather(*[manager.api.disable_injection(s.ip_addr, "skip_batch_replay") for s in servers if s != servers[1]])
# The batch is replayed while server 1 is down
await s0_log.wait_for('Replaying batch', timeout=60)
await asyncio.sleep(1)
# Bring server 1 back up and verify that eventually the batch is replayed and removed from the batchlog
await manager.server_start(servers[1].server_id)
s0_mark = await s0_log.mark()
await s0_log.wait_for('Finished replayAllFailedBatches', timeout=60, from_mark=s0_mark)
async def batchlog_empty() -> bool:
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog", host=hosts[0]))[0].count
if batchlog_row_count == 0:
return True
await wait_for(batchlog_empty, time.time() + 60)
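The `batchlog_empty` check above relies on the repository's `wait_for` polling helper, which (as used here) treats a truthy return as success and an implicit `None` as "keep polling". A minimal self-contained sketch of that contract, with the CQL count replaced by a stub:

```python
import asyncio
import time

async def wait_for(predicate, deadline, period=0.01):
    """Poll an async predicate until it returns a non-None value or the deadline passes."""
    while True:
        value = await predicate()
        if value is not None:
            return value
        if time.time() > deadline:
            raise TimeoutError("wait_for timed out")
        await asyncio.sleep(period)

async def main():
    state = {"rows": 3}          # stand-in for SELECT COUNT(*) FROM system.batchlog

    async def batchlog_empty():
        state["rows"] -= 1       # simulate the batchlog draining between polls
        if state["rows"] == 0:
            return True

    return await wait_for(batchlog_empty, time.time() + 5)

result = asyncio.run(main())
```

The signature and truthy-or-None convention are assumptions inferred from the call sites; the real helper lives in `test.pylib.util`.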
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_batchlog_replay_aborted_on_shutdown(manager: ManagerClient) -> None:
""" Similar to the previous test, but also verifies that the batchlog replay is aborted on shutdown,
and node shutdown is not stuck.
1. Create a cluster with 3 nodes.
2. Write a batch and inject an error to fail it before it's removed from the batchlog, so it
needs to be replayed.
3. Stop server 1.
4. Server 0 tries to replay the batch. It sends the mutation to all replicas, but one of them is down,
so it should fail.
5. Shut down server 0 gracefully, which should abort the batchlog replay which is in progress.
6. Bring server 0 and server 1 back up.
7. Verify that the batch is replayed and removed from the batchlog eventually.
"""
cmdline=['--logger-log-level', 'batchlog_manager=trace']
config = {'error_injections_at_startup': ['short_batchlog_manager_replay_interval'], 'write_request_timeout_in_ms': 2000}
servers = await manager.servers_add(3, config=config, cmdline=cmdline, auto_rack_dc="dc1")
cql, hosts = await manager.get_ready_cql(servers)
async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}") as ks:
await cql.run_async(f"CREATE TABLE {ks}.tab (key int, c int, v int, PRIMARY KEY (key, c))")
await asyncio.gather(*[manager.api.enable_injection(s.ip_addr, "storage_proxy_fail_remove_from_batchlog", one_shot=False) for s in servers])
# make sure the batch is replayed only after the server is stopped
await asyncio.gather(*[manager.api.enable_injection(s.ip_addr, "skip_batch_replay", one_shot=False) for s in servers])
s0_log = await manager.server_open_log(servers[0].server_id)
try:
await cql.run_async(f"BEGIN BATCH INSERT INTO {ks}.tab (key, c, v) VALUES (0,0,0); INSERT INTO {ks}.tab (key, c, v) VALUES (1,1,1); APPLY BATCH")
except Exception as e:
# injected error is expected
logger.error(f"Error executing batch: {e}")
await asyncio.gather(*[manager.api.disable_injection(s.ip_addr, "storage_proxy_fail_remove_from_batchlog") for s in servers])
await manager.server_stop(servers[1].server_id)
await asyncio.gather(*[manager.api.disable_injection(s.ip_addr, "skip_batch_replay") for s in servers if s != servers[1]])
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog", host=hosts[0]))[0].count
assert batchlog_row_count > 0
# The batch is replayed while server 1 is down
await s0_log.wait_for('Replaying batch', timeout=60)
await asyncio.sleep(1)
# verify shutdown is not stuck
await manager.server_stop_gracefully(servers[0].server_id)
await manager.server_start(servers[0].server_id)
await manager.server_start(servers[1].server_id)
cql = await reconnect_driver(manager)
hosts = await wait_for_cql_and_get_hosts(cql, servers, time.time() + 60)
async def batchlog_empty() -> bool:
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog", host=hosts[0]))[0].count
if batchlog_row_count == 0:
return True
await wait_for(batchlog_empty, time.time() + 60)
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_batchlog_replay_includes_cdc(manager: ManagerClient) -> None:
""" Test that when a batch is replayed from the batchlog, it includes CDC mutations.
1. Create a cluster with a single node.
2. Create a table with CDC enabled.
3. Write a batch and inject an error to fail it after it's written to the batchlog but before the mutation is applied.
4. Wait for the batch to be replayed.
5. Verify that the data is written to the base table.
6. Verify that CDC mutations are also applied and visible in the CDC log table.
"""
cmdline = ['--logger-log-level', 'batchlog_manager=trace']
config = {'error_injections_at_startup': ['short_batchlog_manager_replay_interval'], 'write_request_timeout_in_ms': 2000}
servers = await manager.servers_add(1, config=config, cmdline=cmdline)
cql, hosts = await manager.get_ready_cql(servers)
async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'enabled': false}") as ks:
# Create table with CDC enabled
await cql.run_async(f"CREATE TABLE {ks}.tab (key int, c int, v int, PRIMARY KEY (key, c)) WITH cdc = {{'enabled': true}}")
# Enable error injection to make the batch fail after writing to batchlog
await manager.api.enable_injection(servers[0].ip_addr, "storage_proxy_fail_remove_from_batchlog", one_shot=False)
# Execute a batch that will fail due to injection but be written to batchlog
try:
await cql.run_async(
"BEGIN BATCH " +
f"INSERT INTO {ks}.tab(key, c, v) VALUES (10, 20, 30); " +
f"INSERT INTO {ks}.tab(key, c, v) VALUES (40, 50, 60); " +
"APPLY BATCH"
)
except Exception as e:
logger.info(f"Expected error executing batch: {e}")
await manager.api.disable_injection(servers[0].ip_addr, "storage_proxy_fail_remove_from_batchlog")
# Wait for data to appear in the base table
async def data_written():
result1 = await cql.run_async(f"SELECT * FROM {ks}.tab WHERE key = 10 AND c = 20")
result2 = await cql.run_async(f"SELECT * FROM {ks}.tab WHERE key = 40 AND c = 50")
if len(result1) > 0 and len(result2) > 0:
return True
await wait_for(data_written, time.time() + 60)
# Check that CDC log table exists and has the CDC mutations
cdc_table_name = f"{ks}.tab_scylla_cdc_log"
# Wait for CDC mutations to be visible
async def cdc_data_present():
result1 = await cql.run_async(f"SELECT * FROM {cdc_table_name} WHERE key = 10 ALLOW FILTERING")
result2 = await cql.run_async(f"SELECT * FROM {cdc_table_name} WHERE key = 40 ALLOW FILTERING")
if len(result1) > 0 and len(result2) > 0:
return True
await wait_for(cdc_data_present, time.time() + 60)
result1 = await cql.run_async(f"SELECT * FROM {cdc_table_name} WHERE key = 10 ALLOW FILTERING")
assert len(result1) == 1, f"Expected 1 CDC mutation for key 10, got {len(result1)}"
result2 = await cql.run_async(f"SELECT * FROM {cdc_table_name} WHERE key = 40 ALLOW FILTERING")
assert len(result2) == 1, f"Expected 1 CDC mutation for key 40, got {len(result2)}"
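The test derives the CDC log table name by appending the fixed `_scylla_cdc_log` suffix to the base table name; a trivial helper capturing that convention:

```python
def cdc_log_table(keyspace: str, table: str) -> str:
    """Fully qualified name of Scylla's CDC log table for a base table."""
    return f"{keyspace}.{table}_scylla_cdc_log"

name = cdc_log_table("ks", "tab")
```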


@@ -1,13 +1,8 @@
from test.pylib.manager_client import ManagerClient
from test.pylib.rest_client import inject_error
from test.cluster.util import check_token_ring_and_group0_consistency
from test.cluster.conftest import skip_mode
import logging
import pytest
import asyncio
logger = logging.getLogger(__name__)
"""
The injection forces the topology coordinator to send CDC generation data in multiple parts,
@@ -36,40 +31,3 @@ async def test_send_data_in_parts(manager: ManagerClient):
break
else:
pytest.fail("No CDC generation data sent in parts was found")
-@pytest.mark.asyncio
-@skip_mode('release', 'error injections are not supported in release mode')
-async def test_group0_apply_while_node_is_being_shutdown(manager: ManagerClient):
-    # This is a regression test for #24401.
-    logger.info("Starting s0")
-    s0 = await manager.server_add(cmdline=['--logger-log-level', 'raft_group0=debug'])
-    logger.info("Injecting topology_state_load_before_update_cdc into s0")
-    await manager.api.enable_injection(s0.ip_addr, "topology_state_load_before_update_cdc", False)
-    logger.info("Starting s1")
-    s1_start_task = asyncio.create_task(manager.server_add())
-    logger.info("Waiting for topology_state_load_before_update_cdc on s0")
-    log = await manager.server_open_log(s0.server_id)
-    await log.wait_for('topology_state_load_before_update_cdc hit, wait for message')
-    logger.info("Triggering s0 shutdown")
-    stop_s0_task = asyncio.create_task(manager.server_stop_gracefully(s0.server_id))
-    logger.info("Waiting for group0 to start aborting on s0")
-    await log.wait_for('Raft group0 service is aborting...')
-    logger.info("Releasing topology_state_load_before_update_cdc on s0")
-    await manager.api.message_injection(s0.ip_addr, 'topology_state_load_before_update_cdc')
-    await stop_s0_task
-    try:
-        await s1_start_task
-    except Exception:
-        pass  # ignore errors, since we don't care
-    errors = await log.grep_for_errors()
-    assert errors == []


@@ -0,0 +1,75 @@
#
# Copyright (C) 2025-present ScyllaDB
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#
from test.pylib.manager_client import ManagerClient
from test.cluster.util import new_test_keyspace
from cassandra.protocol import InvalidRequest
import asyncio
import logging
import threading
import pytest
logger = logging.getLogger(__name__)
@pytest.mark.asyncio
async def test_add_and_drop_column_with_cdc(manager: ManagerClient):
""" Test writing to a table with CDC enabled while adding and dropping a column.
In particular we are interested in the behavior when the schemas of the base table
and the CDC log may not be in sync, and we write a value to a column that exists
in the base table but not in the CDC table.
Reproduces #24952
"""
servers = await manager.servers_add(3)
cql = manager.get_cql()
async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3} AND tablets = {'enabled': false}") as ks:
await cql.run_async(f"CREATE TABLE {ks}.test (pk int PRIMARY KEY, v int) WITH cdc={{'enabled': true}}")
# Sleep before CDC augmentation: we want a write that starts with some base schema, then have
# the table altered while the write is in progress, so the CDC augmentation uses a new schema
# that is not compatible with the base schema.
await asyncio.gather(*[manager.api.enable_injection(s.ip_addr, "sleep_before_cdc_augmentation", one_shot=False) for s in servers])
# The writer thread writes to column 'a' while it is being added and dropped.
# We want to write a value to that column at its different stages, when it may exist
# in one table but not in the other.
stop_writer = threading.Event()
writer_error = threading.Event()
def do_writes():
i = 0
try:
while not stop_writer.is_set():
try:
cql.execute(f"INSERT INTO {ks}.test(pk, v, a) VALUES({i}, {i+1}, {i+2})")
except InvalidRequest as e:
if "Unknown identifier" in str(e) or "does not have base column" in str(e):
pass
else:
raise
i += 1
except Exception as e:
logger.error(f"Unexpected error while writing to {ks}.test: {e}")
writer_error.set()
writer_thread = threading.Thread(target=do_writes)
writer_thread.start()
await cql.run_async(f"ALTER TABLE {ks}.test ADD a int")
await asyncio.sleep(1)
await cql.run_async(f"ALTER TABLE {ks}.test DROP a")
stop_writer.set()
writer_thread.join()
if writer_error.is_set():
pytest.fail("Unexpected error occurred during writes to the table")
base_rows = await cql.run_async(f"SELECT COUNT(*) FROM {ks}.test")
cdc_rows = await cql.run_async(f"SELECT COUNT(*) FROM {ks}.test_scylla_cdc_log")
assert base_rows[0].count == cdc_rows[0].count, f"Base table rows: {base_rows[0].count}, CDC log rows: {cdc_rows[0].count}"
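The writer thread above follows a common stop-flag pattern: loop until a `threading.Event` is set, swallow only the expected errors, and record anything unexpected in a second event. A self-contained sketch of that pattern, with the CQL call replaced by a stub:

```python
import threading
import time

def start_writer(write_once, is_expected_error):
    """Run write_once(i) in a background thread until stopped.

    Expected errors are swallowed; anything else sets the error event,
    mirroring the stop_writer/writer_error pair in the test above."""
    stop = threading.Event()
    error = threading.Event()

    def loop():
        i = 0
        try:
            while not stop.is_set():
                try:
                    write_once(i)
                except Exception as e:
                    if not is_expected_error(e):
                        raise
                i += 1
        except Exception:
            error.set()

    thread = threading.Thread(target=loop)
    thread.start()
    return stop, error, thread

writes = []

def write_once(i):
    if i == 5:
        # stands in for InvalidRequest("Unknown identifier ...") during the ALTER
        raise ValueError("Unknown identifier")
    writes.append(i)

stop, error, thread = start_writer(write_once, lambda e: "Unknown identifier" in str(e))
time.sleep(0.05)
stop.set()
thread.join()
```

`start_writer` and the stub errors are illustrative names, not part of the test framework.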


@@ -0,0 +1,51 @@
#
# Copyright (C) 2025-present ScyllaDB
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#
import asyncio
import logging
import pytest
import time
from cassandra.cluster import ConsistencyLevel
from test.cluster.dtest.alternator_utils import random_string
from test.cluster.util import new_test_keyspace
from test.pylib.manager_client import ManagerClient
logger = logging.getLogger(__name__)
@pytest.mark.asyncio
async def test_streaming_deadlock_removenode(request, manager: ManagerClient):
# Force removenode to exercise range_streamer and not repair.
# The bug is in streaming, and manifests when senders are on different nodes
# and receivers are cross-located (B->C, C->B).
cfg = {
'rf_rack_valid_keyspaces': False,
'tablets_mode_for_new_keyspaces': 'disabled',
'maintenance_reader_concurrency_semaphore_count_limit': 1,
'enable_repair_based_node_ops': False,
'enable_cache': False, # Force IO
}
cmdline = [
'--logger-log-level', 'stream_session=trace',
'--logger-log-level', 'query_processor=trace'
]
servers = await manager.servers_add(3, config=cfg, cmdline=cmdline)
cql = manager.get_cql()
async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2}") as ks:
await cql.run_async(f"CREATE TABLE {ks}.test (pk int PRIMARY KEY, c int, v text);")
await cql.run_async(f"CREATE MATERIALIZED VIEW {ks}.mv AS SELECT * FROM {ks}.test "
"WHERE c IS NOT NULL and pk IS NOT NULL PRIMARY KEY (c, pk)")
keys = range(10240)
val = random_string(10240)
stmt = cql.prepare(f"INSERT INTO {ks}.test (pk, c, v) VALUES (?, ?, '{val}')")
await asyncio.gather(*[cql.run_async(stmt, [k, k]) for k in keys])
await manager.server_stop_gracefully(servers[0].server_id)
await manager.remove_node(servers[1].server_id, servers[0].server_id)
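The bulk load above fans out one `run_async` per key via `asyncio.gather`. The same fan-out, with an optional semaphore to bound how many requests are in flight at once, can be sketched as (`fan_out` and `fake_insert` are illustrative names):

```python
import asyncio

async def fan_out(items, worker, limit=128):
    """Run worker(item) for every item concurrently, at most `limit` in flight."""
    sem = asyncio.Semaphore(limit)

    async def bounded(item):
        async with sem:
            return await worker(item)

    # gather preserves input order in its result list
    return await asyncio.gather(*[bounded(i) for i in items])

async def fake_insert(k):
    await asyncio.sleep(0)      # stands in for cql.run_async(stmt, [k, k])
    return k

results = asyncio.run(fan_out(range(100), fake_insert, limit=16))
```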


@@ -1088,6 +1088,46 @@ async def test_tablet_split_finalization_with_migrations(manager: ManagerClient)
logger.info("Waiting for migrations to complete")
await log.wait_for("Tablet load balancer did not make any plan", from_mark=migration_mark)
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_tablet_split_finalization_with_repair(manager: ManagerClient):
injection = "handle_tablet_resize_finalization_wait"
cfg = {
'enable_tablets': True,
'error_injections_at_startup': [
injection,
"repair_tablets_no_sync",
'short_tablet_stats_refresh_interval',
]
}
servers = await manager.servers_add(2, config=cfg)
cql = manager.get_cql()
await cql.run_async("CREATE KEYSPACE test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 4};")
await cql.run_async("CREATE TABLE test.test (pk int PRIMARY KEY, c int) WITH compaction = {'class': 'NullCompactionStrategy'};")
await asyncio.gather(*[cql.run_async(f"INSERT INTO test.test (pk, c) VALUES ({k}, {k%3});") for k in range(64)])
await manager.api.keyspace_flush(servers[0].ip_addr, "test", "test")
logs = [await manager.server_open_log(s.server_id) for s in servers]
marks = [await log.mark() for log in logs]
logger.info("Trigger split in table")
await cql.run_async("ALTER TABLE test.test WITH tablets = {'min_tablet_count': 8};")
logger.info("Wait for tablets to split")
done, pending = await asyncio.wait([asyncio.create_task(log.wait_for('handle_tablet_resize_finalization: waiting', from_mark=mark)) for log, mark in zip(logs, marks)], return_when=asyncio.FIRST_COMPLETED)
for task in pending:
task.cancel()
async def repair():
await manager.api.client.post(f"/storage_service/repair_async/test", host=servers[0].ip_addr)
async def check_repair_waits():
await logs[0].wait_for("Topology is busy, waiting for it to quiesce", from_mark=marks[0])
await manager.api.message_injection(servers[0].ip_addr, injection)
await asyncio.gather(repair(), check_repair_waits())
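Waiting for whichever server logs the split message first, then cancelling the losers, is the `asyncio.wait(..., return_when=FIRST_COMPLETED)` pattern used above. A minimal sketch with the log waits replaced by timed stubs:

```python
import asyncio

async def wait_for_first(coros):
    """Wait for whichever coroutine finishes first and cancel the rest."""
    tasks = [asyncio.create_task(c) for c in coros]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    return done.pop().result()

async def log_wait(delay, server):
    await asyncio.sleep(delay)   # stands in for log.wait_for(..., from_mark=mark)
    return server

winner = asyncio.run(wait_for_first([log_wait(0.01, "s0"), log_wait(5, "s1")]))
```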
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_two_tablets_concurrent_repair_and_migration_repair_writer_level(manager: ManagerClient):


@@ -388,6 +388,71 @@ async def test_tablet_merge_cross_rack_migrations(manager: ManagerClient, racks)
return tablet_count < old_tablet_count or None
await wait_for(finished_merging, time.time() + 120)
# Reproduces #23284
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_tablet_split_merge_with_many_tables(manager: ManagerClient, racks = 2):
cmdline = ['--smp', '4', '-m', '2G', '--target-tablet-size-in-bytes', '30000', '--max-task-backlog', '200',]
config = {'error_injections_at_startup': ['short_tablet_stats_refresh_interval']}
servers = []
rf = racks
for rack_id in range(0, racks):
rack = f'rack{rack_id+1}'
servers.extend(await manager.servers_add(3, config=config, cmdline=cmdline, property_file={'dc': 'mydc', 'rack': rack}))
cql = manager.get_cql()
ks = await create_new_test_keyspace(cql, f"WITH replication = {{'class': 'NetworkTopologyStrategy', 'replication_factor': {rf}}} AND tablets = {{'initial': 1}}")
await cql.run_async(f"CREATE TABLE {ks}.test (pk int PRIMARY KEY, c blob) WITH compression = {{'sstable_compression': ''}};")
await asyncio.gather(*[cql.run_async(f"CREATE TABLE {ks}.test{i} (pk int PRIMARY KEY, c blob);") for i in range(1, 200)])
async def check_logs(when):
for server in servers:
log = await manager.server_open_log(server.server_id)
matches = await log.grep("Too long queue accumulated for gossip")
if matches:
pytest.fail(f"Server {server.server_id} has too long queue accumulated for gossip {when}: {matches=}")
await check_logs("after creating tables")
total_keys = 400
keys = range(total_keys)
insert = cql.prepare(f"INSERT INTO {ks}.test(pk, c) VALUES(?, ?)")
for pk in keys:
value = random.randbytes(2000)
cql.execute(insert, [pk, value])
for server in servers:
await manager.api.flush_keyspace(server.ip_addr, ks)
async def finished_splitting():
# FIXME: fragile since it's expecting on-disk size will be enough to produce a few splits.
# (raw_data=800k / target_size=30k) = ~26, lower power-of-two is 16. Compression was disabled.
# Per-table hints (min_tablet_count) can be used to improve this.
tablet_count = await get_tablet_count(manager, servers[0], ks, 'test')
return tablet_count >= 16 or None
# Give enough time for split to happen in debug mode
await wait_for(finished_splitting, time.time() + 120)
await check_logs("after split completion")
delete_keys = range(total_keys - 1)
await asyncio.gather(*[cql.run_async(f"DELETE FROM {ks}.test WHERE pk={k};") for k in delete_keys])
keys = range(total_keys - 1, total_keys)
old_tablet_count = await get_tablet_count(manager, servers[0], ks, 'test')
for server in servers:
await manager.api.flush_keyspace(server.ip_addr, ks)
await manager.api.keyspace_compaction(server.ip_addr, ks)
async def finished_merging():
tablet_count = await get_tablet_count(manager, servers[0], ks, 'test')
return tablet_count < old_tablet_count or None
await wait_for(finished_merging, time.time() + 120)
await check_logs("after merge completion")
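The split-count expectation in `finished_splitting` comes from the rough capacity calculation in its FIXME comment; as a worked sketch (assuming, as that comment does, tablet counts round down to a power of two):

```python
# 400 keys x ~2000-byte values, sstable compression disabled.
raw_data = 400 * 2000                    # ~800 KB on disk
target_tablet_size = 30000               # --target-tablet-size-in-bytes
ideal = raw_data // target_tablet_size   # ~26 tablets

# Round down to a power of two, per the test's comment.
expected = 1 << (ideal.bit_length() - 1)
```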
# Reproduces use-after-free when migration right after merge, but concurrently to background
# merge completion handler.
# See: https://github.com/scylladb/scylladb/issues/24045


@@ -469,3 +469,56 @@ async def test_restart_leaving_replica_during_cleanup(manager: ManagerClient, mi
if new_tablet_count < old_tablet_count:
return True
await wait_for(tablets_merged, time.time() + 60)
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_restart_in_cleanup_stage_after_cleanup(manager: ManagerClient):
"""
Migrate a tablet from one node to another, and restart the leaving replica during
the tablet cleanup stage, after tablet cleanup is completed.
Reproduces issue #24857
"""
cfg = {'error_injections_at_startup': ['short_tablet_stats_refresh_interval']}
servers = await manager.servers_add(2, config=cfg)
await manager.api.disable_tablet_balancing(servers[0].ip_addr)
cql = manager.get_cql()
async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 2}") as ks:
await cql.run_async(f"CREATE TABLE {ks}.test (pk int PRIMARY KEY, c int) WITH tablets = {{'min_tablet_count': 8}};")
total_keys = 10
for pk in range(total_keys):
await cql.run_async(f"INSERT INTO {ks}.test(pk, c) VALUES({pk}, {pk+1})")
await manager.api.flush_keyspace(servers[0].ip_addr, ks)
tablet_token = 0
s0_host_id = await manager.get_host_id(servers[0].server_id)
s1_host_id = await manager.get_host_id(servers[1].server_id)
# Find which server holds the tablet
replica = await get_tablet_replica(manager, servers[0], ks, 'test', tablet_token)
if replica[0] == s0_host_id:
src_server, dst_host_id = servers[0], s1_host_id
else:
src_server, dst_host_id = servers[1], s0_host_id
await asyncio.gather(*[manager.api.enable_injection(s.ip_addr, "wait_after_tablet_cleanup", one_shot=False) for s in servers])
log = await manager.server_open_log(servers[0].server_id)
mark = await log.mark()
# Start migration - move tablet to other node
move_task = asyncio.create_task(manager.api.move_tablet(servers[0].ip_addr, ks, 'test', replica[0], replica[1], dst_host_id, 0, tablet_token))
await log.wait_for("Waiting after tablet cleanup", from_mark=mark, timeout=60)
# Restart the leaving replica (src_server)
await manager.server_stop(src_server.server_id)
await manager.server_start(src_server.server_id)
await wait_for_cql_and_get_hosts(manager.get_cql(), servers, time.time() + 30)
await asyncio.gather(*[manager.api.message_injection(s.ip_addr, "wait_after_tablet_cleanup") for s in servers])
await manager.api.quiesce_topology(servers[0].ip_addr)


@@ -0,0 +1,109 @@
#
# Copyright (C) 2025-present ScyllaDB
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#
import logging
import threading
import pytest
import asyncio
import time
from cassandra import ConsistencyLevel # type: ignore
from cassandra.query import SimpleStatement # type: ignore
from test.pylib.manager_client import ManagerClient
from test.pylib.util import wait_for_cql_and_get_hosts
from test.cluster.util import check_token_ring_and_group0_consistency, new_test_keyspace
from test.pylib.util import wait_for
from test.cluster.test_tablets2 import inject_error_on
from test.pylib.scylla_cluster import ReplaceConfig
from test.cluster.util import get_topology_coordinator
from cassandra.cluster import ConnectionException, NoHostAvailable # type: ignore
from test.cluster.conftest import skip_mode
logger = logging.getLogger(__name__)
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_unfinished_writes_during_shutdown(request: pytest.FixtureRequest, manager: ManagerClient) -> None:
""" Test a simultaneous topology change and write query during shutdown, which may cause the node to get stuck (see https://github.com/scylladb/scylladb/issues/23665).
1. Create a keyspace with replication factor 3
2. Start 3 servers
3. Use error injection to pause the 3rd node on a topology change (`barrier_and_drain`)
4. Trigger a topology change by adding a new node to the cluster.
5. Make sure the topology change is paused on node 3 (`barrier_and_drain`).
6. Using error injection, make sure node 2 pauses before sending a write acknowledgment.
7. Send a write query to node 3 (which should already be paused on the topology change operation).
8. The query should complete, but one write to node 2 remains outstanding, so the write_response_handler blocks the topology change on node 3.
9. Start node 3 shutdown. The shutdown should hang, since one of the replicas has not sent its response and the write_response_handler therefore still holds the ERM.
"""
logger.info("Creating a new cluster")
cmdline = [
'--logger-log-level', 'debug_error_injection=debug',
]
servers = await manager.servers_add(3, auto_rack_dc="dc1", cmdline=cmdline)
cql, hosts = await manager.get_ready_cql(servers)
async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3};") as ks:
await cql.run_async(f"CREATE TABLE {ks}.t (pk int primary key, v int)")
target_host = hosts[2]
target_server = servers[2]
# Make the target node stop before locking the ERM
logger.info(
f"Enabling injection 'pause_before_barrier_and_drain' on the target server {target_server}")
target_server_log = await manager.server_open_log(target_server.server_id)
await manager.api.enable_injection(target_server.ip_addr, "pause_before_barrier_and_drain", one_shot=True)
async def do_add_node():
logger.info("Adding a node to the cluster")
try:
await manager.server_add(property_file={"dc": "dc1", "rack": "rack4"})
except Exception as exc:
logger.error(f"Failed to add a new node: {exc}")
# Start adding a new node to the cluster, causing a topology change that will issue a barrier and drain
add_last_node_task = asyncio.create_task(do_add_node())
# Wait for the topology change to start
logger.info("Waiting for a topology change to start")
await target_server_log.wait_for("pause_before_barrier_and_drain: waiting for message")
# Now make sure responses on one of the replicas will be delayed
server_to_pause = servers[1]
await inject_error_on(manager, "storage_proxy_write_response_pause", [server_to_pause])
logger.info(
f"Pausing responses on one of the replicas {server_to_pause}")
paused_server_logs = await manager.server_open_log(server_to_pause.server_id)
# Now send a write query to the target node that will be shut down.
await cql.run_async(f"insert into {ks}.t (pk, v) values ({32765}, {17777})", host=target_host)
# Make sure the node whose response is paused got the write request.
await paused_server_logs.wait_for("storage_proxy_write_response_pause: waiting for message")
# Start shutdown of the query coordinator
async def do_shutdown():
logger.info(f"Starting shutdown of node {target_server.server_id}")
await manager.server_stop_gracefully(target_server.server_id)
shutdown_task = asyncio.create_task(do_shutdown())
# Wait for the shutdown to start
await target_server_log.wait_for("Stop transport: done")
# Unpause the coordinator to make it now continue with `barrier_and_drain` shutdown
await manager.api.message_injection(target_server.ip_addr, 'pause_before_barrier_and_drain')
logger.info(f"Unblocking writes on the node {server_to_pause}")
await manager.api.message_injection(server_to_pause.ip_addr, 'storage_proxy_write_response_pause')
logger.info("Waiting for the shutdown to complete")
await shutdown_task
logger.info("Cancelling addnode task")
add_last_node_task.cancel()


@@ -64,7 +64,8 @@ async def test_zero_token_nodes_multidc_basic(manager: ManagerClient, zero_token
AND tablets = {{ 'enabled': true }}""")
ks_names.append(ks_name)
try:
await dc2_cql.run_async(f'CREATE TABLE {ks_names[rf]}.tbl (pk int PRIMARY KEY, v int)')
await dc2_cql.run_async(
f'CREATE TABLE {ks_names[rf]}.tbl (cl int, zero_token boolean, v int, PRIMARY KEY (cl, zero_token))')
except Exception:
failed = True
assert failed == (rf > normal_nodes_in_dc2)
@@ -85,18 +86,30 @@ async def test_zero_token_nodes_multidc_basic(manager: ManagerClient, zero_token
for cl in cls:
logging.info('Testing with rf=%s, consistency_level=%s', rf, cl)
insert_query = SimpleStatement(f'INSERT INTO {ks_names[rf]}.tbl (pk, v) VALUES (1, 1)',
consistency_level=cl)
await dc1_cql.run_async(insert_query)
await dc2_cql.run_async(insert_query)
insert_queries = [
SimpleStatement(
f'INSERT INTO {ks_names[rf]}.tbl (cl, zero_token, v) VALUES ({cl}, {zero_token_coordinator}, {cl})',
consistency_level=cl
) for zero_token_coordinator in [False, True]
]
await dc1_cql.run_async(insert_queries[0])
await dc2_cql.run_async(insert_queries[1])
if cl == ConsistencyLevel.EACH_QUORUM:
continue # EACH_QUORUM is supported only for writes
select_query = SimpleStatement(f'SELECT * FROM {ks_names[rf]}.tbl', consistency_level=cl)
dc1_result_set = await dc1_cql.run_async(select_query)
dc2_result_set = await dc2_cql.run_async(select_query)
assert dc1_result_set
assert list(dc1_result_set[0]) == [1, 1]
assert dc2_result_set
assert list(dc2_result_set[0]) == [1, 1]
select_queries = [
SimpleStatement(
f'SELECT * FROM {ks_names[rf]}.tbl WHERE cl = {cl} AND zero_token = {zero_token_coordinator}',
consistency_level=cl
) for zero_token_coordinator in [False, True]
]
dc1_result_set = await dc1_cql.run_async(select_queries[0])
dc2_result_set = await dc2_cql.run_async(select_queries[1])
# With CL=ONE we don't have a guarantee that the replicas written to and read from have a non-empty
# intersection. Hence, reads could miss the written rows.
assert cl == ConsistencyLevel.ONE or (dc1_result_set and dc2_result_set)
if dc1_result_set:
assert list(dc1_result_set[0]) == [cl, False, cl]
if dc2_result_set:
assert list(dc2_result_set[0]) == [cl, True, cl]
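The CL=ONE caveat in the asserts above follows from the usual replica-intersection rule: a read is guaranteed to observe a write only when the written and read replica sets must overlap. A minimal sketch (the helper name is ours, not part of the test):

```python
# Hypothetical helper illustrating why CL=ONE reads may miss writes:
# a read and a write are guaranteed to share at least one replica only
# when the number of written plus read replicas exceeds the RF.
def quorum_overlap_guaranteed(write_replicas: int, read_replicas: int, rf: int) -> bool:
    """Return True when any write/read replica sets of these sizes must intersect."""
    return write_replicas + read_replicas > rf

# With RF=3: ONE/ONE (1+1=2) gives no guarantee, QUORUM/QUORUM (2+2=4) does.
assert not quorum_overlap_guaranteed(1, 1, 3)
assert quorum_overlap_guaranteed(2, 2, 3)
```

This is why the test only asserts non-empty result sets for consistency levels stronger than ONE.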


@@ -116,3 +116,126 @@ def test_cdc_taken_log_name(scylla_only, cql, test_keyspace):
cql.execute(f"DROP TABLE {name}")
finally:
cql.execute(f"DROP TABLE {name}_scylla_cdc_log")
@pytest.mark.parametrize("test_keyspace",
[pytest.param("tablets", marks=[pytest.mark.xfail(reason="issue #16317")]), "vnodes"],
indirect=True)
def test_alter_column_of_cdc_log_table(cql, test_keyspace, scylla_only):
with new_test_table(cql, test_keyspace, "p int PRIMARY KEY, v int, u int", "with cdc = {'enabled': true}") as table:
cdc_log_table_name = f"{table}_scylla_cdc_log"
errmsg = "You cannot modify the set of columns of a CDC log table directly. " \
"Modify the base table instead."
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f"ALTER TABLE {cdc_log_table_name} ADD c int")
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f"ALTER TABLE {cdc_log_table_name} DROP u")
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f'ALTER TABLE {cdc_log_table_name} DROP "cdc$stream_id"')
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f"ALTER TABLE {cdc_log_table_name} ALTER u TYPE float")
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f'ALTER TABLE {cdc_log_table_name} ALTER "cdc$stream_id" TYPE float')
cql.execute(f"ALTER TABLE {table} DROP u")
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f'ALTER TABLE {cdc_log_table_name} DROP "cdc$deleted_u"')
@pytest.mark.parametrize("test_keyspace",
[pytest.param("tablets", marks=[pytest.mark.xfail(reason="issue #16317")]), "vnodes"],
indirect=True)
def test_rename_column_of_cdc_log_table(cql, test_keyspace, scylla_only):
with new_test_table(cql, test_keyspace, "p int PRIMARY KEY, v int, u int", "with cdc = {'enabled': true}") as table:
cdc_log_table_name = f"{table}_scylla_cdc_log"
errmsg = "Cannot rename a column of a CDC log table."
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f"ALTER TABLE {cdc_log_table_name} RENAME u TO c")
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f'ALTER TABLE {cdc_log_table_name} RENAME "cdc$stream_id" TO c')
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f'ALTER TABLE {cdc_log_table_name} RENAME "cdc$stream_id" TO "cdc$c"')
cql.execute(f"ALTER TABLE {table} DROP u")
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f'ALTER TABLE {cdc_log_table_name} RENAME "cdc$deleted_u" TO c')
# Verify that you cannot modify the set of columns on a CDC log table, even when it stops being active.
@pytest.mark.parametrize("test_keyspace",
[pytest.param("tablets", marks=[pytest.mark.xfail(reason="issue #16317")]), "vnodes"],
indirect=True)
def test_alter_column_of_inactive_cdc_log_table(cql, test_keyspace, scylla_only):
with new_test_table(cql, test_keyspace, "p int PRIMARY KEY, v int, u int", "with cdc = {'enabled': true}") as table:
cdc_log_table_name = f"{table}_scylla_cdc_log"
# Insert some data just so we don't work on an empty table. This shouldn't
# have ANY impact on how the test behaves, but let's do it anyway.
cql.execute(f"INSERT INTO {table}(p, v, u) VALUES (1, 2, 3)")
# Detach the log table.
cql.execute(f"ALTER TABLE {table} WITH cdc = {{'enabled': false}}")
errmsg = "You cannot modify the set of columns of a CDC log table directly. " \
"Although the base table has deactivated CDC, this table will continue being " \
"a CDC log table until it is dropped. If you want to modify the columns in it, " \
"you can only do that by reenabling CDC on the base table, which will reattach " \
"this log table. Then you will be able to modify the columns in the base table, " \
"and that will have effect on the log table too. Modifying the columns of a CDC " \
"log table directly is never allowed."
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f"ALTER TABLE {cdc_log_table_name} ADD c int")
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f"ALTER TABLE {cdc_log_table_name} DROP u")
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f'ALTER TABLE {cdc_log_table_name} DROP "cdc$stream_id"')
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f"ALTER TABLE {cdc_log_table_name} ALTER u TYPE float")
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f'ALTER TABLE {cdc_log_table_name} ALTER "cdc$stream_id" TYPE float')
# Verify that altering the set of columns of a table whose name resembles that of a CDC log table,
# but that is NOT a CDC log table, is possible.
def test_alter_column_of_fake_cdc_log_table(cql, test_keyspace, scylla_only):
name = unique_name()
fake_cdc_log_table_name = f"{name}_scylla_cdc_log"
try:
cql.execute(f"CREATE TABLE {test_keyspace}.{fake_cdc_log_table_name} (p int PRIMARY KEY, v int)")
cql.execute(f"ALTER TABLE {test_keyspace}.{fake_cdc_log_table_name} DROP v")
finally:
cql.execute(f"DROP TABLE IF EXISTS {test_keyspace}.{fake_cdc_log_table_name}")
# Verify that you cannot rename a column of a CDC log table, even when it stops being active.
@pytest.mark.parametrize("test_keyspace",
[pytest.param("tablets", marks=[pytest.mark.xfail(reason="issue #16317")]), "vnodes"],
indirect=True)
def test_rename_column_of_inactive_cdc_log_table(cql, test_keyspace, scylla_only):
with new_test_table(cql, test_keyspace, "p int PRIMARY KEY, v int, u int", "with cdc = {'enabled': true}") as table:
cdc_log_table_name = f"{table}_scylla_cdc_log"
# Insert some data just so we don't work on an empty table. This shouldn't
# have ANY impact on how the test behaves, but let's do it anyway.
cql.execute(f"INSERT INTO {table}(p, v, u) VALUES (1, 2, 3)")
# Detach the log table.
cql.execute(f"ALTER TABLE {table} WITH cdc = {{'enabled': false}}")
errmsg = "You cannot rename a column of a CDC log table. Although the base table " \
"has deactivated CDC, this table will continue being a CDC log table until it " \
"is dropped."
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f"ALTER TABLE {cdc_log_table_name} RENAME u TO c")
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f'ALTER TABLE {cdc_log_table_name} RENAME "cdc$stream_id" TO c')
with pytest.raises(InvalidRequest, match=errmsg):
cql.execute(f'ALTER TABLE {cdc_log_table_name} RENAME "cdc$stream_id" TO "cdc$c"')
# Verify that you can rename a column in a table whose name resembles that of a CDC log table
# but that is NOT a CDC log table.
def test_rename_column_of_fake_cdc_log_table(cql, test_keyspace, scylla_only):
name = unique_name()
fake_cdc_log_table_name = f"{name}_scylla_cdc_log"
try:
cql.execute(f"CREATE TABLE {test_keyspace}.{fake_cdc_log_table_name} (p int PRIMARY KEY, v int)")
cql.execute(f"ALTER TABLE {test_keyspace}.{fake_cdc_log_table_name} RENAME p TO q")
finally:
cql.execute(f"DROP TABLE IF EXISTS {test_keyspace}.{fake_cdc_log_table_name}")


@@ -57,6 +57,11 @@ class LRUCache:
if len(self.cache) > self.capacity:
self.cache.popitem(last=False)
def remove(self, key: str):
with self.lock:
if key in self.cache:
del self.cache[key]
# Simple proxy between the s3 client and minio that randomly injects errors to simulate cases where the request succeeds but the wire got "broken"
def true_or_false():
@@ -187,6 +192,8 @@ class InjectingHandler(BaseHTTPRequestHandler):
policy.error_count += 1
self.respond_with_error(reset_connection=policy.server_should_fail)
else:
# Once the request is successfully processed, we remove the policy from the cache so that
# subsequent requests to the same resource are no longer eligible to fail
self.policies.remove(self.path)
self.send_response(response.status_code)
for key, value in response.headers.items():
if key.upper() != 'CONTENT-LENGTH':


@@ -88,8 +88,11 @@ def populateSomeData(cql, cf: str, pk_range: tuple[int], timestamp: int | None =
stmt = cql.prepare(f"INSERT INTO {cf} (pk, ck, v) VALUES (?, ?, ?) {'USING TIMESTAMP ?' if timestamp else ''}")
for pk in range(*pk_range):
for ck in range(1, 6):
timestamp = timestamp + step if timestamp is not None else None
cql.execute(stmt, [pk, ck*11+100, 0], timestamp)
data = [pk, ck*11+100, 0]
if timestamp is not None:
timestamp += step
data.append(timestamp)
cql.execute(stmt, data)
def alterSomeData(cql, cf: str, timestamp: int | None = None):
@@ -125,16 +128,22 @@ def test_compactionhistory_rows_merged_time_window_compaction_strategy(cql, rest
compaction_opt = "{'class': 'TimeWindowCompactionStrategy', 'compaction_window_unit': 'MINUTES', 'compaction_window_size': 1}"
with new_test_table(cql, ks, "pk int, ck int, v int, PRIMARY KEY (pk, ck)",
f"WITH compaction = {compaction_opt};") as cf:
timestamp = int(time.time())
now = int(time.time() * 1e6) # current time in microseconds
window_size = int(6e7) # 1 minute in microseconds
step = int(1e6) # 1 second in microseconds
# Spread data across 2 windows by simulating a write process. `USING TIMESTAMP` is
# provided to distribute the writes in the first one-minute window while updates and
# deletes are propagated into the second 1-minute window.
populateSomeData(cql, cf, (1, 6), timestamp - 60, 1)
#
# To assign a timestamp to a window in TWCS, we just divide it by the window
# duration and use the result as the window id (discarding the remainder).
start = (now // window_size - 1)*window_size
populateSomeData(cql, cf, (1, 6), start, step)
response = rest_api.send("POST", f"storage_service/keyspace_flush/{ks}")
assert response.status_code == requests.codes.ok
alterSomeData(cql, cf)
alterSomeData(cql, cf, start + window_size)
response = rest_api.send("POST", f"storage_service/keyspace_flush/{ks}")
assert response.status_code == requests.codes.ok
@@ -167,7 +176,7 @@ def test_compactionhistory_tombstone_purge_statistics(cql, rest_api):
response = waitAndGetCompleteCompactionHistory(rest_api, cf)
stats = extractTombstonePurgeStatistics(response, ks)
assert stats == TombstonePurgeStats(5, 0, 0)
assert stats == TombstonePurgeStats(4, 0, 0)
stats = extractSStablesStatistics(response, ks)
assert len(stats.input) == 4 and len(stats.output) == 2
@@ -199,7 +208,7 @@ def test_compactionhistory_tombstone_purge_statistics_overlapping_with_memtable(
response = waitAndGetCompleteCompactionHistory(rest_api, cf)
stats = extractTombstonePurgeStatistics(response, ks)
assert stats == TombstonePurgeStats(5, 1, 0)
assert stats == TombstonePurgeStats(4, 1, 0)
def test_compactionhistory_tombstone_purge_statistics_overlapping_with_other_sstables(cql, rest_api):
@@ -230,7 +239,7 @@ def test_compactionhistory_tombstone_purge_statistics_overlapping_with_other_sst
response = waitAndGetCompleteCompactionHistory(rest_api, cf)
stats = extractTombstonePurgeStatistics(response, ks)
assert stats == TombstonePurgeStats(5, 0, 1)
assert stats == TombstonePurgeStats(4, 0, 1)
stats = extractSStablesStatistics(response, ks)
assert len(stats.input) == 4 and len(stats.output) == 2
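The window-id arithmetic the test's comment describes can be sketched on its own (helper names are ours; ScyllaDB's TWCS implementation lives in C++):

```python
# Sketch of TWCS window assignment: integer division maps each write
# timestamp (microseconds) to a window id, discarding the remainder.
WINDOW_SIZE_US = 60_000_000  # 1 minute in microseconds

def window_id(write_timestamp_us: int) -> int:
    return write_timestamp_us // WINDOW_SIZE_US

def window_start(write_timestamp_us: int) -> int:
    return window_id(write_timestamp_us) * WINDOW_SIZE_US

# As in the test: stepping back one window from `now` and then adding
# WINDOW_SIZE_US lands the second batch of writes in the next window.
now = 1_700_000_123_456_789
start = (now // WINDOW_SIZE_US - 1) * WINDOW_SIZE_US
assert window_id(start) == window_id(now) - 1
assert window_id(start + WINDOW_SIZE_US) == window_id(now)
```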


@@ -15,6 +15,7 @@
#include <limits>
#include <iterator>
#include <numeric>
#include <fstream>
#include <boost/algorithm/string/case_conv.hpp>
#include <boost/algorithm/string/classification.hpp>
#include <boost/algorithm/string/split.hpp>
@@ -25,9 +26,11 @@
#include <seastar/core/sleep.hh>
#include <seastar/core/thread.hh>
#include <seastar/core/when_all.hh>
#include <seastar/core/fstream.hh>
#include <seastar/http/exception.hh>
#include <seastar/http/request.hh>
#include <seastar/util/short_streams.hh>
#include <seastar/util/closeable.hh>
#include <seastar/core/units.hh>
#include <seastar/net/dns.hh>
#include <seastar/net/inet_address.hh>
@@ -1751,22 +1754,40 @@ void restore_operation(scylla_rest_client& client, const bpo::variables_map& vm)
}
params[required_param] = vm[required_param].as<sstring>();
}
if (!vm.contains("sstables")) {
throw std::invalid_argument("missing required positional argument: sstables");
bool sstables_as_params = vm.contains("sstables");
bool sstables_as_file_list = vm.contains("sstables-file-list");
if (not sstables_as_params and not sstables_as_file_list) {
throw std::invalid_argument("missing arguments: sstables and --sstables-file-list (at least one is required)");
}
if (vm.contains("scope")) {
params["scope"] = vm["scope"].as<sstring>();
}
sstring sstables_body = std::invoke([&vm] {
std::stringstream output;
rjson::streaming_writer writer(output);
writer.StartArray();
for (auto& toc_fn : vm["sstables"].as<std::vector<sstring>>()) {
writer.String(toc_fn);
std::stringstream output;
rjson::streaming_writer writer(output);
writer.StartArray();
// add the list given by the file param
if (sstables_as_file_list) {
sstring sstables_list_file = vm["sstables-file-list"].as<sstring>();
auto file = open_file_dma(sstables_list_file, open_flags::ro).get();
auto file_close = seastar::deferred_close(file);
auto is = seastar::make_file_input_stream(file);
auto is_close = seastar::deferred_close(is);
auto sstables_list = seastar::util::read_entire_stream_contiguous(is).get();
for (const auto& toc : std::views::split(sstables_list, '\n')) {
writer.String(std::string_view(toc));
}
writer.EndArray();
return make_sstring(output.view());
});
}
// add the list provided by the command line
if (sstables_as_params) {
for (auto& toc : vm["sstables"].as<std::vector<sstring>>()) {
writer.String(toc);
}
}
writer.EndArray();
sstring sstables_body = make_sstring(output.view());
const auto restore_res = client.post("/storage_service/restore", std::move(params),
request_body{"application/json", std::move(sstables_body)});
const auto task_id = rjson::to_string_view(restore_res);
@@ -4267,6 +4288,7 @@ For more information, see: {}"
typed_option<sstring>("table", "Name of a table to copy SSTables to"),
typed_option<>("nowait", "Don't wait on the restore process"),
typed_option<sstring>("scope", "Load-and-stream scope (node, rack or dc)"),
typed_option<sstring>("sstables-file-list", "A file containing the list of sstables to restore (optional)"),
},
{
typed_option<std::vector<sstring>>("sstables", "The object keys of the TOC component of the SSTables to be restored", -1),

Some files were not shown because too many files have changed in this diff.