Compare commits

...

85 Commits

Author SHA1 Message Date
Andrei Chekun
d1274f01aa test.py: rewrite the wait_for_first_completed
Rewrite wait_for_first_completed to return only the first completed task, with the
guarantee that all cancelled and finished tasks are awaited (and thus disappear).
Use wait_for_first_completed to avoid falsely passing tests in the future and issues
like #26148.
Use gather_safely to await tasks, removing the warning that a coroutine was
not awaited.

Closes scylladb/scylladb#26435

(cherry picked from commit 24d17c3ce5)

Closes scylladb/scylladb#26663
2025-10-22 18:12:52 +02:00
Michael Litvak
aa2065fe2e storage_service: improve colocated repair error to show table names
When requesting repair for tablets of a colocated table, the request
fails with an error. Improve the error message to show the table names
instead of table IDs, because the table names are more useful for users.

Fixes scylladb/scylladb#26567

Closes scylladb/scylladb#26568

(cherry picked from commit b808d84d63)

Closes scylladb/scylladb#26624
2025-10-22 15:25:15 +02:00
Asias He
5c7eb2ac61 repair: Fix uuid and nodes_down order in the log
Fixes #26536

Closes scylladb/scylladb#26547

(cherry picked from commit 33bc1669c4)

Closes scylladb/scylladb#26630
2025-10-22 14:25:18 +02:00
Tomasz Grabiec
0621a8aee5 Merge '[Backport 2025.4] Synchronize tablet split and load-and-stream' from Scylladb[bot]
Load-and-stream is broken when running concurrently with the finalization step of tablet split.

Consider this:
1) split starts
2) split finalization executes barrier and succeeds
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets

two possible fixes (maybe both):

1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets

This patch implements fix (1). By waiting for the topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance the coordinator is handling some topology operation like
split finalization.

Fixes https://github.com/scylladb/scylladb/issues/26455.

- (cherry picked from commit 3abc66da5a)

- (cherry picked from commit 4654cdc6fd)

Parent PR: #26456

Closes scylladb/scylladb#26651

* github.com:scylladb/scylladb:
  test: Add reproducer for l-a-s and split synchronization issue
  sstables_loader: Synchronize tablet split and load-and-stream
2025-10-22 14:23:04 +02:00
Jenkins Promoter
10db3f7c85 Update ScyllaDB version to: 2025.4.0-rc3 2025-10-22 14:11:52 +03:00
Pavel Emelyanov
45341ca246 Merge '[Backport 2025.4] s3_client: handle failures which require http::request updating' from Scylladb[bot]
Apply two main changes to the s3_client error handling:
1. Add a loop to s3_client's `make_request` for the case when the retry strategy will not help because the request itself has to be updated, for example, on authentication token expiration or a stale timestamp in the request header (a rough sketch of such a loop follows below).
2. Refine the way we handle exceptions in the `chunked_download_source` background fiber: we now carry the original `exception_ptr` and wrap every exception in `filler_exception` to prevent the retry strategy from retrying the request altogether.
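
The commit itself isn't reproduced here; as a rough illustration, a standard-C++ sketch of such a high-level retry loop could look like the following. The names (`rebuild_and_sign`, `is_request_update_error`, `send`) are placeholders, not the real s3_client API.

```cpp
#include <iostream>
#include <stdexcept>
#include <string>

// Placeholder types and functions -- not the actual s3_client code.
struct http_request { std::string auth_header; std::string date_header; };

http_request rebuild_and_sign() {
    // Refresh credentials and the Date header, then re-sign the request.
    return http_request{"AWS4-HMAC-SHA256 ...", "now"};
}

bool is_request_update_error(const std::runtime_error& e) {
    // e.g. an expired token or a "request time too skewed" response.
    return std::string(e.what()).find("needs-new-request") != std::string::npos;
}

void send(const http_request&) {
    // Issue the HTTP request; throws on failure. Stubbed out here.
}

void make_request_with_retries(unsigned max_attempts) {
    for (unsigned attempt = 1; ; ++attempt) {
        http_request req = rebuild_and_sign();   // headers/auth recomputed on every attempt
        try {
            send(req);
            return;                              // success
        } catch (const std::runtime_error& e) {
            // Retry only the errors that require a *new* request; cap the attempts.
            if (attempt >= max_attempts || !is_request_update_error(e)) {
                throw;
            }
        }
    }
}

int main() {
    make_request_with_retries(3);
    std::cout << "request completed\n";
    return 0;
}
```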

Fixes: https://github.com/scylladb/scylladb/issues/26483

Should be ported back to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions

- (cherry picked from commit 55fb2223b6)

- (cherry picked from commit db1ca8d011)

- (cherry picked from commit 185d5cd0c6)

- (cherry picked from commit 116823a6bc)

- (cherry picked from commit 43acc0d9b9)

- (cherry picked from commit 58a1cff3db)

- (cherry picked from commit 1d34657b14)

- (cherry picked from commit 4497325cd6)

- (cherry picked from commit fdd0d66f6e)

Parent PR: #26527

Closes scylladb/scylladb#26650

* github.com:scylladb/scylladb:
  s3_client: tune logging level
  s3_client: add logging
  s3_client: improve exception handling for chunked downloads
  s3_client: fix indentation
  s3_client: add max for client level retries
  s3_client: remove `s3_retry_strategy`
  s3_client: support high-level request retries
  s3_client: just reformat `make_request`
  s3_client: unify `make_request` implementation
2025-10-22 11:33:53 +03:00
Piotr Dulikowski
1efb2eb174 view_building_worker: access tablet map through erm on sstable discovery
Currently, the data returned by `database::get_tables_metadata()` and
`database::get_token_metadata()` may not be consistent. Specifically,
the tables metadata may contain some tablet-based tables before their
tablet maps appear in the token metadata. This is going to be fixed
after issue scylladb/scylladb#24414 is closed, but for the time being
work around it by accessing the token metadata via
`table`->effective_replication_map() - that token metadata is guaranteed
to have the tablet map of the `table`.

Fixes: scylladb/scylladb#26403

Closes scylladb/scylladb#26588

(cherry picked from commit f76917956c)

Closes scylladb/scylladb#26631
2025-10-22 11:33:22 +03:00
Pavel Emelyanov
320ef84367 Merge '[Backport 2025.4] compaction/twcs: fix use after free issues' from Scylladb[bot]
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.

Fix this by using `seastar::shared_ptr` for the state variant
alternatives (`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.

The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.

The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.

Fixes #25913

Issue probably started after 9d3755f276, so backport to 2025.4

- (cherry picked from commit 1cd43bce0e)

- (cherry picked from commit 35159e5b02)

- (cherry picked from commit 18c071c94b)

Parent PR: #26593

Closes scylladb/scylladb#26625

* github.com:scylladb/scylladb:
  compaction: fix use after free when strategy is altered during compaction
  compaction/twcs: pass compaction_strategy_state to internal methods
  compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction
2025-10-22 11:32:28 +03:00
Raphael S. Carvalho
92a603699e test: Add reproducer for l-a-s and split synchronization issue
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 4654cdc6fd)
2025-10-21 12:26:55 +00:00
Raphael S. Carvalho
d998d9d418 sstables_loader: Synchronize tablet split and load-and-stream
Load-and-stream is broken when running concurrently with the
finalization step of tablet split.

Consider this:
1) split starts
2) split finalization executes barrier and succeeds
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets

two possible fixes (maybe both):

1) load-and-stream awaits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets

This patch implements fix (1). By waiting for the topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance the coordinator is handling some topology operation like
split finalization.

Fixes #26455.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 3abc66da5a)
2025-10-21 12:26:54 +00:00
Ernest Zaslavsky
6f6b3a26c4 s3_client: tune logging level
Change all logging related to errors in the `chunked_download_source` background download fiber to the `info` level, to make them visible right away in the logs.

(cherry picked from commit fdd0d66f6e)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
4eb427976b s3_client: add logging
Add logging for the case when we encounter expired credentials; this shouldn't happen, but just in case.

(cherry picked from commit 4497325cd6)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
94d49da8ec s3_client: improve exception handling for chunked downloads
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.
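
The code isn't shown in this listing; a minimal standard-C++ sketch of a wrapper exception that preserves the original `exception_ptr` might look like the following. The name `filler_exception` follows the commit, but the members and helpers shown are assumptions.

```cpp
#include <exception>
#include <iostream>
#include <stdexcept>

class filler_exception : public std::runtime_error {
    std::exception_ptr _cause;   // the original failure, preserved for diagnosis
public:
    explicit filler_exception(std::exception_ptr cause)
        : std::runtime_error("background download fiber failed"), _cause(std::move(cause)) {}
    [[noreturn]] void rethrow_cause() const { std::rethrow_exception(_cause); }
};

int main() {
    try {
        try {
            throw std::runtime_error("connection reset by peer");  // original failure
        } catch (...) {
            // Wrap every exception: the distinct type tells the caller the
            // background fiber already failed, so the request must not be retried.
            throw filler_exception(std::current_exception());
        }
    } catch (const filler_exception& e) {
        try { e.rethrow_cause(); }
        catch (const std::exception& cause) {
            std::cout << "root cause: " << cause.what() << "\n";
        }
    }
    return 0;
}
```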

(cherry picked from commit 1d34657b14)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
f9bc211966 s3_client: fix indentation
Reformat `client::make_request` to fix the indentation of `if` block

(cherry picked from commit 58a1cff3db)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
4aff338282 s3_client: add max for client level retries
To prevent the client from retrying time-skew and authentication errors indefinitely, add `max_attempts` to `client::make_request`.

(cherry picked from commit 43acc0d9b9)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
8b7dce8334 s3_client: remove s3_retry_strategy
It never worked as intended, so credentials handling moves to the same place where we handle time skew, since we have to reauthenticate the request.

(cherry picked from commit 116823a6bc)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
2afd323838 s3_client: support high-level request retries
Add an option to retry S3 requests at the highest level, including
reinitializing headers and reauthenticating. This addresses cases
where retrying the same request fails, such as when the S3 server
rejects a timestamp older than 15 minutes.

(cherry picked from commit 185d5cd0c6)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
f2f415a742 s3_client: just reformat make_request
Just reformat previously changed methods to improve readability

(cherry picked from commit db1ca8d011)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
5c2d8bd273 s3_client: unify make_request implementation
Refactor `make_request` to use a single core implementation that
handles authentication and issues the HTTP request. All overloads now
delegate to this unified method.

(cherry picked from commit 55fb2223b6)
2025-10-21 12:26:49 +00:00
Lakshmi Narayanan Sreethar
45b9675d28 compaction: fix use after free when strategy is altered during compaction
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.

Fix this by using `seastar::shared_ptr` for the state variant
alternatives (`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.

The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.
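
As a rough illustration of this ownership change, here is a standard-C++ sketch using `std::shared_ptr` and made-up state types in place of `seastar::shared_ptr` and the real strategy states:

```cpp
#include <cassert>
#include <memory>
#include <variant>

struct leveled_state { int level = 0; };
struct time_window_state { long window_start = 0; };

using leveled_state_ptr     = std::shared_ptr<leveled_state>;
using time_window_state_ptr = std::shared_ptr<time_window_state>;

struct compaction_strategy_state {
    // The wrapper holds shared_ptr alternatives instead of the states by value.
    std::variant<leveled_state_ptr, time_window_state_ptr> state =
        std::make_shared<time_window_state>();
};

int main() {
    compaction_strategy_state wrapper;
    // An ongoing compaction copies the shared_ptr instead of keeping a reference
    // into the variant, so it co-owns the state.
    auto in_use = std::get<time_window_state_ptr>(wrapper.state);
    // ALTER TABLE ... WITH compaction = {...} replaces the wrapper's state.
    wrapper.state = std::make_shared<leveled_state>();
    // The old state is still alive for the running compaction: no use-after-free.
    assert(in_use->window_start == 0);
    return 0;
}
```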

Fixes #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 18c071c94b)
2025-10-21 00:59:33 +00:00
Lakshmi Narayanan Sreethar
f1e1c7db4c compaction/twcs: pass compaction_strategy_state to internal methods
During TWCS compaction, multiple methods independently fetch the
compaction_strategy_state using get_state(). This can lead to
inconsistencies if the compaction strategy is ALTERed while the
compaction is in progress.

This patch fixes a part of this issue by passing down the state to the
lower level methods as parameters instead of fetching it repeatedly.

Refs #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 35159e5b02)
2025-10-21 00:59:33 +00:00
Lakshmi Narayanan Sreethar
5e1f32b3d4 compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction
The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.

Refs #26546
Refs #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 1cd43bce0e)
2025-10-21 00:59:32 +00:00
Botond Dénes
99f2dd02bf Merge '[Backport 2025.4] raft topology: disable schema pulls in the Raft-based recovery procedure' from Scylladb[bot]
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.

We fix this issue and add a regression test in this PR.

Fixes #26569

This is an important bug fix, so it should be backported to all branches
with the Raft-based recovery procedure (2025.2 and newer branches).

- (cherry picked from commit ec3a35303d)

- (cherry picked from commit da8748e2b1)

- (cherry picked from commit 71de01cd41)

Parent PR: #26572

Closes scylladb/scylladb#26599

* github.com:scylladb/scylladb:
  test: test_raft_recovery_entry_loss: fix the typo in the test case name
  test: verify that schema pulls are disabled in the Raft-based recovery procedure
  raft topology: disable schema pulls in the Raft-based recovery procedure
2025-10-20 10:39:52 +03:00
Botond Dénes
76a6a059c8 Merge '[Backport 2025.4] Fix vector store client flaky test' from Scylladb[bot]
This series of patches improves the stability of vector_store_client_test. The primary issue with flaky connections was discovered while working on PR #26308.

Key Changes:
- Fixes premature connection closures in the mock server:
The mock HTTP server was not consuming request payloads, causing it to close connections immediately after a response. Subsequent tests attempting to reuse these closed connections would fail intermittently, leading to flakiness. The server has been updated to handle payloads correctly.

- Removes a retry workaround:
With the underlying connection issue resolved, the retry logic in the vector_store_client_test_ann_request test is no longer needed and has been removed.

- Mocks the DNS resolver in tests:
The vector_store_client_uri_update_to_invalid test has been corrected to mock DNS lookups, preventing it from making real network requests.

- Corrects request timeout handling:
A bug has been fixed where the request timeout was not being reset between consecutive requests.

- Unifies test timeouts:
Timeouts have been standardized across the test suite for consistency.

Fixes: #26468

It is recommended to backport this series to the 2025.4 branch. Since these changes only affect test code and do not alter any production logic, the backport is safe. Addressing this test flakiness will improve the stability of the CI pipeline and prevent it from blocking unrelated patches.

- (cherry picked from commit ac5e9c34b6)

- (cherry picked from commit 2eb752e582)

- (cherry picked from commit d99a4c3bad)

- (cherry picked from commit 0de1fb8706)

- (cherry picked from commit 62deea62a4)

Parent PR: #26374

Closes scylladb/scylladb#26551

* github.com:scylladb/scylladb:
  vector_search: Unify test timeouts
  vector_search: Fix missing timeout reset
  vector_search: Refactor ANN request test
  vector_search: Fix flaky connection in tests
  vector_search: Fix flaky test by mocking DNS queries
2025-10-20 10:35:45 +03:00
Michał Chojnowski
6ff4910d96 test/cluster/test_bti_index.py: avoid a race with CQL tracing
The test uses CQL tracing to check which files were read by a query.
This is flaky if the coordinator and the replica are different shards,
because the Python driver only waits for the coordinator, and not
for replicas, to finish writing their traces.
(So it might happen that the Python driver returns a result
with only coordinator events and no replica events).

Let's just dodge the issue by using --smp=1.

Fixes scylladb/scylladb#26432

Closes scylladb/scylladb#26434

(cherry picked from commit c35b82b860)

Closes scylladb/scylladb#26492
2025-10-20 10:32:58 +03:00
Botond Dénes
d213953d0a Merge '[Backport 2025.4] tools: fix documentation links after change to source-available' from Scylladb[bot]
Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change; they still point to the old open-source docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these.

Fixes: scylladb/scylladb#26320

Broken documentation link fix for the tool help output; needs a backport to all live source-available versions.

- (cherry picked from commit 5a69838d06)

- (cherry picked from commit 15a4a9936b)

- (cherry picked from commit fe73c90df9)

Parent PR: #26322

Closes scylladb/scylladb#26390

* github.com:scylladb/scylladb:
  tools/scylla-sstable: fix doc links
  release: adjust doc_link() for the post source-available world
  tools/scylla-nodetool: remove trailing " from doc urls
2025-10-20 10:31:05 +03:00
Michał Jadwiszczak
f5e76d0fcb test/cluster/test_view_building_coordinator: skip reproducer instead of xfail
The reproducer for issue scylladb/scylladb#26244 takes some time
and since the test is failing, there is no point in wasting resources on
it.
We can change the xfail mark to skip.

Refs scylladb/scylladb#26244

Closes scylladb/scylladb#26350

(cherry picked from commit d92628e3bd)

Closes scylladb/scylladb#26365
2025-10-20 10:30:34 +03:00
Aleksandra Martyniuk
2819b8b755 test: wait for cql in test_two_tablets_concurrent_repair_and_migration_repair_writer_level
In test_two_tablets_concurrent_repair_and_migration_repair_writer_level
safe_rolling_restart returns a ready cql session. However, get_all_tablet_replicas
uses the cql reference from the manager, which isn't ready. Wait for cql.

Fixes: #26328

Closes scylladb/scylladb#26349

(cherry picked from commit 0e73ce202e)

Closes scylladb/scylladb#26362
2025-10-20 10:29:56 +03:00
Avi Kivity
245d27347b dht, sstables: replace vector with chunked_vector when computing sstable shards
sstable::compute_shards_for_this_sstable() has a temporary of type
std::vector<dht::token_range> (aka dht::partition_range_vector), which
allocates a contiguous 300k when loading an sstable from disk. This
causes large allocation warnings (it doesn't really stress the allocator
since this typically happens during startup, but best to clear the warning
anyway).

Fix this by changing the container to a chunked_vector. It is passed
to dht::ring_position_range_vector_sharder, but since we're the only
user, we can change that class to accept the new type.

Fixes #24198.

Closes scylladb/scylladb#26353

(cherry picked from commit 7230a04799)

Closes scylladb/scylladb#26360
2025-10-20 10:28:59 +03:00
Patryk Jędrzejczak
323a7b8c55 test: test_raft_recovery_entry_loss: fix the typo in the test case name
(cherry picked from commit 71de01cd41)
2025-10-17 10:27:33 +00:00
Patryk Jędrzejczak
cd0bb11eef test: verify that schema pulls are disabled in the Raft-based recovery procedure
We do this at the end of `test_raft_recovery_entry_loss`. It's not worth
adding a separate regression test, as tests of the recovery procedure
are complicated and have a long running time. Also, we choose
`test_raft_recovery_entry_loss` out of all tests of the recovery
procedure because it does some schema changes.

(cherry picked from commit da8748e2b1)
2025-10-17 10:27:32 +00:00
Patryk Jędrzejczak
95d4206585 raft topology: disable schema pulls in the Raft-based recovery procedure
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.

The old gossip-based recovery procedure doesn't have this problem
because we disable schema pulls after completing the upgrade-to-group0
procedure, which is a part of the old recovery procedure.

Fixes #26569

(cherry picked from commit ec3a35303d)
2025-10-17 10:27:32 +00:00
Botond Dénes
2bc0c9c45b Update seastar submodule
* seastar 37983cd0...60e4b3b9 (1):
  > http: add "Connection: close" header to final server response.

Refs #26298
2025-10-17 10:22:45 +03:00
Artsiom Mishuta
de5a13db28 test.py: reintroducing sudo in resource_gather.py
conditionally reintroducing sudo for resource gathering
when running under docker

related: https://github.com/scylladb/scylladb/pull/26294#issuecomment-3346968097

fixes: https://github.com/scylladb/scylladb/issues/26312

Closes scylladb/scylladb#26401

(cherry picked from commit 99455833bd)

Closes scylladb/scylladb#26473
2025-10-17 09:27:13 +03:00
Pavel Emelyanov
dc3c6c3090 Update seastar submodule (iotune fixes for i7i/i4i)
* seastar c8a3515f9...37983cd04 (2):
  > iotune: fix very long warm up duration on systems with high cpu count
  > Merge '[Backport 2025.4] iotune: Add warmup period to measurements' from Robert Bindar
    iotune: Ignore measurements during warmup period
    iotune: Fix warmup calculation bug and botched rebase

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Fixes #26530

Closes scylladb/scylladb#26583
2025-10-16 20:30:37 +03:00
Jenkins Promoter
83babc20e3 Update ScyllaDB version to: 2025.4.0-rc2 2025-10-15 15:43:09 +03:00
Ernest Zaslavsky
04b9e98ef8 s3_client: track memory starvation in background filling fiber
Introduce a counter metric to monitor instances where the background
filling fiber is blocked due to insufficient memory in the S3 client.

Closes scylladb/scylladb#26466

(cherry picked from commit 413739824f)

Closes scylladb/scylladb#26555
2025-10-15 12:03:09 +02:00
Michał Chojnowski
de8c2a8196 test/boost/sstable_compressor_factory_test: fix thread-unsafe usage of Boost.Test
It turns out that Boost assertions are thread-unsafe
(they can't be used from multiple threads concurrently).
This sometimes causes the test to fail with cryptic log corruption.
Fix that by switching to thread-safe checks.

Fixes scylladb/scylladb#24982

Closes scylladb/scylladb#26472

(cherry picked from commit 7c6e84e2ec)

Closes scylladb/scylladb#26554
2025-10-15 12:08:54 +03:00
Jenkins Promoter
dd2e8a2105 Update pgo profiles - aarch64 2025-10-15 05:03:22 +03:00
Jenkins Promoter
90fd618967 Update pgo profiles - x86_64 2025-10-15 04:31:33 +03:00
Karol Nowacki
da8bd30a5b vector_search: Unify test timeouts
The test previously used separate timeouts for requests (5s) and the
overall test case (10s).

This change unifies both timeouts to 10 seconds.

(cherry picked from commit 62deea62a4)
2025-10-14 22:49:42 +00:00
Karol Nowacki
4e9a42f343 vector_search: Fix missing timeout reset
The `vector_store_client_test` could be flaky because the request timeout
was not consistently reset in all code paths. This could lead to a
timeout from a previous operation firing prematurely and failing the
test.

The fix ensures `abort_source_timeout` is reset before each request.
The implementation is also simplified by changing
`abort_source_timeout::reset` so that it combines the reset and arm
operations into a single invocation.

(cherry picked from commit 0de1fb8706)
2025-10-14 22:49:42 +00:00
Karol Nowacki
6db7481c7a vector_search: Refactor ANN request test
Refactor the `vector_store_client_test_ann_request` test to use the
`vs_mock_server` class, unifying the structure of the test cases.

This change also removes retry logic that waited for the server to be ready.
This is no longer necessary because the handler now exists for all index names
and consumes the entire request payload, preventing connection closures.

Previously, the server did not handle requests for unconfigured
indexes, which caused the connection to close. This could lead to a
race condition where the client would attempt to reuse a closed
connection.

(cherry picked from commit d99a4c3bad)
2025-10-14 22:49:42 +00:00
Karol Nowacki
62a5d4f932 vector_search: Fix flaky connection in tests
The vector store mock server was not reading the ANN request body,
which could cause it to prematurely close the connection.

This could lead to a race condition where the client attempts to reuse a
closed connection from its pool, resulting in a flaky test.

The fix is to always read the request body in the mock server.

(cherry picked from commit 2eb752e582)
2025-10-14 22:49:42 +00:00
Karol Nowacki
f5319b06ae vector_search: Fix flaky test by mocking DNS queries
The `vector_store_client_uri_update_to_invalid` test was flaky because
it performed real DNS lookups, making it dependent on the network
environment.

This commit replaces the live DNS queries with a mock to make the test
hermetic and prevent intermittent failures.

The `vector_search_metrics_test` test did not call configure{vs};
as a consequence, the test did real DNS queries, which made the test
flaky.

The refreshes counter increment has been moved before the call to the resolver.
In tests, the resolver is mocked, so the increment in production code was never reached.
Without this change, there is no way to test DNS counter increments.

The change also simplifies the test making it more readable.

(cherry picked from commit ac5e9c34b6)
2025-10-14 22:49:42 +00:00
Piotr Wieczorek
c191c31682 alternator: Correct RCU undercount in BatchGetItem
The `describe_multi_item` function treated the last reference-captured
argument as the number of used RCU half units. The caller
`batch_get_item`, however, expected this parameter to hold an item size.
This RCU value was then passed to
`rcu_consumed_capacity_counter::get_half_units`, treating the
already-calculated RCU integer as if it were a size in bytes.

This caused a second conversion that undercounted the true RCU. During
conversion, the number of bytes is divided by `RCU_BLOCK_SIZE_LENGTH`
(=4KB), so the double conversion divided the number of bytes by 16 MB.

The fix removes the second conversion in `describe_multi_item` and
changes the API of `describe_multi_item`.
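
To see the scale of the undercount, here is a small illustrative calculation. The 4 KB block size and the round-up rounding are assumptions based on the description above; the real `get_half_units` also takes a consistency-level flag.

```cpp
#include <cstdint>
#include <iostream>

constexpr uint64_t RCU_BLOCK_SIZE = 4096; // 4 KB per RCU half-unit block (assumed)

uint64_t get_half_units(uint64_t bytes) {
    return (bytes + RCU_BLOCK_SIZE - 1) / RCU_BLOCK_SIZE; // round up to whole blocks
}

int main() {
    uint64_t item_bytes = 100 * 1024;                              // a 100 KB item
    uint64_t correct = get_half_units(item_bytes);                 // 25 half units
    uint64_t buggy   = get_half_units(get_half_units(item_bytes)); // 1 half unit
    // Dividing by 4 KB twice is effectively dividing the item size by 16 MB.
    std::cout << "correct=" << correct << " buggy=" << buggy << "\n";
    return 0;
}
```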

Fixes: https://github.com/scylladb/scylladb/pull/25847

Closes scylladb/scylladb#25842

(cherry picked from commit a55c5e9ec7)

Closes scylladb/scylladb#26539
2025-10-14 11:53:09 +03:00
Dawid Mędrek
a4fd7019e3 replica/database: Fix description of validate_tablet_views_indexes
The current description is not accurate: the function doesn't throw
an exception if there's an invalid materialized view. Instead, it
simply logs the keyspaces that violate the requirement.

Furthermore, the experimental feature `views-with-tablets` is no longer
necessary for considering a materialized view as valid. It was dropped
in scylladb/scylladb@b409e85c20. The
replacement for it is the cluster feature `VIEWS_WITH_TABLETS`.

Fixes scylladb/scylladb#26420

Closes scylladb/scylladb#26421

(cherry picked from commit a9577e4d52)

Closes scylladb/scylladb#26476
2025-10-14 11:52:34 +03:00
Pavel Emelyanov
e18072d4b8 Merge '[Backport 2025.4] service/qos: set long timeout for auth queries on SL cache update' from Scylladb[bot]
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have a strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.
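
Schematically, the idea looks like the sketch below. The function name follows the commit message, but the contexts, bodies, and timeout values are only assumptions.

```cpp
#include <chrono>
#include <iostream>

// Illustrative only: pick a query_state whose timeout depends on the caller context.
enum class qos_context { user_auth_query, group0_reload };

struct query_state { std::chrono::milliseconds timeout; };

query_state qos_query_state(qos_context ctx) {
    using namespace std::chrono_literals;
    switch (ctx) {
    case qos_context::group0_reload:
        return {24h};   // internal reload: very long timeout, a timeout here could break group0
    case qos_context::user_auth_query:
    default:
        return {5s};    // the short default timeout for auth queries
    }
}

int main() {
    std::cout << qos_query_state(qos_context::group0_reload).timeout.count() << " ms\n";
    return 0;
}
```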

Fixes https://github.com/scylladb/scylladb/issues/25290

backport possible to improve stability

- (cherry picked from commit a1161c156f)

- (cherry picked from commit 3c3dd4cf9d)

- (cherry picked from commit ad1a5b7e42)

Parent PR: #26180

Closes scylladb/scylladb#26479

* github.com:scylladb/scylladb:
  service/qos: set long timeout for auth queries on SL cache update
  auth: add query_state parameter to query functions
  auth: refactor query_all_directly_granted
2025-10-13 15:26:21 +03:00
Robert Bindar
7353aa5aa5 Make scylla_io_setup detect request size for best write IOPS
We noticed during work on scylladb/seastar#2802 that on the i7i family
(later shown to be true for the i4i family as well),
the disks report their physical sector size incorrectly
as 512 bytes, while we proved we can achieve much better write IOPS with
4096 bytes.

This is not the case on the AWS i3en family, where the reported 512-byte
physical sector size is also the size at which we achieve the best write IOPS.

This patch works around this issue by changing `scylla_io_setup` to parse
the instance type out of `/sys/devices/virtual/dmi/id/product_name`
and run iotune with the correct request size based on the instance type.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#25315

(cherry picked from commit 2c74a6981b)

Closes scylladb/scylladb#26474
2025-10-13 15:25:16 +03:00
Michał Chojnowski
ec0b31b193 docs: fix a parameter name in API calls in sstable-dictionary-compression.rst
The correct argument name is `cf`, not `table`.

Fixes scylladb/scylladb#25275

Closes scylladb/scylladb#26447

(cherry picked from commit 87e3027c81)

Closes scylladb/scylladb#26495
2025-10-12 21:10:25 +03:00
Patryk Jędrzejczak
b5c3e2465f test: test_raft_no_quorum: test_can_restart: deflake the read barrier call
Expecting the group 0 read barrier to succeed with a timeout of 1s, just
after restarting 3 out of 5 voters, turned out to be flaky. In some
unlikely scenarios, such as multiple vote splits, the Raft leader
election could finish after the read barrier times out.

To deflake the test, we increase the timeout of Raft operations back to
300s for read barriers we expect to succeed.

Fixes #26457

Closes scylladb/scylladb#26489

(cherry picked from commit 5f68b9dc6b)

Closes scylladb/scylladb#26522
2025-10-12 21:02:02 +03:00
Asias He
3cae4a21ab repair: Rename incremental mode name
Using the name regular for the incremental mode could be confusing, since
regular might be interpreted as meaning non-incremental repair. It is better
to use incremental directly.

Before:

- regular (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)

After:

- incremental (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)

Fixes #26503

Closes scylladb/scylladb#26504

(cherry picked from commit 13dd88b010)

Closes scylladb/scylladb#26521
2025-10-12 21:01:05 +03:00
Ernest Zaslavsky
5c6335e029 s3_client: fix when condition to prevent infinite locking
Refine the condition variable predicate in the filling fiber to avoid
indefinite waiting when `close` is invoked.

Closes scylladb/scylladb#26449

(cherry picked from commit c2bab430d7)

Closes scylladb/scylladb#26497
2025-10-12 16:19:48 +03:00
Avi Kivity
de4975d181 dist: scylla_raid_setup: don't override XFS block size on modern kernels
In 6977064693 ("dist: scylla_raid_setup:
reduce xfs block size to 1k"), we reduced the XFS block size to 1k when
possible. This is because commitlog wants to write the smallest amount
of padding it can, and older Linux could only write a multiple of the
block size. Modern Linux [1] can O_DIRECT overwrite a range smaller than
a filesystem block.

However, this doesn't play well with some SSDs that have 512 byte
logical sector size and 4096 byte physical sector size - it causes them
to issue read-modify-writes.

To improve the situation, if we detect that the kernel is recent enough,
format the filesystem with its default block size, which should be optimal.

Note that commitlog will still issue sub-4k writes, which can translate
to RMW. There, we believe that the amplification is reduced since
sequential sub-physical-sector writes can be merged, and that the overhead
from commitlog space amplification is worse than the RMW overhead.

Tested on AWS i4i.large. fsqual report:

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   4096
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0.0003 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.7961 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8006 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

The sub-block overwrite cases are GOOD.

In comparison, the fsqual report for 1k (similar):

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   1024
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0.0005 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.7948 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0015 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0022 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.4999 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.798 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0012 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0019 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.5 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

Fixes #25441.

[1] ed1128c2d0

Closes scylladb/scylladb#25445

(cherry picked from commit 5d1846d783)

Closes scylladb/scylladb#26471
2025-10-12 16:17:16 +03:00
Piotr Dulikowski
1f73e18eaf Merge '[Backport 2025.4] db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Scylladb[bot]
Materialized views are currently in the experimental phase and using them
in tablet-based keyspaces requires starting Scylla with an experimental feature,
`views-with-tablets`. Any attempts to create a materialized view or secondary
index when it's not enabled will fail with an appropriate error.

After considerable effort, we're drawing close to bringing views out of the
experimental phase, and the experimental feature will no longer be needed.
However, materialized views in tablet-based keyspaces will still be restricted,
and creating them will only be possible after enabling the configuration option
`rf_rack_valid_keyspaces`. That's what we do in this PR.

In this patch, we adjust existing tests in the tree to work with the new
restriction. That shouldn't have been necessary because we've already seemingly
adjusted all of them to work with the configuration option, but some tests hid
well. We fix that mistake now.

After that, we introduce the new restriction. What's more, when starting Scylla,
we verify that there is no materialized view that would violate the contract.
If there are some that do, we list them, notify the user, and refuse to start.

High-level implementation strategy:

1. Name the restrictions in form of a function.
2. Adjust existing tests.
3. Restrict materialized views by both the experimental feature
   and the configuration option. Add validation test.
4. Drop the requirement for the experimental feature. Adjust the added test
   and add a new one.
5. Update the user documentation.

Fixes scylladb/scylladb#23030

Backport: 2025.4, as we are aiming to support materialized views for tablets from that version.

- (cherry picked from commit a1254fb6f3)

- (cherry picked from commit d6fcd18540)

- (cherry picked from commit 994f09530f)

- (cherry picked from commit 6322b5996d)

- (cherry picked from commit 71606ffdda)

- (cherry picked from commit 00222070cd)

- (cherry picked from commit 288be6c82d)

- (cherry picked from commit b409e85c20)

Parent PR: #25802

Closes scylladb/scylladb#26416

* github.com:scylladb/scylladb:
  view: Stop requiring experimental feature
  db/view: Verify valid configuration for tablet-based views
  db/view: Require rf_rack_valid_keyspaces when creating view
  test/cluster/random_failures: Skip creating secondary indexes
  test/cluster/mv: Mark test_mv_rf_change as skipped
  test/cluster: Adjust MV tests to RF-rack-validity
  test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
  db/view: Name requirement for views with tablets
2025-10-12 08:20:20 +02:00
Michał Jadwiszczak
931f9ca3db db/view/view_building_worker: update state again if some batch was finished during the update
There was a race between the loop in `view_building_worker::run_view_building_state_observer()`
and the moment when a batch finished its work (the `.finally()` callback
in `view_building_worker::batch::start()`).

The state observer waits on the `_vb_state_machine.event` CV and, when it's
woken, it takes the group0 read apply mutex and updates its state. While
updating the state, the observer looks at the `batch::state` field and
reacts to it accordingly.
On the other hand, when a batch finishes its work, it sets the `state` field
to `batch_state::finished` and does a broadcast on the
`_vb_state_machine.event` CV.
So if the batch executes the callback in `.finally()` while the
observer is updating its state, the observer may miss the event on the
CV and will never notice that the batch finished.

This patch fixes this by adding a `some_batch_finished` flag. Even if
the worker doesn't see an event on the CV, it will notice that the flag
was set and will do the next iteration.
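
The missed-wakeup pattern and the flag-based fix can be illustrated with a small standard-C++ analogue, with `std::condition_variable` standing in for seastar's CV and the surrounding group0/batch machinery stripped away:

```cpp
#include <condition_variable>
#include <iostream>
#include <mutex>
#include <thread>

std::mutex m;
std::condition_variable event;
bool some_batch_finished = false;   // the flag added by the patch
bool stop = false;

void observer() {
    std::unique_lock lk(m);
    while (!stop) {
        // Even if the broadcast fired while the observer was busy updating its
        // state, the flag records it, so the finished batch is not missed.
        event.wait(lk, [] { return some_batch_finished || stop; });
        if (some_batch_finished) {
            some_batch_finished = false;
            std::cout << "observer: reacting to a finished batch\n";
        }
    }
}

void batch_finished() {
    std::lock_guard lk(m);
    some_batch_finished = true;     // set the flag *before* broadcasting
    event.notify_all();
}

int main() {
    std::thread t(observer);
    batch_finished();
    { std::lock_guard lk(m); stop = true; }
    event.notify_all();
    t.join();
    return 0;
}
```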

Fixes scylladb/scylladb#26204

Closes scylladb/scylladb#26289

(cherry picked from commit 8d0d53016c)

Closes scylladb/scylladb#26500
2025-10-10 09:53:22 +02:00
Piotr Dulikowski
3775e8e49a Merge '[Backport 2025.4] db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground' from Scylladb[bot]
db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.

This method needs to be run only once during startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase the debuggability of any further issues
related to it (like https://github.com/scylladb/scylladb/issues/26403).

Fixes https://github.com/scylladb/scylladb/issues/26417

The patch should be backported to 2025.4

- (cherry picked from commit 575dce765e)

- (cherry picked from commit 84e4e34d81)

Parent PR: #26446

Closes scylladb/scylladb#26501

* github.com:scylladb/scylladb:
  db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
  db/view/view_building_worker: futurize and rename `start_background_fibers()`
2025-10-10 09:52:45 +02:00
Michał Jadwiszczak
f4d9513e0f db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.

This method needs to be run only once during startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase the debuggability of any further issues
related to it (like scylladb/scylladb#26403).

Fixes scylladb/scylladb#26417

(cherry picked from commit 84e4e34d81)
2025-10-09 22:39:33 +00:00
Michał Jadwiszczak
5eeb1e3e76 db/view/view_building_worker: futurize and rename start_background_fibers()
The next commit will move `discover_existing_staging_sstables()`
to the foreground, so to prepare for this we futurize the
`start_background_fibers()` method and rename it to better reflect
its purpose.

(cherry picked from commit 575dce765e)
2025-10-09 22:39:32 +00:00
Patryk Jędrzejczak
989aa0b237 raft topology: make the voter handler consider only group 0 members
In the Raft-based recovery procedure, we create a new group 0 and add
live nodes to it one by one. This means that for some time there are
nodes which belong to the topology, but not to the new group 0. The
voter handler running on the recovery leader incorrectly considers these
nodes while choosing voters.

The consequences:
- misleading logs, for example, "making servers {<ID of a non-member>}
  voters", where the non-member won't become a voter anyway,
- increased chance of majority loss during the recovery procedure, for
  example, all 3 nodes that first joined the new group 0 are in the same
  dc and rack, but only one of them becomes a voter because the voter
  handler tries to make non-members in other dcs/racks voters.
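
A schematic of the corrected behavior implied by the commit (illustrative types and names, not the actual voter-handler code): voter candidates are intersected with the current group 0 membership before any dc/rack balancing is applied.

```cpp
#include <iostream>
#include <set>
#include <string>

using node_id = std::string;

// Illustrative: restrict voter candidates to nodes that already joined group 0.
std::set<node_id> voter_candidates(const std::set<node_id>& topology_members,
                                   const std::set<node_id>& group0_members) {
    std::set<node_id> out;
    for (const auto& n : topology_members) {
        if (group0_members.contains(n)) {   // the membership check the fix adds
            out.insert(n);
        }
    }
    return out;
}

int main() {
    std::set<node_id> topology{"n1", "n2", "n3", "n4", "n5"};
    std::set<node_id> group0{"n1", "n2"};            // recovery still in progress
    for (const auto& n : voter_candidates(topology, group0)) {
        std::cout << n << " ";                       // only n1 n2 may become voters
    }
    std::cout << "\n";
    return 0;
}
```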

Fixes #26321

Closes scylladb/scylladb#26327

(cherry picked from commit 67d48a459f)

Closes scylladb/scylladb#26428
2025-10-09 18:17:49 +02:00
Michael Litvak
eba0a2cf72 service/qos: set long timeout for auth queries on SL cache update
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have a strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.

Fixes scylladb/scylladb#25290

(cherry picked from commit ad1a5b7e42)
2025-10-09 12:48:45 +00:00
Michael Litvak
3a9eb9b65f auth: add query_state parameter to query functions
add a query_state parameter to several auth functions that execute
internal queries. currently the queries use the
internal_distributed_query_state() query state, and we maintain this as
default, but we want also to be able to pass a query state from the
caller.

in particular, the auth queries currently use a timeout of 5 seconds,
and we will want to set a different timeout when executed in some
different context.

(cherry picked from commit 3c3dd4cf9d)
2025-10-09 12:48:45 +00:00
Michael Litvak
f75541b7b3 auth: refactor query_all_directly_granted
rewrite query_all_directly_granted to use execute_internal instead of
query_internal in a style that is more consistent with the rest of the
module.

This will also be useful for a later change because execute_internal
accepts an additional parameter of query_state.

(cherry picked from commit a1161c156f)
2025-10-09 12:48:45 +00:00
Michał Chojnowski
879db5855d utils/config_file: fix a missing allowed_values propagation in one of named_value constructors
In one of the constructors of `named_value`, the `allowed_values`
argument isn't used.

(This means that if some config entry uses this constructor,
the values aren't validated on the config layer,
and might give some lower layer a bad surprise).

Fix that.
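
A simplified sketch of this bug class (not the real `utils/config_file` code): one constructor overload accepts `allowed_values` but never stores it, so values set through entries built with that overload bypass validation.

```cpp
#include <algorithm>
#include <stdexcept>
#include <string>
#include <vector>

struct named_value {
    std::string value;
    std::vector<std::string> allowed_values;

    named_value(std::string v, std::vector<std::string> allowed)
        : value(std::move(v)), allowed_values(std::move(allowed)) {}

    // Buggy overload: `allowed` is accepted but never propagated, so set()
    // below has nothing to validate against.
    named_value(std::string v, std::vector<std::string> /*allowed*/, bool /*live_update*/)
        : value(std::move(v)) {}

    void set(std::string v) {
        if (!allowed_values.empty() &&
            std::find(allowed_values.begin(), allowed_values.end(), v) == allowed_values.end()) {
            throw std::invalid_argument("value not allowed: " + v);
        }
        value = std::move(v);
    }
};

int main() {
    named_value ok("none", {"none", "lz4", "zstd"});
    named_value buggy("none", {"none", "lz4", "zstd"}, true);
    try { ok.set("bogus"); } catch (const std::invalid_argument&) { /* rejected, as expected */ }
    buggy.set("bogus");   // silently accepted: a bad surprise for the lower layers
    return 0;
}
```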

Fixes scylladb/scylladb#26371

Closes scylladb/scylladb#26196

(cherry picked from commit 3b338e36c2)

Closes scylladb/scylladb#26425
2025-10-09 13:19:41 +03:00
Michał Chojnowski
22d3ee5670 sstables/trie: actually apply BYPASS CACHE to index reads
BYPASS CACHE is implemented for `bti_index_reader` by
giving it its own private `cached_file` wrappers over
Partitions.db and Rows.db, instead of passing it
the shared `cached_file` owned by the sstable.

But due to an oversight, the private `cached_file`s aren't
constructed on top of the raw Partitions.db and Rows.db
files, but on top of `cached_file_impl` wrappers around
those files, which means that BYPASS CACHE doesn't
actually do its job.

Tests based on `scylla_index_page_cache_*` metrics
and on CQL tracing still see the reads from the private
files as "cache misses", but those misses are served
from the shared cached files anyway, so the tests don't see
the problem. In this commit we extend `test_bti_index.py`
with a check that looks at reactor's `io_queue` metrics
instead, and catches the problem.
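
Schematically, with simplified stand-in types rather than the actual sstables/trie classes: the private cache must wrap the raw file, not the shared cache, or the "bypassing" reads are still served by (and populate) the shared cache.

```cpp
#include <memory>
#include <string>

struct file_source {                       // raw file reads
    virtual std::string read(int page) = 0;
    virtual ~file_source() = default;
};

struct raw_file : file_source {
    std::string read(int page) override { return "page " + std::to_string(page); }
};

struct cached_file : file_source {
    std::shared_ptr<file_source> underlying;   // where misses are read from
    explicit cached_file(std::shared_ptr<file_source> u) : underlying(std::move(u)) {}
    std::string read(int page) override {
        // (caching logic elided) -- every miss goes to `underlying`
        return underlying->read(page);
    }
};

int main() {
    auto raw    = std::make_shared<raw_file>();
    auto shared = std::make_shared<cached_file>(raw);   // the sstable-owned shared cache

    // Bug: the BYPASS CACHE reader wrapped the *shared cache*, so its "misses"
    // were still served by the shared cache.
    cached_file bypass_buggy(shared);

    // Fix: wrap the raw file directly, so BYPASS CACHE reads go straight to disk.
    cached_file bypass_fixed(raw);

    bypass_fixed.read(0);
    return 0;
}
```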

Fixes scylladb/scylladb#26372

Closes scylladb/scylladb#26373

(cherry picked from commit dbddba0794)

Closes scylladb/scylladb#26424
2025-10-09 13:17:29 +03:00
Dawid Mędrek
2bdf792f8e view: Stop requiring experimental feature
We modify the requirements for using materialized views in tablet-based
keyspaces. Before, it was necessary to enable the configuration option
`rf_rack_valid_keyspaces`, have the cluster feature `VIEWS_WITH_TABLETS`
enabled, and use the experimental feature `views-with-tablets`.
We drop the last requirement.

We adjust code to that change and provide a new validation test.
We also update the user documentation to reflect the changes.

Fixes scylladb/scylladb#23030

(cherry picked from commit b409e85c20)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
2e2d1f17bb db/view: Verify valid configuration for tablet-based views
Creating a materialized view or a secondary index in a tablet-based
keyspace requires that the user has enabled two options:

* experimental feature `views-with-tablets`,
* configuration option `rf_rack_valid_keyspaces`.

Because the latter has only become a necessity recently (in this series),
it's possible that there are already existing materialized views that
violate it.

We add a new check at start-up that iterates over existing views and
makes sure that that is not the case. Otherwise, Scylla notifies the user
of the problem.

(cherry picked from commit 288be6c82d)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
e9aba62cc5 db/view: Require rf_rack_valid_keyspaces when creating view
We extend the requirements for being able to create materialized views
and secondary indexes in tablet-based keyspaces. It's now necessary to
enable the configuration option `rf_rack_valid_keyspaces`. This is
a stepping stone towards bringing materialized views and secondary
indexes with tablets out of the experimental phase.

We add a validation test to verify the changes.

Refs scylladb/scylladb#23030

(cherry picked from commit 00222070cd)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
a7d0cf6dd0 test/cluster/random_failures: Skip creating secondary indexes
Materialized views are going to require the configuration option
`rf_rack_valid_keyspaces` when being created in tablet-based keyspaces.
Since random-failure tests still haven't been adjusted to work with it,
and because it's not trivial, we skip the cases when we end up creating
or dropping an index.

(cherry picked from commit 71606ffdda)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
6e94c075e3 test/cluster/mv: Mark test_mv_rf_change as skipped
The test will not work with `rf_rack_valid_keyspaces`. Since the option
is going to become a requirement for using views with tablets, the test
will need to be rewritten to take that into consideration. Since that
adjustment doesn't seem trivial, we mark the test as skipped for the
time being.

(cherry picked from commit 6322b5996d)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
f90ca413a0 test/cluster: Adjust MV tests to RF-rack-validity
Some of the new tests covering materialized views explicitly disabled
the configuration option `rf_rack_valid_keyspaces`. It's going to become
a new requirement for views with tablets, so we adjust those tests and
enable the option. There is one exception, the test:

`cluster/mv/test_mv_topology_change.py::test_mv_rf_change`

We handle it separately in the following commit.

(cherry picked from commit 994f09530f)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
5e0f5f4b44 test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
The test cases in the file aren't run via an existing interface like
`do_with_cql_env`, but they rely on a more direct approach -- calling
one of the schema loader tools. Because of that, they manage the
`db::config` object on their own and don't enable the configuration
option `rf_rack_valid_keyspaces`.

That hasn't been a problem so far since the test doesn't attempt to
create RF-rack-invalid keyspaces anyway. However, in an upcoming commit,
we're going to further restrict views with tablets and require that the
option is enabled.

To prepare for that, we enable the option in all test cases. It's only
necessary in a small subset of them, but it won't hurt to enforce it
everywhere, so let's do that.

Refs scylladb/scylladb#23958

(cherry picked from commit d6fcd18540)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
5d32fef3ae db/view: Name requirement for views with tablets
We add a named requirement, a function, for materialized views with tablets.
It decides whether we can create views and secondary indexes in a given
keyspace. It's a stepping stone towards modifying the requirements for it.

This way, we keep the code in one place, so it's not possible to forget
to modify it somewhere. It also makes it more organized and concise.

(cherry picked from commit a1254fb6f3)
2025-10-06 13:19:53 +00:00
Botond Dénes
1b5c46a796 Merge '[Backport 2025.4] test: dtest: test_limits.py: migrate from dtest' from Dario Mirovic
Backport motivation:
The scylla-dtest PR [limits_test.py: remove tests already ported to scylladb repo](https://github.com/scylladb/scylla-dtest/pull/6232), which removes the migrated tests, got merged before the branch-2025.4 separation.
The scylladb PR [test: dtest: test_limits.py: migrate from dtest](https://github.com/scylladb/scylladb/pull/26077) got merged after the branch-2025.4 separation.
This caused the tests to be fully removed from branch-2025.4. This backport PR makes sure the tests are present in scylladb branch-2025.4.

This PR migrates limits tests from dtest to this repository.

One reason is that there is an ongoing effort to migrate tests from dtest to here.

Debug logs are enabled on `test_max_cells` for `lsa-timing` logger, to have more information about memory reclaim operation times and memory chunk sizes. This will allow analysis of their value distributions, which can be helpful with debugging if the issue reoccurs.

Also, scylladb keeps sql files with metrics which, with some modifications, can be used to track metrics over time for some tests. This would show if there are pauses and spikes or the test performance is more or less consistent over time.

scylla-dtest PR that removes migrated tests:
[limits_test.py: remove tests already ported to scylladb repo #6232](https://github.com/scylladb/scylla-dtest/pull/6232)

Fixes #25097

- (cherry picked from commit 82e9623911)
- (cherry picked from commit 70128fd5c7)
- (cherry picked from commit 554fd5e801)
- (cherry picked from commit b3347bcf84)

Parent PR: #26077

Closes scylladb/scylladb#26359

* github.com:scylladb/scylladb:
  test: dtest: limits_test.py: test_max_cells log level
  test: dtest: limits_test.py: make the tests work
  test: dtest: test_limits.py: remove test that are not being migrated
  test: dtest: copy unmodified limits_test.py
2025-10-06 15:46:49 +03:00
Jenkins Promoter
f2c5874fa9 Update ScyllaDB version to: 2025.4.0-rc1 2025-10-03 21:26:12 +03:00
Botond Dénes
4049dae0b2 tools/scylla-sstable: fix doc links
The doc links in scylla-sstable help output are static, so they always
point to the documentation of the latest stable release, not to the
documentation of the release the tool binary is from. On top of that,
the links point to old open-source documentation, which is now EOL.
Fix both problems: point the links at the new source-available documentation
pages and make them version-aware.

(cherry picked from commit fe73c90df9)
2025-10-03 14:29:19 +00:00
Botond Dénes
8b83294c0f release: adjust doc_link() for the post source-available world
There is no more separate enterprise product and the doc urls are
slightly different.

(cherry picked from commit 15a4a9936b)
2025-10-03 14:29:19 +00:00
Botond Dénes
5930726b38 tools/scylla-nodetool: remove trailing " from doc urls
They are an accidental leftover from a previous way of storing command
descriptions.

(cherry picked from commit 5a69838d06)
2025-10-03 14:29:19 +00:00
Dario Mirovic
664cdd3d99 test: dtest: limits_test.py: test_max_cells log level
Set `lsa-timing` logger log level to `debug`. This will help with
the analysis of the whole spectrum of memory reclaim operation
times and memory sizes.

Refs #25097

(cherry picked from commit b3347bcf84)
2025-10-01 22:40:34 +02:00
Dario Mirovic
4ea6c51fb1 test: dtest: limits_test.py: make the tests work
Remove unused imports and markers.
Remove Apache license header.

Enable the test in suite.yaml for `dev` and `debug` modes.

Refs #25097

(cherry picked from commit 554fd5e801)
2025-10-01 22:40:29 +02:00
Dario Mirovic
eb9babfd4a test: dtest: test_limits.py: remove test that are not being migrated
Refs #25097

(cherry picked from commit 70128fd5c7)
2025-10-01 22:40:24 +02:00
Dario Mirovic
558f460517 test: dtest: copy unmodified limits_test.py
Copy limits_test.py from scylla-dtest to test/cluster/dtest/limits_test.py.
Add license header.

Disable it for `debug`, `dev`, and `release` mode.

Refs #25097

(cherry picked from commit 82e9623911)
2025-10-01 22:40:16 +02:00
Jenkins Promoter
a9f4024c1b Update pgo profiles - aarch64 2025-10-01 04:42:23 +03:00
Jenkins Promoter
6969918d31 Update pgo profiles - x86_64 2025-10-01 04:20:49 +03:00
Luis Freitas
d69edfcd34 Update ScyllaDB version to: 2025.4.0-rc0 2025-09-30 18:51:59 +03:00
99 changed files with 1549 additions and 614 deletions

.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
url = ../seastar
url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2025.4.0-dev
VERSION=2025.4.0-rc3
if test -f version
then

View File

@@ -3636,16 +3636,16 @@ future<std::vector<rjson::value>> executor::describe_multi_item(schema_ptr schem
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units) {
noncopyable_function<void(uint64_t)> item_callback) {
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();
std::vector<rjson::value> ret;
for (auto& result_row : result_set->rows()) {
rjson::value item = rjson::empty_object();
rcu_consumed_capacity_counter consumed_capacity;
describe_single_item(*selection, result_row, *attrs_to_get, item, &consumed_capacity._total_bytes);
rcu_half_units += consumed_capacity.get_half_units();
uint64_t item_length_in_bytes = 0;
describe_single_item(*selection, result_row, *attrs_to_get, item, &item_length_in_bytes);
item_callback(item_length_in_bytes);
ret.push_back(std::move(item));
co_await coroutine::maybe_yield();
}
@@ -4584,7 +4584,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
}
};
std::vector<table_requests> requests;
std::vector<std::vector<uint64_t>> responses_sizes;
uint batch_size = 0;
for (auto it = request_items.MemberBegin(); it != request_items.MemberEnd(); ++it) {
table_requests rs(get_table_from_batch_request(_proxy, it));
@@ -4612,11 +4611,10 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
// If we got here, all "requests" are valid, so let's start the
// requests for the different partitions all in parallel.
std::vector<future<std::vector<rjson::value>>> response_futures;
responses_sizes.resize(requests.size());
size_t responses_sizes_pos = 0;
for (const auto& rs : requests) {
responses_sizes[responses_sizes_pos].resize(rs.requests.size());
size_t pos = 0;
std::vector<uint64_t> consumed_rcu_half_units_per_table(requests.size());
for (size_t i = 0; i < requests.size(); i++) {
const table_requests& rs = requests[i];
bool is_quorum = rs.cl == db::consistency_level::LOCAL_QUORUM;
lw_shared_ptr<stats> per_table_stats = get_stats_from_schema(_proxy, *rs.schema);
per_table_stats->api_operations.batch_get_item_histogram.add(rs.requests.size());
for (const auto &r : rs.requests) {
@@ -4639,16 +4637,17 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
auto command = ::make_lw_shared<query::read_command>(rs.schema->id(), rs.schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
query::tombstone_limit(_proxy.get_tombstone_limit()));
command->allow_limit = db::allow_per_partition_rate_limit::yes;
const auto item_callback = [is_quorum, &rcus_per_table = consumed_rcu_half_units_per_table[i]](uint64_t size) {
rcus_per_table += rcu_consumed_capacity_counter::get_half_units(size, is_quorum);
};
future<std::vector<rjson::value>> f = _proxy.query(rs.schema, std::move(command), std::move(partition_ranges), rs.cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), permit, client_state, trace_state)).then(
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get, &response_size = responses_sizes[responses_sizes_pos][pos]] (service::storage_proxy::coordinator_query_result qr) mutable {
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get, item_callback = std::move(item_callback)] (service::storage_proxy::coordinator_query_result qr) mutable {
utils::get_local_injector().inject("alternator_batch_get_item", [] { throw std::runtime_error("batch_get_item injection"); });
return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get), response_size);
return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get), std::move(item_callback));
});
pos++;
response_futures.push_back(std::move(f));
}
responses_sizes_pos++;
}
// Wait for all requests to complete, and then return the response.
@@ -4660,14 +4659,11 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rjson::value response = rjson::empty_object();
rjson::add(response, "Responses", rjson::empty_object());
rjson::add(response, "UnprocessedKeys", rjson::empty_object());
size_t rcu_half_units;
auto fut_it = response_futures.begin();
responses_sizes_pos = 0;
rjson::value consumed_capacity = rjson::empty_array();
for (const auto& rs : requests) {
for (size_t i = 0; i < requests.size(); i++) {
const table_requests& rs = requests[i];
std::string table = table_name(*rs.schema);
size_t pos = 0;
rcu_half_units = 0;
for (const auto &r : rs.requests) {
auto& pk = r.first;
auto& cks = r.second;
@@ -4682,7 +4678,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
for (rjson::value& json : results) {
rjson::push_back(response["Responses"][table], std::move(json));
}
rcu_half_units += rcu_consumed_capacity_counter::get_half_units(responses_sizes[responses_sizes_pos][pos], rs.cl == db::consistency_level::LOCAL_QUORUM);
} catch(...) {
eptr = std::current_exception();
// This read of potentially several rows in one partition,
@@ -4706,8 +4701,8 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rjson::push_back(response["UnprocessedKeys"][table]["Keys"], std::move(*ck.second));
}
}
pos++;
}
uint64_t rcu_half_units = consumed_rcu_half_units_per_table[i];
_stats.rcu_half_units_total += rcu_half_units;
lw_shared_ptr<stats> per_table_stats = get_stats_from_schema(_proxy, *rs.schema);
per_table_stats->rcu_half_units_total += rcu_half_units;
@@ -4717,7 +4712,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rjson::add(entry, "CapacityUnits", rcu_half_units*0.5);
rjson::push_back(consumed_capacity, std::move(entry));
}
responses_sizes_pos++;
}
if (should_add_rcu) {

View File

@@ -228,12 +228,15 @@ public:
const std::optional<attrs_to_get>&,
uint64_t* = nullptr);
// Converts a multi-row selection result to JSON compatible with DynamoDB.
// For each row, this method calls item_callback, which takes the size of
// the item as the parameter.
static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units);
noncopyable_function<void(uint64_t)> item_callback = {});
static void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,

View File

@@ -2924,7 +2924,7 @@
},
{
"name":"incremental_mode",
"description":"Set the incremental repair mode. Can be 'disabled', 'regular', or 'full'. 'regular': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to regular.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to regular.",
"required":false,
"allowMultiple":false,
"type":"string",

View File

@@ -233,9 +233,9 @@ future<role_set> ldap_role_manager::query_granted(std::string_view grantee_name,
}
future<role_to_directly_granted_map>
ldap_role_manager::query_all_directly_granted() {
ldap_role_manager::query_all_directly_granted(::service::query_state& qs) {
role_to_directly_granted_map result;
auto roles = co_await query_all();
auto roles = co_await query_all(qs);
for (auto& role: roles) {
auto granted_set = co_await query_granted(role, recursive_role_query::no);
for (auto& granted: granted_set) {
@@ -247,8 +247,8 @@ ldap_role_manager::query_all_directly_granted() {
co_return result;
}
future<role_set> ldap_role_manager::query_all() {
return _std_mgr.query_all();
future<role_set> ldap_role_manager::query_all(::service::query_state& qs) {
return _std_mgr.query_all(qs);
}
future<> ldap_role_manager::create_role(std::string_view role_name) {
@@ -311,12 +311,12 @@ future<bool> ldap_role_manager::can_login(std::string_view role_name) {
}
future<std::optional<sstring>> ldap_role_manager::get_attribute(
std::string_view role_name, std::string_view attribute_name) {
return _std_mgr.get_attribute(role_name, attribute_name);
std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.get_attribute(role_name, attribute_name, qs);
}
future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name) {
return _std_mgr.query_attribute_for_all(attribute_name);
future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.query_attribute_for_all(attribute_name, qs);
}
future<> ldap_role_manager::set_attribute(

View File

@@ -75,9 +75,9 @@ class ldap_role_manager : public role_manager {
future<role_set> query_granted(std::string_view, recursive_role_query) override;
future<role_to_directly_granted_map> query_all_directly_granted() override;
future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
future<role_set> query_all() override;
future<role_set> query_all(::service::query_state&) override;
future<bool> exists(std::string_view) override;
@@ -85,9 +85,9 @@ class ldap_role_manager : public role_manager {
future<bool> can_login(std::string_view) override;
future<std::optional<sstring>> get_attribute(std::string_view, std::string_view) override;
future<std::optional<sstring>> get_attribute(std::string_view, std::string_view, ::service::query_state&) override;
future<role_manager::attribute_vals> query_attribute_for_all(std::string_view) override;
future<role_manager::attribute_vals> query_attribute_for_all(std::string_view, ::service::query_state&) override;
future<> set_attribute(std::string_view, std::string_view, std::string_view, ::service::group0_batch& mc) override;

View File

@@ -78,11 +78,11 @@ future<role_set> maintenance_socket_role_manager::query_granted(std::string_view
return operation_not_supported_exception<role_set>("QUERY GRANTED");
}
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted() {
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted(::service::query_state&) {
return operation_not_supported_exception<role_to_directly_granted_map>("QUERY ALL DIRECTLY GRANTED");
}
future<role_set> maintenance_socket_role_manager::query_all() {
future<role_set> maintenance_socket_role_manager::query_all(::service::query_state&) {
return operation_not_supported_exception<role_set>("QUERY ALL");
}
@@ -98,11 +98,11 @@ future<bool> maintenance_socket_role_manager::can_login(std::string_view role_na
return make_ready_future<bool>(true);
}
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<std::optional<sstring>>("GET ATTRIBUTE");
}
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name) {
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<role_manager::attribute_vals>("QUERY ATTRIBUTE");
}

View File

@@ -53,9 +53,9 @@ public:
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
virtual future<role_set> query_all() override;
virtual future<role_set> query_all(::service::query_state&) override;
virtual future<bool> exists(std::string_view role_name) override;
@@ -63,9 +63,9 @@ public:
virtual future<bool> can_login(std::string_view role_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;
virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;

View File

@@ -17,12 +17,17 @@
#include <seastar/core/format.hh>
#include <seastar/core/sstring.hh>
#include "auth/common.hh"
#include "auth/resource.hh"
#include "cql3/description.hh"
#include "seastarx.hh"
#include "exceptions/exceptions.hh"
#include "service/raft/raft_group0_client.hh"
namespace service {
class query_state;
};
namespace auth {
struct role_config final {
@@ -167,9 +172,9 @@ public:
/// (role2, role3)
/// }
///
virtual future<role_to_directly_granted_map> query_all_directly_granted() = 0;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state& = internal_distributed_query_state()) = 0;
virtual future<role_set> query_all() = 0;
virtual future<role_set> query_all(::service::query_state& = internal_distributed_query_state()) = 0;
virtual future<bool> exists(std::string_view role_name) = 0;
@@ -186,12 +191,12 @@ public:
///
/// \returns the value of the named attribute, if one is set.
///
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) = 0;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;
///
/// \returns a mapping of each role's value for the named attribute, if one is set for the role.
///
virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name) = 0;
virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;
/// Sets `attribute_name` with `attribute_value` for `role_name`.
/// \returns an exceptional future with nonexistant_role if the role does not exist.

View File

@@ -663,21 +663,30 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
});
}
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted() {
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted(::service::query_state& qs) {
const sstring query = seastar::format("SELECT * FROM {}.{}",
get_auth_ks_name(_qp),
meta::role_members_table::name);
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::ONE,
qs,
cql3::query_processor::cache_internal::yes);
role_to_directly_granted_map roles_map;
co_await _qp.query_internal(query, [&roles_map] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
roles_map.insert({row.get_as<sstring>("member"), row.get_as<sstring>("role")});
co_return stop_iteration::no;
});
std::transform(
results->begin(),
results->end(),
std::inserter(roles_map, roles_map.begin()),
[] (const cql3::untyped_result_set_row& row) {
return std::make_pair(row.get_as<sstring>("member"), row.get_as<sstring>("role")); }
);
co_return roles_map;
}
future<role_set> standard_role_manager::query_all() {
future<role_set> standard_role_manager::query_all(::service::query_state& qs) {
const sstring query = seastar::format("SELECT {} FROM {}.{}",
meta::roles_table::role_col_name,
get_auth_ks_name(_qp),
@@ -695,7 +704,7 @@ future<role_set> standard_role_manager::query_all() {
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
qs,
cql3::query_processor::cache_internal::yes);
role_set roles;
@@ -727,11 +736,11 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {
});
}
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
const sstring query = seastar::format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",
get_auth_ks_name(_qp),
meta::role_attributes_table::name);
const auto result_set = co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
const auto result_set = co_await _qp.execute_internal(query, db::consistency_level::ONE, qs, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
if (!result_set->empty()) {
const cql3::untyped_result_set_row &row = result_set->one();
co_return std::optional<sstring>(row.get_as<sstring>("value"));
@@ -739,11 +748,11 @@ future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_
co_return std::optional<sstring>{};
}
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name) {
return query_all().then([this, attribute_name] (role_set roles) {
return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles)] (attribute_vals &role_to_att_val) {
return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name] (sstring role) {
return get_attribute(role, attribute_name).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name, ::service::query_state& qs) {
return query_all(qs).then([this, attribute_name, &qs] (role_set roles) {
return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles), &qs] (attribute_vals &role_to_att_val) {
return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name, &qs] (sstring role) {
return get_attribute(role, attribute_name, qs).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
if (att_val) {
role_to_att_val.emplace(std::move(role), std::move(*att_val));
}
@@ -788,7 +797,7 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std
future<std::vector<cql3::description>> standard_role_manager::describe_role_grants() {
std::vector<cql3::description> result{};
const auto grants = co_await query_all_directly_granted();
const auto grants = co_await query_all_directly_granted(internal_distributed_query_state());
result.reserve(grants.size());
for (const auto& [grantee_role, granted_role] : grants) {

View File

@@ -66,9 +66,9 @@ public:
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
virtual future<role_set> query_all() override;
virtual future<role_set> query_all(::service::query_state&) override;
virtual future<bool> exists(std::string_view role_name) override;
@@ -76,9 +76,9 @@ public:
virtual future<bool> can_login(std::string_view role_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;
virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;

View File

@@ -1506,7 +1506,7 @@ future<> compaction_manager::maybe_wait_for_sstable_count_reduction(compaction_g
co_return;
}
auto num_runs_for_compaction = [&, this] -> future<size_t> {
auto& cs = t.get_compaction_strategy();
auto cs = t.get_compaction_strategy();
auto desc = co_await cs.get_sstables_for_compaction(t, get_strategy_control());
co_return std::ranges::size(desc.sstables
| std::views::transform(std::mem_fn(&sstables::sstable::run_identifier))

View File

@@ -804,9 +804,9 @@ compaction_strategy_state compaction_strategy_state::make(const compaction_strat
case compaction_strategy_type::incremental:
return compaction_strategy_state(default_empty_state{});
case compaction_strategy_type::leveled:
return compaction_strategy_state(leveled_compaction_strategy_state{});
return compaction_strategy_state(seastar::make_shared<leveled_compaction_strategy_state>());
case compaction_strategy_type::time_window:
return compaction_strategy_state(time_window_compaction_strategy_state{});
return compaction_strategy_state(seastar::make_shared<time_window_compaction_strategy_state>());
default:
throw std::runtime_error("strategy not supported");
}

View File

@@ -18,7 +18,7 @@ namespace compaction {
class compaction_strategy_state {
public:
struct default_empty_state {};
using states_variant = std::variant<default_empty_state, leveled_compaction_strategy_state, time_window_compaction_strategy_state>;
using states_variant = std::variant<default_empty_state, leveled_compaction_strategy_state_ptr, time_window_compaction_strategy_state_ptr>;
private:
states_variant _state;
public:

View File

@@ -14,12 +14,12 @@
namespace compaction {
leveled_compaction_strategy_state& leveled_compaction_strategy::get_state(compaction_group_view& table_s) const {
return table_s.get_compaction_strategy_state().get<leveled_compaction_strategy_state>();
leveled_compaction_strategy_state_ptr leveled_compaction_strategy::get_state(compaction_group_view& table_s) const {
return table_s.get_compaction_strategy_state().get<leveled_compaction_strategy_state_ptr>();
}
future<compaction_descriptor> leveled_compaction_strategy::get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) {
auto& state = get_state(table_s);
auto state = get_state(table_s);
auto candidates = co_await control.candidates(table_s);
// NOTE: leveled_manifest creation may be slightly expensive, so later on,
// we may want to store it in the strategy itself. However, the sstable
@@ -27,10 +27,10 @@ future<compaction_descriptor> leveled_compaction_strategy::get_sstables_for_comp
// sstable in it may be marked for deletion after compacted.
// Currently, we create a new manifest whenever it's time for compaction.
leveled_manifest manifest = leveled_manifest::create(table_s, candidates, _max_sstable_size_in_mb, _stcs_options);
if (!state.last_compacted_keys) {
generate_last_compacted_keys(state, manifest);
if (!state->last_compacted_keys) {
generate_last_compacted_keys(*state, manifest);
}
auto candidate = manifest.get_compaction_candidates(*state.last_compacted_keys, state.compaction_counter);
auto candidate = manifest.get_compaction_candidates(*state->last_compacted_keys, state->compaction_counter);
if (!candidate.sstables.empty()) {
auto main_set = co_await table_s.main_sstable_set();
@@ -78,12 +78,12 @@ compaction_descriptor leveled_compaction_strategy::get_major_compaction_job(comp
}
void leveled_compaction_strategy::notify_completion(compaction_group_view& table_s, const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added) {
auto& state = get_state(table_s);
auto state = get_state(table_s);
// All the update here is only relevant for regular compaction's round-robin picking policy, and if
// last_compacted_keys wasn't generated by regular, it means regular is disabled since last restart,
// therefore we can skip the updates here until regular runs for the first time. Once it runs,
// it will be able to generate last_compacted_keys correctly by looking at metadata of files.
if (removed.empty() || added.empty() || !state.last_compacted_keys) {
if (removed.empty() || added.empty() || !state->last_compacted_keys) {
return;
}
auto min_level = std::numeric_limits<uint32_t>::max();
@@ -99,16 +99,16 @@ void leveled_compaction_strategy::notify_completion(compaction_group_view& table
}
target_level = std::max(target_level, int(candidate->get_sstable_level()));
}
state.last_compacted_keys.value().at(min_level) = last->get_last_decorated_key();
state->last_compacted_keys.value().at(min_level) = last->get_last_decorated_key();
for (int i = leveled_manifest::MAX_LEVELS - 1; i > 0; i--) {
state.compaction_counter[i]++;
state->compaction_counter[i]++;
}
state.compaction_counter[target_level] = 0;
state->compaction_counter[target_level] = 0;
if (leveled_manifest::logger.level() == logging::log_level::debug) {
for (auto j = 0U; j < state.compaction_counter.size(); j++) {
leveled_manifest::logger.debug("CompactionCounter: {}: {}", j, state.compaction_counter[j]);
for (auto j = 0U; j < state->compaction_counter.size(); j++) {
leveled_manifest::logger.debug("CompactionCounter: {}: {}", j, state->compaction_counter[j]);
}
}
}

View File

@@ -36,6 +36,8 @@ struct leveled_compaction_strategy_state {
leveled_compaction_strategy_state();
};
using leveled_compaction_strategy_state_ptr = seastar::shared_ptr<leveled_compaction_strategy_state>;
class leveled_compaction_strategy : public compaction_strategy_impl {
static constexpr int32_t DEFAULT_MAX_SSTABLE_SIZE_IN_MB = 160;
static constexpr auto SSTABLE_SIZE_OPTION = "sstable_size_in_mb";
@@ -45,7 +47,7 @@ class leveled_compaction_strategy : public compaction_strategy_impl {
private:
int32_t calculate_max_sstable_size_in_mb(std::optional<sstring> option_value) const;
leveled_compaction_strategy_state& get_state(compaction_group_view& table_s) const;
leveled_compaction_strategy_state_ptr get_state(compaction_group_view& table_s) const;
public:
static unsigned ideal_level_for_input(const std::vector<sstables::shared_sstable>& input, uint64_t max_sstable_size);
static void validate_options(const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options);

View File

@@ -13,6 +13,7 @@
#include "sstables/sstables.hh"
#include "sstables/sstable_set_impl.hh"
#include "compaction_strategy_state.hh"
#include "utils/error_injection.hh"
#include <ranges>
@@ -22,8 +23,8 @@ extern logging::logger clogger;
using timestamp_type = api::timestamp_type;
time_window_compaction_strategy_state& time_window_compaction_strategy::get_state(compaction_group_view& table_s) const {
return table_s.get_compaction_strategy_state().get<time_window_compaction_strategy_state>();
time_window_compaction_strategy_state_ptr time_window_compaction_strategy::get_state(compaction_group_view& table_s) const {
return table_s.get_compaction_strategy_state().get<time_window_compaction_strategy_state_ptr>();
}
const std::unordered_map<sstring, std::chrono::seconds> time_window_compaction_strategy_options::valid_window_units = {
@@ -335,7 +336,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_
future<compaction_descriptor>
time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) {
auto& state = get_state(table_s);
auto state = get_state(table_s);
auto compaction_time = gc_clock::now();
auto candidates = co_await control.candidates(table_s);
@@ -344,7 +345,7 @@ time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_vi
}
auto now = db_clock::now();
if (now - state.last_expired_check > _options.expired_sstable_check_frequency) {
if (now - state->last_expired_check > _options.expired_sstable_check_frequency) {
clogger.debug("[{}] TWCS expired check sufficiently far in the past, checking for fully expired SSTables", fmt::ptr(this));
// Find fully expired SSTables. Those will be included no matter what.
@@ -356,12 +357,14 @@ time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_vi
// Keep checking for fully_expired_sstables until we don't find
// any among the candidates, meaning they are either already compacted
// or registered for compaction.
state.last_expired_check = now;
state->last_expired_check = now;
} else {
clogger.debug("[{}] TWCS skipping check for fully expired SSTables", fmt::ptr(this));
}
auto compaction_candidates = get_next_non_expired_sstables(table_s, control, std::move(candidates), compaction_time);
co_await utils::get_local_injector().inject("twcs_get_sstables_for_compaction", utils::wait_for_message(30s));
auto compaction_candidates = get_next_non_expired_sstables(table_s, control, std::move(candidates), compaction_time, *state);
clogger.debug("[{}] Going to compact {} non-expired sstables", fmt::ptr(this), compaction_candidates.size());
co_return compaction_descriptor(std::move(compaction_candidates));
}
@@ -384,8 +387,8 @@ time_window_compaction_strategy::compaction_mode(const time_window_compaction_st
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> non_expiring_sstables, gc_clock::time_point compaction_time) {
auto most_interesting = get_compaction_candidates(table_s, control, non_expiring_sstables);
std::vector<sstables::shared_sstable> non_expiring_sstables, gc_clock::time_point compaction_time, time_window_compaction_strategy_state& state) {
auto most_interesting = get_compaction_candidates(table_s, control, non_expiring_sstables, state);
if (!most_interesting.empty()) {
return most_interesting;
@@ -410,14 +413,14 @@ time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_
}
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::get_compaction_candidates(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> candidate_sstables) {
auto& state = get_state(table_s);
time_window_compaction_strategy::get_compaction_candidates(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> candidate_sstables, time_window_compaction_strategy_state& state) {
auto [buckets, max_timestamp] = get_buckets(std::move(candidate_sstables), _options);
// Update the highest window seen, if necessary
state.highest_window_seen = std::max(state.highest_window_seen, max_timestamp);
return newest_bucket(table_s, control, std::move(buckets), table_s.min_compaction_threshold(), table_s.schema()->max_compaction_threshold(),
state.highest_window_seen);
state.highest_window_seen, state);
}
timestamp_type
@@ -465,8 +468,7 @@ namespace compaction {
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::newest_bucket(compaction_group_view& table_s, strategy_control& control, std::map<timestamp_type, std::vector<sstables::shared_sstable>> buckets,
int min_threshold, int max_threshold, timestamp_type now) {
auto& state = get_state(table_s);
int min_threshold, int max_threshold, timestamp_type now, time_window_compaction_strategy_state& state) {
clogger.debug("time_window_compaction_strategy::newest_bucket:\n now {}\n{}", now, buckets);
for (auto&& [key, bucket] : buckets | std::views::reverse) {
@@ -517,7 +519,7 @@ time_window_compaction_strategy::trim_to_threshold(std::vector<sstables::shared_
}
future<int64_t> time_window_compaction_strategy::estimated_pending_compactions(compaction_group_view& table_s) const {
auto& state = get_state(table_s);
auto state = get_state(table_s);
auto min_threshold = table_s.min_compaction_threshold();
auto max_threshold = table_s.schema()->max_compaction_threshold();
auto main_set = co_await table_s.main_sstable_set();
@@ -526,7 +528,7 @@ future<int64_t> time_window_compaction_strategy::estimated_pending_compactions(c
int64_t n = 0;
for (auto& [bucket_key, bucket] : buckets) {
switch (compaction_mode(state, bucket, bucket_key, max_timestamp, min_threshold)) {
switch (compaction_mode(*state, bucket, bucket_key, max_timestamp, min_threshold)) {
case bucket_compaction_mode::size_tiered:
n += size_tiered_compaction_strategy::estimated_pending_compactions(bucket, min_threshold, max_threshold, _stcs_options);
break;

View File

@@ -67,6 +67,8 @@ struct time_window_compaction_strategy_state {
std::unordered_set<api::timestamp_type> recent_active_windows;
};
using time_window_compaction_strategy_state_ptr = seastar::shared_ptr<time_window_compaction_strategy_state>;
class time_window_compaction_strategy : public compaction_strategy_impl {
time_window_compaction_strategy_options _options;
size_tiered_compaction_strategy_options _stcs_options;
@@ -87,7 +89,7 @@ public:
static void validate_options(const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options);
private:
time_window_compaction_strategy_state& get_state(compaction_group_view& table_s) const;
time_window_compaction_strategy_state_ptr get_state(compaction_group_view& table_s) const;
static api::timestamp_type
to_timestamp_type(time_window_compaction_strategy_options::timestamp_resolutions resolution, int64_t timestamp_from_sstable) {
@@ -110,9 +112,11 @@ private:
compaction_mode(const time_window_compaction_strategy_state&, const bucket_t& bucket, api::timestamp_type bucket_key, api::timestamp_type now, size_t min_threshold) const;
std::vector<sstables::shared_sstable>
get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> non_expiring_sstables, gc_clock::time_point compaction_time);
get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> non_expiring_sstables,
gc_clock::time_point compaction_time, time_window_compaction_strategy_state& state);
std::vector<sstables::shared_sstable> get_compaction_candidates(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> candidate_sstables);
std::vector<sstables::shared_sstable> get_compaction_candidates(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> candidate_sstables, time_window_compaction_strategy_state& state);
public:
// Find the lowest timestamp for window of given size
static api::timestamp_type
@@ -126,7 +130,7 @@ public:
std::vector<sstables::shared_sstable>
newest_bucket(compaction_group_view& table_s, strategy_control& control, std::map<api::timestamp_type, std::vector<sstables::shared_sstable>> buckets,
int min_threshold, int max_threshold, api::timestamp_type now);
int min_threshold, int max_threshold, api::timestamp_type now, time_window_compaction_strategy_state& state);
static std::vector<sstables::shared_sstable>
trim_to_threshold(std::vector<sstables::shared_sstable> bucket, int max_threshold);

View File

@@ -1078,7 +1078,6 @@ scylla_core = (['message/messaging_service.cc',
'utils/s3/client.cc',
'utils/s3/retryable_http_client.cc',
'utils/s3/retry_strategy.cc',
'utils/s3/s3_retry_strategy.cc',
'utils/s3/credentials_providers/aws_credentials_provider.cc',
'utils/s3/credentials_providers/environment_aws_credentials_provider.cc',
'utils/s3/credentials_providers/instance_profile_credentials_provider.cc',

View File

@@ -10,6 +10,8 @@
#include <seastar/core/coroutine.hh>
#include "create_index_statement.hh"
#include "db/config.hh"
#include "db/view/view.hh"
#include "exceptions/exceptions.hh"
#include "prepared_statement.hh"
#include "types/types.hh"
@@ -92,9 +94,13 @@ std::vector<::shared_ptr<index_target>> create_index_statement::validate_while_e
throw exceptions::invalid_request_exception(format("index names shouldn't be more than {:d} characters long (got \"{}\")", schema::NAME_LENGTH, _index_name.c_str()));
}
if (!db.features().views_with_tablets && db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
throw exceptions::invalid_request_exception(format("Secondary indexes are not supported on base tables with tablets (keyspace '{}')", keyspace()));
try {
db::view::validate_view_keyspace(db, keyspace());
} catch (const std::exception& e) {
// The type of the thrown exception is not specified, so we need to wrap it here.
throw exceptions::invalid_request_exception(e.what());
}
validate_for_local_index(*schema);
std::vector<::shared_ptr<index_target>> targets;

View File

@@ -113,8 +113,7 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chun
if (rs->uses_tablets()) {
warnings.push_back(
"Tables in this keyspace will be replicated using Tablets "
"and will not support Materialized Views, Secondary Indexes and counters features. "
"To use Materialized Views, Secondary Indexes or counters, drop this keyspace and re-create it "
"and will not support counters features. To use counters, drop this keyspace and re-create it "
"without tablets by adding AND TABLETS = {'enabled': false} to the CREATE KEYSPACE statement.");
if (ksm->initial_tablets().value()) {
warnings.push_back("Keyspace `initial` tablets option is deprecated. Use per-table tablet options instead.");

View File

@@ -152,9 +152,13 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(
schema_ptr schema = validation::validate_column_family(db, _base_name.get_keyspace(), _base_name.get_column_family());
if (!db.features().views_with_tablets && db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
throw exceptions::invalid_request_exception(format("Materialized views are not supported on base tables with tablets"));
try {
db::view::validate_view_keyspace(db, keyspace());
} catch (const std::exception& e) {
// The type of the thrown exception is not specified, so we need to wrap it here.
throw exceptions::invalid_request_exception(e.what());
}
if (schema->is_counter()) {
throw exceptions::invalid_request_exception(format("Materialized views are not supported on counter tables"));
}

View File

@@ -1756,7 +1756,7 @@ std::map<sstring, db::experimental_features_t::feature> db::experimental_feature
{"broadcast-tables", feature::BROADCAST_TABLES},
{"keyspace-storage-options", feature::KEYSPACE_STORAGE_OPTIONS},
{"tablets", feature::UNUSED},
{"views-with-tablets", feature::VIEWS_WITH_TABLETS}
{"views-with-tablets", feature::UNUSED}
};
}

View File

@@ -136,8 +136,7 @@ struct experimental_features_t {
UDF,
ALTERNATOR_STREAMS,
BROADCAST_TABLES,
KEYSPACE_STORAGE_OPTIONS,
VIEWS_WITH_TABLETS
KEYSPACE_STORAGE_OPTIONS
};
static std::map<sstring, feature> map(); // See enum_option.
static std::vector<enum_option<experimental_features_t>> all();

View File

@@ -26,6 +26,7 @@
#include <seastar/coroutine/maybe_yield.hh>
#include <flat_map>
#include "db/config.hh"
#include "db/view/base_info.hh"
#include "db/view/view_build_status.hh"
#include "db/view/view_consumer.hh"
@@ -3715,5 +3716,22 @@ sstring build_status_to_sstring(build_status status) {
on_internal_error(vlogger, fmt::format("Unknown view build status: {}", (int)status));
}
void validate_view_keyspace(const data_dictionary::database& db, std::string_view keyspace_name) {
const bool tablet_views_enabled = db.features().views_with_tablets;
// Note: if the configuration option `rf_rack_valid_keyspaces` is enabled, we can be
// sure that all tablet-based keyspaces are RF-rack-valid. We check that
// at start-up and then we don't allow for creating RF-rack-invalid keyspaces.
const bool rf_rack_valid_keyspaces = db.get_config().rf_rack_valid_keyspaces();
const bool required_config = tablet_views_enabled && rf_rack_valid_keyspaces;
const bool uses_tablets = db.find_keyspace(keyspace_name).get_replication_strategy().uses_tablets();
if (!required_config && uses_tablets) {
throw std::logic_error("Materialized views and secondary indexes are not supported on base tables with tablets. "
"To be able to use them, enable the configuration option `rf_rack_valid_keyspaces` and make sure "
"that the cluster feature `VIEWS_WITH_TABLETS` is enabled.");
}
}
} // namespace view
} // namespace db

View File

@@ -309,6 +309,18 @@ endpoints_to_update get_view_natural_endpoint(
bool use_tablets_basic_rack_aware_view_pairing,
replica::cf_stats& cf_stats);
/// Verify that the provided keyspace is eligible for storing materialized views.
///
/// Result:
/// * If the keyspace is eligible, no effect.
/// * If the keyspace is not eligible, an exception is thrown. Its type is not specified,
/// and the user of this function cannot make any assumption about it. The carried exception
/// message will be worded in a way that can be directly passed on to the end user.
///
/// Preconditions:
/// * The provided `keyspace_name` must correspond to an existing keyspace.
void validate_view_keyspace(const data_dictionary::database&, std::string_view keyspace_name);
}
}

View File

@@ -127,8 +127,9 @@ view_building_worker::view_building_worker(replica::database& db, db::system_key
init_messaging_service();
}
void view_building_worker::start_background_fibers() {
future<> view_building_worker::init() {
SCYLLA_ASSERT(this_shard_id() == 0);
co_await discover_existing_staging_sstables();
_staging_sstables_registrator = run_staging_sstables_registrator();
_view_building_state_observer = run_view_building_state_observer();
_mnotifier.register_listener(this);
@@ -195,8 +196,6 @@ future<> view_building_worker::register_staging_sstable_tasks(std::vector<sstabl
}
future<> view_building_worker::run_staging_sstables_registrator() {
co_await discover_existing_staging_sstables();
while (!_as.abort_requested()) {
try {
auto lock = co_await get_units(_staging_sstables_mutex, 1, _as);
@@ -310,7 +309,10 @@ std::unordered_map<table_id, std::vector<view_building_worker::staging_sstable_t
return;
}
auto& tablet_map = _db.get_token_metadata().tablets().get_tablet_map(table_id);
// scylladb/scylladb#26403: Make sure to access the tablets map via the effective replication map of the table object.
// The token metadata object pointed to by the database (`_db.get_token_metadata()`) may not contain
// the tablets map of the currently processed table yet. After #24414 is fixed, this should not matter anymore.
auto& tablet_map = table->get_effective_replication_map()->get_token_metadata().tablets().get_tablet_map(table_id);
auto sstables = table->get_sstables();
for (auto sstable: *sstables) {
if (!sstable->requires_view_building()) {
@@ -340,6 +342,7 @@ future<> view_building_worker::run_view_building_state_observer() {
while (!_as.abort_requested()) {
bool sleep = false;
_state.some_batch_finished = false;
try {
vbw_logger.trace("view_building_state_observer() iteration");
auto read_apply_mutex_holder = co_await _group0_client.hold_read_apply_mutex(_as);
@@ -349,7 +352,12 @@ future<> view_building_worker::run_view_building_state_observer() {
_as.check();
read_apply_mutex_holder.return_all();
co_await _vb_state_machine.event.wait();
// A batch could have finished its work while the worker was
// updating the state. In that case we should do another iteration.
if (!_state.some_batch_finished) {
co_await _vb_state_machine.event.wait();
}
} catch (abort_requested_exception&) {
} catch (broken_condition_variable&) {
} catch (...) {
@@ -657,6 +665,7 @@ future<> view_building_worker::local_state::clear_state() {
finished_tasks.clear();
aborted_tasks.clear();
state_updated_cv.broadcast();
some_batch_finished = false;
vbw_logger.debug("View building worker state was cleared.");
}
@@ -676,6 +685,7 @@ void view_building_worker::batch::start() {
return do_work();
}).finally([this] () {
state = batch_state::finished;
_vbw.local()._state.some_batch_finished = true;
_vbw.local()._vb_state_machine.event.broadcast();
});
}

View File

@@ -111,6 +111,7 @@ class view_building_worker : public seastar::peering_sharded_service<view_buildi
std::unordered_set<utils::UUID> finished_tasks;
std::unordered_set<utils::UUID> aborted_tasks;
bool some_batch_finished = false;
condition_variable state_updated_cv;
// Clears completed/aborted tasks and creates batches (without starting them) for started tasks.
@@ -166,7 +167,7 @@ public:
view_building_worker(replica::database& db, db::system_keyspace& sys_ks, service::migration_notifier& mnotifier,
service::raft_group0_client& group0_client, view_update_generator& vug, netw::messaging_service& ms,
view_building_state_machine& vbsm);
void start_background_fibers();
future<> init();
future<> register_staging_sstable_tasks(std::vector<sstables::shared_sstable> ssts, table_id table_id);

View File

@@ -204,7 +204,7 @@ ring_position_range_sharder::next(const schema& s) {
return ring_position_range_and_shard{std::move(_range), shard};
}
ring_position_range_vector_sharder::ring_position_range_vector_sharder(const sharder& sharder, dht::partition_range_vector ranges)
ring_position_range_vector_sharder::ring_position_range_vector_sharder(const sharder& sharder, utils::chunked_vector<dht::partition_range> ranges)
: _ranges(std::move(ranges))
, _sharder(sharder)
, _current_range(_ranges.begin()) {

View File

@@ -11,6 +11,7 @@
#include "dht/ring_position.hh"
#include "dht/token-sharding.hh"
#include "utils/interval.hh"
#include "utils/chunked_vector.hh"
#include <vector>
@@ -89,7 +90,7 @@ struct ring_position_range_and_shard_and_element : ring_position_range_and_shard
//
// During migration uses a view on shard routing for reads.
class ring_position_range_vector_sharder {
using vec_type = dht::partition_range_vector;
using vec_type = utils::chunked_vector<dht::partition_range>;
vec_type _ranges;
const sharder& _sharder;
vec_type::iterator _current_range;
@@ -104,7 +105,7 @@ public:
// Initializes the `ring_position_range_vector_sharder` with the ranges to be processed.
// Input ranges should be non-overlapping (although nothing bad will happen if they do
// overlap).
ring_position_range_vector_sharder(const sharder& sharder, dht::partition_range_vector ranges);
ring_position_range_vector_sharder(const sharder& sharder, utils::chunked_vector<dht::partition_range> ranges);
// Fetches the next range-shard mapping. When the input range is exhausted, std::nullopt is
// returned. Within an input range, results are contiguous and non-overlapping (but since input
// ranges usually are discontiguous, overall the results are not contiguous). Together, the results

View File

@@ -131,6 +131,28 @@ def configure_iotune_open_fd_limit(shards_count):
logging.error(f"Required FDs count: {precalculated_fds_count}, default limit: {fd_limits}!")
sys.exit(1)
def force_random_request_size_of_4k():
"""
It is a known bug that on i4i, i7i, i8g, i8ge instances, the disk controller reports the wrong
physical sector size as 512 bytes, while the actual physical sector size is 4096 bytes. This function
helps us work around that issue until AWS manages to get a fix for it. It returns 4096 if it
detects it's running on one of the affected instance types; otherwise it returns None and IOTune
will use the physical sector size reported by the disk.
"""
path="/sys/devices/virtual/dmi/id/product_name"
try:
with open(path, "r") as f:
instance_type = f.read().strip()
except FileNotFoundError:
logging.warning(f"Couldn't find {path}. Falling back to IOTune using the physical sector size reported by disk.")
return
prefixes = ["i7i", "i4i", "i8g", "i8ge"]
if any(instance_type.startswith(p) for p in prefixes):
return 4096
def run_iotune():
if "SCYLLA_CONF" in os.environ:
conf_dir = os.environ["SCYLLA_CONF"]
@@ -173,6 +195,8 @@ def run_iotune():
configure_iotune_open_fd_limit(cpudata.nr_shards())
if (reqsize := force_random_request_size_of_4k()):
iotune_args += ["--random-write-io-buffer-size", f"{reqsize}"]
try:
subprocess.check_call([bindir() + "/iotune",
"--format", "envfile",

View File

@@ -17,6 +17,7 @@ import stat
import logging
import pyudev
import psutil
import platform
from pathlib import Path
from scylla_util import *
from subprocess import run, SubprocessError
@@ -102,6 +103,21 @@ def is_selinux_enabled():
return True
return False
def is_kernel_version_at_least(major, minor):
"""Check if the Linux kernel version is at least major.minor"""
try:
kernel_version = platform.release()
# Extract major.minor from version string like "5.15.0-56-generic"
version_parts = kernel_version.split('.')
if len(version_parts) >= 2:
kernel_major = int(version_parts[0])
kernel_minor = int(version_parts[1])
return (kernel_major, kernel_minor) >= (major, minor)
except (ValueError, IndexError):
# If we can't parse the version, assume older kernel for safety
pass
return False
if __name__ == '__main__':
if os.getuid() > 0:
print('Requires root permission.')
@@ -231,8 +247,17 @@ if __name__ == '__main__':
# see https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/mkfs/xfs_mkfs.c .
# and it also cannot be smaller than the sector size.
block_size = max(1024, sector_size)
run('udevadm settle', shell=True, check=True)
run(f'mkfs.xfs -b size={block_size} {fsdev} -K -m rmapbt=0 -m reflink=0', shell=True, check=True)
# On Linux 5.12+, sub-block overwrites are supported well, so keep the default block
# size, which will play better with the SSD.
if is_kernel_version_at_least(5, 12):
block_size_opt = ""
else:
block_size_opt = f"-b size={block_size}"
run(f'mkfs.xfs {block_size_opt} {fsdev} -K -m rmapbt=0 -m reflink=0', shell=True, check=True)
run('udevadm settle', shell=True, check=True)
if is_debian_variant():

View File

@@ -202,12 +202,9 @@ enabled. If you plan to use any of the features listed below, CREATE your keyspa
:ref:`with tablets disabled <tablets-enable-tablets>`.
* Counters
* Materialized Views (MV) ``*``
* Secondary indexes (SI, as it depends on MV) ``*``
``*`` You can enable experimental support for MV and SI using
the ``--experimental-features=views-with-tablets`` configuration option.
See :ref:`Views with tablets <admin-views-with-tablets>` for details.
To enable materialized views and secondary indexes for tablet keyspaces, use
the ``--rf-rack-valid-keyspaces`` option. See :ref:`Views with tablets <admin-views-with-tablets>` for details.
Resharding in keyspaces with tablets enabled has the following limitations:

View File

@@ -341,17 +341,13 @@ credentials and endpoint.
Views with Tablets
------------------
By default, Materialized Views (MV) and Secondary Indexes (SI)
are disabled in keyspaces that use tablets.
Support for MV and SI with tablets is experimental and must be explicitly
enabled in the ``scylla.yaml`` configuration file by specifying
the ``views-with-tablets`` option:
Materialized Views (MV) and Secondary Indexes (SI) are enabled in keyspaces that use tablets
only when :term:`RF-rack-valid keyspaces <RF-rack-valid keyspace>` are enforced. That can be
done in the ``scylla.yaml`` configuration file by specifying
.. code-block:: yaml
experimental_features:
- views-with-tablets
rf_rack_valid_keyspaces: true
Monitoring

View File

@@ -53,7 +53,7 @@ ScyllaDB nodetool cluster repair command supports the following options:
nodetool cluster repair --tablet-tokens 1,10474535988
- ``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'regular', or 'full'. 'regular': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to regular.
- ``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental.
For example:

View File

@@ -38,14 +38,14 @@ Manual Dictionary Training
You can manually trigger dictionary training using the REST API::
curl -X POST "http://node-address:10000/storage_service/retrain_dict?keyspace=mykeyspace&table=mytable"
curl -X POST "http://node-address:10000/storage_service/retrain_dict?keyspace=mykeyspace&cf=mytable"
Estimating Compression Ratios
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
To choose the best compression configuration, you can estimate compression ratios using the REST API::
curl -X GET "http://node-address:10000/storage_service/estimate_compression_ratios?keyspace=mykeyspace&table=mytable"
curl -X GET "http://node-address:10000/storage_service/estimate_compression_ratios?keyspace=mykeyspace&cf=mytable"
This will return a report with estimated compression ratios for various combinations of compression
parameters (algorithm, chunk size, zstd level, dictionary).
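
The same two calls can also be made from Python with the `requests` library; the node address, keyspace and table names are placeholders, and only the endpoints and the query parameter names (`keyspace`, `cf`) are taken from the commands above.

```python
import requests

# Placeholders: adjust the node address, keyspace and table to your cluster.
API = "http://node-address:10000"
PARAMS = {"keyspace": "mykeyspace", "cf": "mytable"}  # note: the parameter is "cf", not "table"

# Trigger dictionary training for the table.
requests.post(f"{API}/storage_service/retrain_dict", params=PARAMS).raise_for_status()

# Fetch the estimated compression ratios report.
report = requests.get(f"{API}/storage_service/estimate_compression_ratios", params=PARAMS)
report.raise_for_status()
print(report.text)
```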

View File

@@ -76,7 +76,7 @@ struct repair_row_level_start_response {
namespace locator {
enum class tablet_repair_incremental_mode : uint8_t {
regular,
incremental,
full,
disabled,
};

View File

@@ -99,9 +99,6 @@ std::set<sstring> get_disabled_features_from_db_config(const db::config& cfg, st
if (!cfg.check_experimental(db::experimental_features_t::feature::KEYSPACE_STORAGE_OPTIONS)) {
disabled.insert("KEYSPACE_STORAGE_OPTIONS"s);
}
if (!cfg.check_experimental(db::experimental_features_t::feature::VIEWS_WITH_TABLETS)) {
disabled.insert("VIEWS_WITH_TABLETS"s);
}
if (cfg.force_gossip_topology_changes()) {
if (cfg.enable_tablets_by_default()) {
throw std::runtime_error("Tablets cannot be enabled with gossip topology changes. Use either --tablets-mode-for-new-keyspaces=enabled|enforced or --force-gossip-topology-changes, but not both.");

View File

@@ -754,7 +754,7 @@ tablet_task_type tablet_task_type_from_string(const sstring& name) {
// The names are persisted in system tables so should not be changed.
static const std::unordered_map<locator::tablet_repair_incremental_mode, sstring> tablet_repair_incremental_mode_to_name = {
{locator::tablet_repair_incremental_mode::disabled, "disabled"},
{locator::tablet_repair_incremental_mode::regular, "regular"},
{locator::tablet_repair_incremental_mode::incremental, "incremental"},
{locator::tablet_repair_incremental_mode::full, "full"},
};

View File

@@ -162,11 +162,11 @@ sstring tablet_task_type_to_string(tablet_task_type);
tablet_task_type tablet_task_type_from_string(const sstring&);
// - regular (regular incremental repair): The incremental repair logic is enabled.
// - incremental (incremental repair): The incremental repair logic is enabled.
// Unrepaired sstables will be included for repair. Repaired sstables will be
// skipped. The incremental repair states will be updated after repair.
// - full (full incremental repair): The incremental repair logic is enabled.
// - full (full repair): The incremental repair logic is enabled.
// Both repaired and unrepaired sstables will be included for repair. The
// incremental repair states will be updated after repair.
@@ -175,12 +175,12 @@ tablet_task_type tablet_task_type_from_string(const sstring&);
// sstables_repaired_at in system.tablets table, will not be updated after
// repair.
enum class tablet_repair_incremental_mode : uint8_t {
regular,
incremental,
full,
disabled,
};
constexpr tablet_repair_incremental_mode default_tablet_repair_incremental_mode{tablet_repair_incremental_mode::regular};
constexpr tablet_repair_incremental_mode default_tablet_repair_incremental_mode{tablet_repair_incremental_mode::incremental};
sstring tablet_repair_incremental_mode_to_string(tablet_repair_incremental_mode);
tablet_repair_incremental_mode tablet_repair_incremental_mode_from_string(const sstring&);

View File

@@ -2128,7 +2128,7 @@ sharded<locator::shared_token_metadata> token_metadata;
});
checkpoint(stop_signal, "starting sstables loader");
sst_loader.start(std::ref(db), std::ref(messaging), std::ref(view_builder), std::ref(view_building_worker), std::ref(task_manager), std::ref(sstm), maintenance_scheduling_group).get();
sst_loader.start(std::ref(db), std::ref(ss), std::ref(messaging), std::ref(view_builder), std::ref(view_building_worker), std::ref(task_manager), std::ref(sstm), maintenance_scheduling_group).get();
auto stop_sst_loader = defer_verbose_shutdown("sstables loader", [&sst_loader] {
sst_loader.stop().get();
});
@@ -2208,6 +2208,11 @@ sharded<locator::shared_token_metadata> token_metadata;
startlog.info("Verifying that all of the keyspaces are RF-rack-valid");
db.local().check_rf_rack_validity(cfg->rf_rack_valid_keyspaces(), token_metadata.local().get());
// Materialized views and secondary indexes are still restricted and require specific configuration
// options to work. Make sure that if there are existing views or indexes, they don't violate
// the requirements imposed on them.
db.local().validate_tablet_views_indexes();
// Semantic validation of sstable compression parameters from config.
// Adding here (i.e., after `join_cluster`) to ensure that the
// required SSTABLE_COMPRESSION_DICTS cluster feature has been negotiated.
@@ -2426,7 +2431,7 @@ sharded<locator::shared_token_metadata> token_metadata;
checkpoint(stop_signal, "starting view building worker's background fibers");
with_scheduling_group(maintenance_scheduling_group, [&] {
view_building_worker.local().start_background_fibers();
return view_building_worker.local().init();
}).get();
auto drain_view_buiding_worker = defer_verbose_shutdown("draining view building worker", [&] {
view_building_worker.invoke_on_all(&db::view::view_building_worker::drain).get();

View File

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:103bd12a1f0feb60d814da074b81ebafaa13059d1267ee3612c48a8bc96798b6
size 6242980
oid sha256:5e35a15a32060d47846c2a5ab29373639e651ac112cc0785306789b8273c63dc
size 6299456

View File

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:2cb637e741a2b9badc96f3f175f15db257b9273ea43040289de7d72657b5505a
size 6240824
oid sha256:71a3e8a3a0e68d35c2e14b553a81e1bc55f6adb73a1988e17e4326923020db2c
size 6316028

View File

@@ -15,7 +15,6 @@
#include <seastar/core/format.hh>
static const char scylla_product_str[] = SCYLLA_PRODUCT;
static const char scylla_version_str[] = SCYLLA_VERSION;
static const char scylla_release_str[] = SCYLLA_RELEASE;
static const char scylla_build_mode_str[] = SCYLLA_BUILD_MODE_STR;
@@ -31,12 +30,9 @@ std::string scylla_build_mode()
}
std::string doc_link(std::string_view url_tail) {
const std::string_view product = scylla_product_str;
const std::string_view version = scylla_version_str;
const auto prefix = product == "scylla-enterprise" ? "enterprise" : "opensource";
std::string branch = product == "scylla-enterprise" ? "enterprise" : "master";
std::string branch = "master";
if (!version.ends_with("~dev")) {
std::vector<std::string> components;
boost::split(components, version, boost::algorithm::is_any_of("."));
@@ -45,7 +41,7 @@ std::string doc_link(std::string_view url_tail) {
branch = fmt::format("branch-{}.{}", components[0], components[1]);
}
return fmt::format("https://{}.docs.scylladb.com/{}/{}", prefix, branch, url_tail);
return fmt::format("https://docs.scylladb.com/manual/{}/{}", branch, url_tail);
}
// get the version number into writeable memory, so we can grep for it if we get a core dump

View File

@@ -420,7 +420,7 @@ future<std::tuple<bool, gc_clock::time_point>> repair_service::flush_hints(repai
}
if (!nodes_down.empty()) {
rlogger.warn("repair[{}]: Skipped sending repair_flush_hints_batchlog due to nodes_down={}, continue to run repair",
nodes_down, uuid);
uuid, nodes_down);
co_return std::make_tuple(hints_batchlog_flushed, flush_time);
}
co_await parallel_for_each(waiting_nodes, [this, uuid, start_time, &times, &req] (locator::host_id node) -> future<> {

View File

@@ -12,6 +12,7 @@
#include <fmt/ranges.h>
#include <fmt/std.h>
#include <seastar/core/rwlock.hh>
#include "db/view/view.hh"
#include "locator/network_topology_strategy.hh"
#include "locator/tablets.hh"
#include "locator/token_metadata_fwd.hh"
@@ -86,6 +87,7 @@
#include "tracing/trace_keyspace_helper.hh"
#include <algorithm>
#include <flat_set>
using namespace std::chrono_literals;
using namespace db;
@@ -3483,6 +3485,37 @@ void database::check_rf_rack_validity(const bool enforce_rf_rack_valid_keyspaces
}
}
void database::validate_tablet_views_indexes() const {
dblog.info("Verifying that all existing materialized views are valid");
const data_dictionary::database& db = this->as_data_dictionary();
std::flat_set<std::string_view> invalid_keyspaces;
for (const view_ptr& view : get_views()) {
const auto& ks = view->ks_name();
try {
db::view::validate_view_keyspace(db, ks);
} catch (...) {
invalid_keyspaces.emplace(ks);
}
}
if (invalid_keyspaces.empty()) {
dblog.info("All existing materialized views are valid");
return;
}
// `std::flat_set` guarantees iteration in the increasing order.
const std::string ks_list = invalid_keyspaces
| std::views::join_with(std::string_view(", "))
| std::ranges::to<std::string>();
dblog.warn("Some of the existing keyspaces violate the requirements "
"for using materialized views or secondary indexes. Those features require enabling "
"the configuration option `rf_rack_valid_keyspaces` and the cluster feature "
"`VIEWS_WITH_TABLETS`. The keyspaces that violate that condition: {}", ks_list);
}
utils::chunked_vector<uint64_t> compute_random_sorted_ints(uint64_t max_value, uint64_t n_values) {
static thread_local std::minstd_rand rng{std::random_device{}()};
std::uniform_int_distribution<uint64_t> dist(0, max_value);

View File

@@ -2091,6 +2091,20 @@ public:
// * the `locator::topology` instance corresponding to the passed `locator::token_metadata_ptr`
// must contain a complete list of racks and data centers in the cluster.
void check_rf_rack_validity(const bool enforce_rf_rack_valid_keyspaces, const locator::token_metadata_ptr) const;
/// Verify that all existing materialized views are valid.
///
/// We consider a materialized view valid if one of the following
/// conditions is satisfied:
/// * it resides in a vnode-based keyspace,
/// * it resides in a tablet-based keyspace, the cluster feature `VIEWS_WITH_TABLETS`
/// is enabled, and the configuration option `rf_rack_valid_keyspaces` is enabled.
///
/// Result:
/// * Depending on whether there are invalid materialized views, the function will
/// log that either everything's OK, or that there are some keyspaces that violate
/// the requirement.
void validate_tablet_views_indexes() const;
private:
// SSTable sampling might require considerable amounts of memory,
// so we want to limit the number of concurrent sampling operations.

Submodule seastar updated: c8a3515f9b...60e4b3b921

View File

@@ -319,7 +319,7 @@ future<> service_level_controller::update_service_levels_cache(qos::query_contex
});
}
future<> service_level_controller::auth_integration::reload_cache() {
future<> service_level_controller::auth_integration::reload_cache(qos::query_context ctx) {
SCYLLA_ASSERT(this_shard_id() == global_controller);
const auto _ = _stop_gate.hold();
@@ -336,11 +336,12 @@ future<> service_level_controller::auth_integration::reload_cache() {
}
auto units = co_await get_units(_sl_controller._global_controller_db->notifications_serializer, 1);
auto& qs = qos_query_state(ctx);
auto& role_manager = _auth_service.underlying_role_manager();
const auto all_roles = co_await role_manager.query_all();
const auto hierarchy = co_await role_manager.query_all_directly_granted();
const auto all_roles = co_await role_manager.query_all(qs);
const auto hierarchy = co_await role_manager.query_all_directly_granted(qs);
// includes only roles with attached service level
const auto attributes = co_await role_manager.query_attribute_for_all("service_level");
const auto attributes = co_await role_manager.query_attribute_for_all("service_level", qs);
std::map<sstring, service_level_options> effective_sl_map;
@@ -403,7 +404,7 @@ future<> service_level_controller::update_cache(update_both_cache_levels update_
}
if (_auth_integration) {
co_await _auth_integration->reload_cache();
co_await _auth_integration->reload_cache(ctx);
}
}

View File

@@ -173,7 +173,7 @@ public:
future<std::vector<cql3::description>> describe_attached_service_levels();
/// Must be executed on shard 0.
future<> reload_cache();
future<> reload_cache(qos::query_context ctx);
void clear_cache();
};

View File

@@ -497,7 +497,15 @@ future<> group0_voter_handler::update_nodes(
};
// Helper for adding a single node to the nodes list
auto add_node = [&nodes, &group0_config, &leader_id](const raft::server_id& id, const replica_state& rs, bool is_alive) {
auto add_node = [this, &nodes, &group0_config, &leader_id](const raft::server_id& id, const replica_state& rs, bool is_alive) {
// Some topology members may not belong to the new group 0 in the Raft-based recovery procedure.
if (!group0_config.contains(id)) {
if (!_gossiper.get_recovery_leader()) {
rvlogger.warn("node {} in state {} is not a part of the group 0 configuration {}, ignoring",
id, rs.state, group0_config);
}
return;
}
const auto is_voter = group0_config.can_vote(id);
const auto is_leader = (id == leader_id);
nodes.emplace(id, group0_voter_calculator::node_descriptor{

View File

@@ -745,7 +745,10 @@ future<> raft_group0::setup_group0_if_exist(db::system_keyspace& sys_ks, service
} else {
// We'll disable them once we complete the upgrade procedure.
}
} else if (!qp.db().get_config().recovery_leader.is_set()) {
} else if (qp.db().get_config().recovery_leader.is_set()) {
group0_log.info("Disabling migration_manager schema pulls in the Raft-based recovery procedure");
co_await mm.disable_schema_pulls();
} else {
// Scylla has bootstrapped earlier but group 0 ID is not present and we are not recovering from majority loss
// using the Raft-based procedure. This means we're upgrading.
// Upgrade will start through a feature listener created after we enter NORMAL state.

View File

@@ -6693,11 +6693,14 @@ future<std::unordered_map<sstring, sstring>> storage_service::add_repair_tablet_
// repair can only be requested for the base table, and this will repair the base table's tablets
// and all its colocated tablets as well.
if (!get_token_metadata().tablets().is_base_table(table)) {
auto table_schema = _db.local().find_schema(table);
auto base_schema = _db.local().find_schema(get_token_metadata().tablets().get_base_table(table));
throw std::invalid_argument(::format(
"Cannot set repair request on table {} because it is colocated with the base table {}. "
"Cannot set repair request on table '{}'.'{}' because it is colocated with the base table '{}'.'{}'. "
"Repair requests can be made only on the base table. "
"Repairing the base table will also repair all tables colocated with it.",
table, get_token_metadata().tablets().get_base_table(table)));
table_schema->ks_name(), table_schema->cf_name(), base_schema->ks_name(), base_schema->cf_name()));
}
auto& tmap = get_token_metadata().tablets().get_tablet_map(table);
@@ -6777,10 +6780,13 @@ future<> storage_service::del_repair_tablet_request(table_id table, locator::tab
// see add_repair_tablet_request. repair requests can only be added on base tables.
if (!get_token_metadata().tablets().is_base_table(table)) {
auto table_schema = _db.local().find_schema(table);
auto base_schema = _db.local().find_schema(get_token_metadata().tablets().get_base_table(table));
throw std::invalid_argument(::format(
"Cannot delete repair request on table {} because it is colocated with the base table {}. "
"Cannot delete repair request on table '{}'.'{}' because it is colocated with the base table '{}'.'{}'. "
"Repair requests can be added and deleted only on the base table.",
table, get_token_metadata().tablets().get_base_table(table)));
table_schema->ks_name(), table_schema->cf_name(), base_schema->ks_name(), base_schema->cf_name()));
}
auto& tmap = get_token_metadata().tablets().get_tablet_map(table);
@@ -7163,6 +7169,20 @@ future<> storage_service::await_topology_quiesced() {
co_await _topology_state_machine.await_not_busy();
}
future<bool> storage_service::verify_topology_quiesced(token_metadata::version_t expected_version) {
auto holder = _async_gate.hold();
if (this_shard_id() != 0) {
// group0 is only set on shard 0.
co_return co_await container().invoke_on(0, [&] (auto& ss) {
return ss.verify_topology_quiesced(expected_version);
});
}
co_await _group0->group0_server().read_barrier(&_group0_as);
co_return _topology_state_machine._topology.version == expected_version && !_topology_state_machine._topology.is_busy();
}
future<join_node_request_result> storage_service::join_node_request_handler(join_node_request_params params) {
join_node_request_result result;
rtlogger.info("received request to join from host_id: {}", params.host_id);

View File

@@ -1002,7 +1002,11 @@ public:
future<> add_tablet_replica(table_id, dht::token, locator::tablet_replica dst, loosen_constraints force = loosen_constraints::no);
future<> del_tablet_replica(table_id, dht::token, locator::tablet_replica dst, loosen_constraints force = loosen_constraints::no);
future<> set_tablet_balancing_enabled(bool);
future<> await_topology_quiesced();
// Verifies topology is not busy, and also that topology version hasn't changed since the one provided
// by the caller.
future<bool> verify_topology_quiesced(token_metadata::version_t expected_version);
// In the maintenance mode, other nodes won't be available thus we disabled joining
// the token ring and the token metadata won't be populated with the local node's endpoint.

View File

@@ -1854,6 +1854,8 @@ class topology_coordinator : public endpoint_lifecycle_subscriber {
// of token metadata will complete before we update topology.
auto guard = co_await global_tablet_token_metadata_barrier(std::move(g));
co_await utils::get_local_injector().inject("tablet_resize_finalization_post_barrier", utils::wait_for_message(std::chrono::minutes(2)));
auto tm = get_token_metadata_ptr();
auto plan = co_await _tablet_allocator.balance_tablets(tm, {}, get_dead_nodes());

View File

@@ -3280,7 +3280,7 @@ future<uint64_t> sstable::estimated_keys_for_range(const dht::token_range& range
std::vector<unsigned>
sstable::compute_shards_for_this_sstable(const dht::sharder& sharder_) const {
std::unordered_set<unsigned> shards;
dht::partition_range_vector token_ranges;
utils::chunked_vector<dht::partition_range> token_ranges;
const auto* sm = _components->scylla_metadata
? _components->scylla_metadata->data.get<scylla_metadata_type::Sharding, sharding_metadata>()
: nullptr;
@@ -3298,7 +3298,7 @@ sstable::compute_shards_for_this_sstable(const dht::sharder& sharder_) const {
};
token_ranges = sm->token_ranges.elements
| std::views::transform(disk_token_range_to_ring_position_range)
| std::ranges::to<dht::partition_range_vector>();
| std::ranges::to<utils::chunked_vector<dht::partition_range>>();
}
sstlog.trace("{}: token_ranges={}", get_filename(), token_ranges);
auto sharder = dht::ring_position_range_vector_sharder(sharder_, std::move(token_ranges));
@@ -3642,7 +3642,7 @@ std::unique_ptr<abstract_index_reader> sstable::make_index_reader(
auto cached_partitions_file = caching == use_caching::yes
? _cached_partitions_file
: seastar::make_shared<cached_file>(
_partitions_file,
uncached_partitions_file(),
_manager.get_cache_tracker().get_index_cached_file_stats(),
_manager.get_cache_tracker().get_lru(),
_manager.get_cache_tracker().region(),
@@ -3652,7 +3652,7 @@ std::unique_ptr<abstract_index_reader> sstable::make_index_reader(
auto cached_rows_file = caching == use_caching::yes
? _cached_rows_file
: seastar::make_shared<cached_file>(
_rows_file,
uncached_rows_file(),
_manager.get_cache_tracker().get_index_cached_file_stats(),
_manager.get_cache_tracker().get_lru(),
_manager.get_cache_tracker().region(),

View File

@@ -27,6 +27,7 @@
#include "readers/mutation_fragment_v1_stream.hh"
#include "locator/abstract_replication_strategy.hh"
#include "message/messaging_service.hh"
#include "service/storage_service.hh"
#include <cfloat>
#include <algorithm>
@@ -142,11 +143,12 @@ protected:
const unlink_sstables _unlink_sstables;
const stream_scope _stream_scope;
public:
sstable_streamer(netw::messaging_service& ms, replica::database& db, ::table_id table_id, std::vector<sstables::shared_sstable> sstables, primary_replica_only primary, unlink_sstables unlink, stream_scope scope)
sstable_streamer(netw::messaging_service& ms, replica::database& db, ::table_id table_id, locator::effective_replication_map_ptr erm,
std::vector<sstables::shared_sstable> sstables, primary_replica_only primary, unlink_sstables unlink, stream_scope scope)
: _ms(ms)
, _db(db)
, _table(db.find_column_family(table_id))
, _erm(_table.get_effective_replication_map())
, _erm(std::move(erm))
, _sstables(std::move(sstables))
, _primary_replica_only(primary)
, _unlink_sstables(unlink)
@@ -181,8 +183,9 @@ private:
class tablet_sstable_streamer : public sstable_streamer {
const locator::tablet_map& _tablet_map;
public:
tablet_sstable_streamer(netw::messaging_service& ms, replica::database& db, ::table_id table_id, std::vector<sstables::shared_sstable> sstables, primary_replica_only primary, unlink_sstables unlink, stream_scope scope)
: sstable_streamer(ms, db, table_id, std::move(sstables), primary, unlink, scope)
tablet_sstable_streamer(netw::messaging_service& ms, replica::database& db, ::table_id table_id, locator::effective_replication_map_ptr erm,
std::vector<sstables::shared_sstable> sstables, primary_replica_only primary, unlink_sstables unlink, stream_scope scope)
: sstable_streamer(ms, db, table_id, std::move(erm), std::move(sstables), primary, unlink, scope)
, _tablet_map(_erm->get_token_metadata().tablets().get_tablet_map(table_id)) {
}
@@ -526,13 +529,42 @@ static std::unique_ptr<sstable_streamer> make_sstable_streamer(bool uses_tablets
return std::make_unique<sstable_streamer>(std::forward<Args>(args)...);
}
future<locator::effective_replication_map_ptr> sstables_loader::await_topology_quiesced_and_get_erm(::table_id table_id) {
// By waiting for topology to quiesce, we guarantee load-and-stream will not start in the middle
// of a topology operation that changes the token range boundaries, e.g. split or merge.
// Split, for example, first executes the barrier and then splits the tablets.
// So it can happen that an sstable is generated between those steps and incorrectly spans two
// tablets. We want to serialize load-and-stream and split finalization (a topology op).
locator::effective_replication_map_ptr erm;
while (true) {
auto& t = _db.local().find_column_family(table_id);
erm = t.get_effective_replication_map();
auto expected_topology_version = erm->get_token_metadata().get_version();
auto& ss = _ss.local();
// Optimistically attempt to grab an erm on a quiesced topology.
// The awaiting is only needed with tablets over raft, so we bypass the check
// when raft is disabled.
if (!ss.raft_topology_change_enabled() || co_await ss.verify_topology_quiesced(expected_topology_version)) {
break;
}
erm = nullptr;
co_await _ss.local().await_topology_quiesced();
}
co_return std::move(erm);
}
future<> sstables_loader::load_and_stream(sstring ks_name, sstring cf_name,
::table_id table_id, std::vector<sstables::shared_sstable> sstables, bool primary, bool unlink, stream_scope scope,
shared_ptr<stream_progress> progress) {
// streamer guarantees topology stability, for correctness, by holding effective_replication_map
// throughout its lifetime.
auto erm = co_await await_topology_quiesced_and_get_erm(table_id);
auto streamer = make_sstable_streamer(_db.local().find_column_family(table_id).uses_tablets(),
_messaging, _db.local(), table_id, std::move(sstables),
_messaging, _db.local(), table_id, std::move(erm), std::move(sstables),
primary_replica_only(primary), unlink_sstables(unlink), scope);
co_await streamer->stream(progress);
@@ -749,6 +781,7 @@ future<> sstables_loader::download_task_impl::run() {
}
sstables_loader::sstables_loader(sharded<replica::database>& db,
sharded<service::storage_service>& ss,
netw::messaging_service& messaging,
sharded<db::view::view_builder>& vb,
sharded<db::view::view_building_worker>& vbw,
@@ -756,6 +789,7 @@ sstables_loader::sstables_loader(sharded<replica::database>& db,
sstables::storage_manager& sstm,
seastar::scheduling_group sg)
: _db(db)
, _ss(ss)
, _messaging(messaging)
, _view_builder(vb)
, _view_building_worker(vbw)

View File

@@ -29,6 +29,12 @@ class view_builder;
class view_building_worker;
}
}
namespace service {
class storage_service;
}
namespace locator {
class effective_replication_map;
}
struct stream_progress {
float total = 0.;
@@ -66,6 +72,7 @@ public:
private:
sharded<replica::database>& _db;
sharded<service::storage_service>& _ss;
netw::messaging_service& _messaging;
sharded<db::view::view_builder>& _view_builder;
sharded<db::view::view_building_worker>& _view_building_worker;
@@ -86,8 +93,10 @@ private:
bool primary_replica_only, bool unlink_sstables, stream_scope scope,
shared_ptr<stream_progress> progress);
future<seastar::shared_ptr<const locator::effective_replication_map>> await_topology_quiesced_and_get_erm(table_id table_id);
public:
sstables_loader(sharded<replica::database>& db,
sharded<service::storage_service>& ss,
netw::messaging_service& messaging,
sharded<db::view::view_builder>& vb,
sharded<db::view::view_building_worker>& vbw,

View File

@@ -335,6 +335,85 @@ def test_simple_batch_get_items(test_table_sb):
assert response['ConsumedCapacity'][0]['TableName'] == test_table_sb.name
assert 2 == response['ConsumedCapacity'][0]['CapacityUnits']
# This test reproduces a bug where the consumed capacity was divided by 16 MB,
# instead of 4 KB. The general formula for RCU per item is the same as for
# GetItem, namely:
#
# CEIL(ItemSizeInBytes / 4096) * (1 if strong consistency, 0.5 if eventual
# consistency)
#
# The RCU is calculated for each item individually, and the results are summed
# for the total cost of the BatchGetItem. In this case, the larger item is
# rounded up to 68KB, giving 17 RCUs, and the smaller item to 20KB, which
# results in 5 RCUs, making the total consumed capacity for this operation
# 22 RCUs.
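# Editorial sketch (not part of this test file): the RCU arithmetic described in the
# comment above, checked against the ~68 KB and ~20 KB effective item sizes it quotes.
import math

def _expected_rcu(item_size_bytes, consistent=True):
    # CEIL(ItemSizeInBytes / 4096), halved for eventually consistent reads.
    return math.ceil(item_size_bytes / 4096) * (1 if consistent else 0.5)

assert _expected_rcu(68 * 1024) + _expected_rcu(20 * 1024) == 22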
def test_batch_get_items_large(test_table_sb):
p1 = random_string()
c1 = random_bytes()
test_table_sb.put_item(Item={'p': p1, 'c': c1, 'a': 'a' * 64 * KB})
p2 = random_string()
c2 = random_bytes()
test_table_sb.put_item(Item={'p': p2, 'c': c2, 'a': 'a' * 16 * KB})
response = test_table_sb.meta.client.batch_get_item(RequestItems = {
test_table_sb.name: {'Keys': [{'p': p1, 'c': c1}, {'p': p2, 'c': c2}], 'ConsistentRead': True}}, ReturnConsumedCapacity='TOTAL')
assert 'ConsumedCapacity' in response
assert 'TableName' in response['ConsumedCapacity'][0]
assert response['ConsumedCapacity'][0]['TableName'] == test_table_sb.name
assert 22 == response['ConsumedCapacity'][0]['CapacityUnits']
# Helper function to generate item_count items and batch write them to the
# table. Returns the list of generated items.
def prepare_items(table, item_factory, item_count=10):
items = []
with table.batch_writer() as writer:
for i in range(item_count):
item = item_factory(i)
items.append(item)
writer.put_item(Item=item)
return items
# This test verifies if querying two tables, each containing multiple ~30 byte
# items, reports the RCU correctly. A single item should consume 1 RCU, because
# the items' sizes are rounded up separately to 1 KB (ConsistentReads), and
# RCU should be reported per table. A variant of test_batch_get_items_large.
def test_batch_get_items_many_small(test_table_s, test_table_sb):
# Each item should be about 30 bytes.
items_sb = prepare_items(test_table_sb, lambda i: {'p': f'item_{i}_' + random_string(), 'c': random_bytes()})
items_s = prepare_items(test_table_s, lambda i: {'p': f'item_{i}_' + random_string()})
response = test_table_sb.meta.client.batch_get_item(RequestItems = {
test_table_sb.name: {'Keys': items_sb, 'ConsistentRead': True},
test_table_s.name: {'Keys': items_s, 'ConsistentRead': True},
}, ReturnConsumedCapacity='TOTAL')
assert 'ConsumedCapacity' in response
assert len(response['ConsumedCapacity']) == 2
expected_tables = {test_table_sb.name, test_table_s.name}
for consumption_per_table in response['ConsumedCapacity']:
assert 'TableName' in consumption_per_table
assert consumption_per_table['CapacityUnits'] == 10, f"Table {consumption_per_table['TableName']} reported {consumption_per_table['CapacityUnits']} RCUs, expected 10"
assert consumption_per_table['TableName'] in expected_tables
expected_tables.remove(consumption_per_table['TableName'])
assert not expected_tables
# This test verifies if querying a single partition reports the RCU correctly.
# This test is similar to test_batch_get_items_many_small.
def test_batch_get_items_many_small_single_partition(test_table_sb):
# Each item should be about 20 bytes.
pk = random_string()
items_sb = prepare_items(test_table_sb, lambda _: {'p': pk, 'c': random_bytes()})
response = test_table_sb.meta.client.batch_get_item(RequestItems = {
test_table_sb.name: {'Keys': items_sb, 'ConsistentRead': True},
}, ReturnConsumedCapacity='TOTAL')
assert 'ConsumedCapacity' in response
assert 'TableName' in response['ConsumedCapacity'][0]
assert response['ConsumedCapacity'][0]['TableName'] == test_table_sb.name
assert 10 == response['ConsumedCapacity'][0]['CapacityUnits']
# Validate that when getting a batch of requests
# from multiple tables we get an RCU for each of the tables.
# We also validate that eventual consistency returns half the units

View File

@@ -758,10 +758,9 @@ void test_chunked_download_data_source(const client_maker_function& client_maker
}
}
};
BOOST_REQUIRE_EXCEPTION(
reader(), storage_io_error, [](const storage_io_error& e) {
return e.what() == "S3 request failed. Code: 16. Reason: "sv;
});
BOOST_REQUIRE_EXCEPTION(reader(), aws::aws_exception, [](const aws::aws_exception& e) {
return e.what() == "Injected ResourceNotFound"sv;
});
#else
testlog.info("Skipping error injection test, as it requires SCYLLA_ENABLE_ERROR_INJECTION to be enabled");
#endif

View File

@@ -25,17 +25,23 @@
SEASTAR_THREAD_TEST_CASE(test_empty) {
db::config dbcfg;
dbcfg.rf_rack_valid_keyspaces(true);
BOOST_REQUIRE_THROW(tools::load_schemas(dbcfg, "").get(), std::exception);
BOOST_REQUIRE_THROW(tools::load_schemas(dbcfg, ";").get(), std::exception);
}
SEASTAR_THREAD_TEST_CASE(test_keyspace_only) {
db::config dbcfg;
dbcfg.rf_rack_valid_keyspaces(true);
BOOST_REQUIRE_EQUAL(tools::load_schemas(dbcfg, "CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1};").get().size(), 0);
}
SEASTAR_THREAD_TEST_CASE(test_single_table) {
db::config dbcfg;
dbcfg.rf_rack_valid_keyspaces(true);
BOOST_REQUIRE_EQUAL(tools::load_schemas(dbcfg, "CREATE TABLE ks.cf (pk int PRIMARY KEY, v int)").get().size(), 1);
BOOST_REQUIRE_EQUAL(tools::load_schemas(dbcfg, "CREATE TABLE ks.cf (pk int PRIMARY KEY, v map<int, int>)").get().size(), 1);
BOOST_REQUIRE_EQUAL(tools::load_schemas(dbcfg, "CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1}; CREATE TABLE ks.cf (pk int PRIMARY KEY, v int);").get().size(), 1);
@@ -43,6 +49,8 @@ SEASTAR_THREAD_TEST_CASE(test_single_table) {
SEASTAR_THREAD_TEST_CASE(test_keyspace_replication_strategy) {
db::config dbcfg;
dbcfg.rf_rack_valid_keyspaces(true);
BOOST_REQUIRE_EQUAL(tools::load_schemas(dbcfg, "CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1}; CREATE TABLE ks.cf (pk int PRIMARY KEY, v int);").get().size(), 1);
BOOST_REQUIRE_EQUAL(tools::load_schemas(dbcfg, "CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}; CREATE TABLE ks.cf (pk int PRIMARY KEY, v int);").get().size(), 1);
BOOST_REQUIRE_EQUAL(tools::load_schemas(dbcfg, "CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'mydc1': 1, 'mydc2': 4}; CREATE TABLE ks.cf (pk int PRIMARY KEY, v int);").get().size(), 1);
@@ -50,6 +58,8 @@ SEASTAR_THREAD_TEST_CASE(test_keyspace_replication_strategy) {
SEASTAR_THREAD_TEST_CASE(test_multiple_tables) {
db::config dbcfg;
dbcfg.rf_rack_valid_keyspaces(true);
BOOST_REQUIRE_EQUAL(tools::load_schemas(dbcfg, "CREATE TABLE ks.cf1 (pk int PRIMARY KEY, v int); CREATE TABLE ks.cf2 (pk int PRIMARY KEY, v int)").get().size(), 2);
BOOST_REQUIRE_EQUAL(tools::load_schemas(dbcfg, "CREATE TABLE ks.cf1 (pk int PRIMARY KEY, v int); CREATE TABLE ks.cf2 (pk int PRIMARY KEY, v int);").get().size(), 2);
BOOST_REQUIRE_EQUAL(tools::load_schemas(
@@ -70,6 +80,8 @@ SEASTAR_THREAD_TEST_CASE(test_multiple_tables) {
SEASTAR_THREAD_TEST_CASE(test_udts) {
db::config dbcfg;
dbcfg.rf_rack_valid_keyspaces(true);
BOOST_REQUIRE_EQUAL(tools::load_schemas(
dbcfg,
"CREATE KEYSPACE ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1}; "
@@ -107,6 +119,8 @@ SEASTAR_THREAD_TEST_CASE(test_udts) {
SEASTAR_THREAD_TEST_CASE(test_dropped_columns) {
db::config dbcfg;
dbcfg.rf_rack_valid_keyspaces(true);
BOOST_REQUIRE_EQUAL(tools::load_schemas(
dbcfg,
"CREATE TABLE ks.cf (pk int PRIMARY KEY, v1 int); "
@@ -177,6 +191,7 @@ void check_views(std::vector<schema_ptr> schemas, std::vector<view_type> views_t
SEASTAR_THREAD_TEST_CASE(test_materialized_view) {
db::config dbcfg;
dbcfg.rf_rack_valid_keyspaces(true);
check_views(
tools::load_schemas(
@@ -219,6 +234,7 @@ SEASTAR_THREAD_TEST_CASE(test_materialized_view) {
SEASTAR_THREAD_TEST_CASE(test_index) {
db::config dbcfg;
dbcfg.rf_rack_valid_keyspaces(true);
check_views(
tools::load_schemas(
@@ -269,6 +285,7 @@ SEASTAR_THREAD_TEST_CASE(test_index) {
SEASTAR_THREAD_TEST_CASE(test_mv_index) {
db::config dbcfg;
dbcfg.rf_rack_valid_keyspaces(true);
check_views(
tools::load_schemas(
@@ -308,6 +325,7 @@ void check_schema_columns(const schema& a, const schema& b, bool check_key_colum
void check_sstable_schema(sstables::test_env& env, std::filesystem::path sst_path, const utils::chunked_vector<mutation>& mutations, bool has_scylla_metadata) {
db::config dbcfg;
dbcfg.rf_rack_valid_keyspaces(true);
auto schema = tools::load_schema_from_sstable(dbcfg, sst_path).get();

View File

@@ -1758,15 +1758,16 @@ SEASTAR_TEST_CASE(time_window_strategy_correctness_test) {
buckets[bound].push_back(sstables[i]);
}
auto state = cf.as_compaction_group_view().get_compaction_strategy_state().get<compaction::time_window_compaction_strategy_state_ptr>();
auto now = api::timestamp_clock::now().time_since_epoch().count();
auto new_bucket = twcs.newest_bucket(cf.as_compaction_group_view(), *control, buckets, 4, 32,
compaction::time_window_compaction_strategy::get_window_lower_bound(duration_cast<seconds>(hours(1)), now));
compaction::time_window_compaction_strategy::get_window_lower_bound(duration_cast<seconds>(hours(1)), now), *state);
// incoming bucket should not be accepted when it has below the min threshold SSTables
BOOST_REQUIRE(new_bucket.empty());
now = api::timestamp_clock::now().time_since_epoch().count();
new_bucket = twcs.newest_bucket(cf.as_compaction_group_view(), *control, buckets, 2, 32,
compaction::time_window_compaction_strategy::get_window_lower_bound(duration_cast<seconds>(hours(1)), now));
compaction::time_window_compaction_strategy::get_window_lower_bound(duration_cast<seconds>(hours(1)), now), *state);
// incoming bucket should be accepted when it is larger than the min threshold SSTables
BOOST_REQUIRE(!new_bucket.empty());
@@ -1801,7 +1802,7 @@ SEASTAR_TEST_CASE(time_window_strategy_correctness_test) {
now = api::timestamp_clock::now().time_since_epoch().count();
new_bucket = twcs.newest_bucket(cf.as_compaction_group_view(), *control, buckets, 4, 32,
compaction::time_window_compaction_strategy::get_window_lower_bound(duration_cast<seconds>(hours(1)), now));
compaction::time_window_compaction_strategy::get_window_lower_bound(duration_cast<seconds>(hours(1)), now), *state);
// new bucket should be trimmed to max threshold of 32
BOOST_REQUIRE(new_bucket.size() == size_t(32));
});
@@ -1861,7 +1862,8 @@ SEASTAR_TEST_CASE(time_window_strategy_size_tiered_behavior_correctness) {
auto control = make_strategy_control_for_test(false);
// past window cannot be compacted because it has a single SSTable
BOOST_REQUIRE(twcs.newest_bucket(cf.as_compaction_group_view(), *control, buckets, min_threshold, max_threshold, now).size() == 0);
auto state = cf.as_compaction_group_view().get_compaction_strategy_state().get<compaction::time_window_compaction_strategy_state_ptr>();
BOOST_REQUIRE(twcs.newest_bucket(cf.as_compaction_group_view(), *control, buckets, min_threshold, max_threshold, now, *state).size() == 0);
// create min_threshold-1 sstables into current time window
for (api::timestamp_type t = 0; t < min_threshold - 1; t++) {
@@ -1874,12 +1876,12 @@ SEASTAR_TEST_CASE(time_window_strategy_size_tiered_behavior_correctness) {
// past window can now be compacted into a single SSTable because it was the previous current (active) window.
// current window cannot be compacted because it has less than min_threshold SSTables
BOOST_REQUIRE(twcs.newest_bucket(cf.as_compaction_group_view(), *control, buckets, min_threshold, max_threshold, now).size() == 2);
BOOST_REQUIRE(twcs.newest_bucket(cf.as_compaction_group_view(), *control, buckets, min_threshold, max_threshold, now, *state).size() == 2);
major_compact_bucket(past_window_ts);
// now past window cannot be compacted again, because it was already compacted into a single SSTable, now it switches to STCS mode.
BOOST_REQUIRE(twcs.newest_bucket(cf.as_compaction_group_view(), *control, buckets, min_threshold, max_threshold, now).size() == 0);
BOOST_REQUIRE(twcs.newest_bucket(cf.as_compaction_group_view(), *control, buckets, min_threshold, max_threshold, now, *state).size() == 0);
// make past window contain more than min_threshold similar-sized SSTables, allowing it to be compacted again.
for (api::timestamp_type t = 1; t < min_threshold; t++) {
@@ -1887,7 +1889,7 @@ SEASTAR_TEST_CASE(time_window_strategy_size_tiered_behavior_correctness) {
}
// now past window can be compacted again because it switched to STCS mode and has more than min_threshold SSTables.
BOOST_REQUIRE(twcs.newest_bucket(cf.as_compaction_group_view(), *control, buckets, min_threshold, max_threshold, now).size() == size_t(min_threshold));
BOOST_REQUIRE(twcs.newest_bucket(cf.as_compaction_group_view(), *control, buckets, min_threshold, max_threshold, now, *state).size() == size_t(min_threshold));
});
}

View File

@@ -13,6 +13,7 @@
#include "sstables/sstable_compressor_factory.hh"
#include "test/lib/log.hh"
#include "test/lib/random_utils.hh"
#include "test/lib/test_utils.hh"
BOOST_AUTO_TEST_SUITE(sstable_compressor_factory_test)
@@ -27,7 +28,7 @@ void test_one_numa_topology(std::span<unsigned> shard_to_numa_mapping) {
testlog.info("Testing NUMA topology {}", shard_to_numa_mapping);
// Create a compressor factory.
SCYLLA_ASSERT(shard_to_numa_mapping.size() == smp::count);
tests::require(shard_to_numa_mapping.size() == smp::count);
auto config = default_sstable_compressor_factory::config{
.numa_config = std::vector(shard_to_numa_mapping.begin(), shard_to_numa_mapping.end()),
};
@@ -68,8 +69,8 @@ void test_one_numa_topology(std::span<unsigned> shard_to_numa_mapping) {
// Check that the dictionary used by this shard lies on the same NUMA node.
// This is important to avoid cross-node memory accesses on the hot path.
BOOST_CHECK_EQUAL(our_numa_node, compressor_numa_node);
BOOST_CHECK_EQUAL(our_numa_node, decompressor_numa_node);
tests::require_equal(our_numa_node, compressor_numa_node);
tests::require_equal(our_numa_node, decompressor_numa_node);
compressor_numa_nodes[this_shard_id()] = compressor_numa_node;
decompressor_numa_nodes[this_shard_id()] = compressor_numa_node;
@@ -79,22 +80,22 @@ void test_one_numa_topology(std::span<unsigned> shard_to_numa_mapping) {
auto compressed_size = compressor->compress(
reinterpret_cast<const char*>(message.data()), message.size(),
reinterpret_cast<char*>(compressed.data()), compressed.size());
BOOST_REQUIRE_GE(compressed_size, 0);
tests::require_greater_equal(compressed_size, 0);
compressed.resize(compressed_size);
// Validate that the recommended dict was actually used.
BOOST_CHECK(compressed.size() < message.size() / 10);
tests::require_less(compressed.size(), message.size() / 10);
auto decompressed = std::vector<char>(message.size());
auto decompressed_size = decompressor->uncompress(
reinterpret_cast<const char*>(compressed.data()), compressed.size(),
reinterpret_cast<char*>(decompressed.data()), decompressed.size());
BOOST_REQUIRE_GE(decompressed_size, 0);
tests::require_greater_equal(decompressed_size, 0);
decompressed.resize(decompressed_size);
// Validate that the roundtrip through compressor and decompressor
// resulted in the original message.
BOOST_CHECK_EQUAL_COLLECTIONS(message.begin(), message.end(), decompressed.begin(), decompressed.end());
tests::require(std::equal(message.begin(), message.end(), decompressed.begin(), decompressed.end()));
})).get();
}
@@ -102,11 +103,11 @@ void test_one_numa_topology(std::span<unsigned> shard_to_numa_mapping) {
// of NUMA nodes.
// This isn't that important, but we don't want to duplicate dictionaries
// within a NUMA node unnecessarily.
BOOST_CHECK_EQUAL(
tests::require_equal(
std::set(compressor_numa_nodes.begin(), compressor_numa_nodes.end()).size(),
std::set(shard_to_numa_mapping.begin(), shard_to_numa_mapping.end()).size()
);
BOOST_CHECK_EQUAL(
tests::require_equal(
std::set(decompressor_numa_nodes.begin(), decompressor_numa_nodes.end()).size(),
std::set(shard_to_numa_mapping.begin(), shard_to_numa_mapping.end()).size()
);

View File

@@ -0,0 +1,286 @@
#
# Copyright (C) 2025-present ScyllaDB
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#
import logging
import math
import pytest
from dtest_class import Tester, create_ks
logger = logging.getLogger(__name__)
# Those are ideal values according to c* specifications
# they should pass
LIMIT_64_K = 64 * 1024
LIMIT_32K = 32 * 1024
LIMIT_128K = 128 * 1024
LIMIT_2GB = 2 * 1024 * 1024 * 1024
MAX_KEY_SIZE = LIMIT_64_K
MAX_BLOB_SIZE = 8388608 # theoretical limit LIMIT_2GB
MAX_COLUMNS = LIMIT_128K
MAX_TUPLES = LIMIT_32K
MAX_BATCH_SIZE = 50 * 1024
MAX_CELLS_COLUMNS = LIMIT_32K
MAX_CELLS_BATCH_SIZE = 50
MAX_CELLS = 16777216
# These are values used to validate the test code
# MAX_KEY_SIZE = 1000
# MAX_BLOB_SIZE = 1000
# MAX_COLUMNS = 1000
# MAX_TUPLES = 1000
# MAX_BATCH_SIZE = 1000
# MAX_CELLS_COLUMNS = 100
# MAX_CELLS_BATCH_SIZE = 100
# MAX_CELLS = 1000
@pytest.mark.single_node
class TestLimits(Tester):
def prepare(self):
"""
Sets up node to test against.
"""
cluster = self.cluster
return cluster
def _do_test_max_key_length(self, session, node, size, expect_failure=False):
print("Testing max key length for {}.{}".format(size, " Expected failure..." if expect_failure else ""))
key_name = "k" * size
c = f"CREATE TABLE test1 ({key_name} int PRIMARY KEY)"
if expect_failure:
expected_error = r"Key size too large: \d+ > 65535"
self.ignore_log_patterns += [expected_error]
with pytest.raises(Exception, match=expected_error):
session.execute(c)
return
session.execute(c)
session.execute("insert into ks.test1 (%s) values (1);" % key_name)
session.execute("insert into ks.test1 (%s) values (2);" % key_name)
node.flush()
# Select
res = session.execute(
"""
SELECT * FROM ks.test1
WHERE %s=1
"""
% key_name
)
assert len(res.current_rows) == 1
res = session.execute(
"""
SELECT * FROM ks.test1
WHERE %s=2
"""
% key_name
)
assert len(res.current_rows) == 1
session.execute("""DROP TABLE test1""")
def test_max_key_length(self):
cluster = self.prepare()
cluster.populate(1).start()
node = cluster.nodelist()[0]
session = self.patient_cql_connection(node)
create_ks(session, "ks", 1)
# biggest that will currently work in scylla
# key_name = "k" * 65526
self._do_test_max_key_length(session, node, MAX_KEY_SIZE, expect_failure=True)
self._do_test_max_key_length(session, node, MAX_KEY_SIZE - 9, expect_failure=True)
self._do_test_max_key_length(session, node, MAX_KEY_SIZE - 10)
size = MAX_KEY_SIZE // 2
while size >= 1:
self._do_test_max_key_length(session, node, size)
size >>= 3
def _do_test_blob_size(self, session, node, size):
print("Testing blob size %i" % size)
blob_a = "a" * size
blob_b = "b" * size
session.execute(
"""
CREATE TABLE test1 (
user ascii PRIMARY KEY,
payload blob,
)
"""
)
session.execute("insert into ks.test1 (user, payload) values ('tintin', textAsBlob('%s'));" % blob_a)
session.execute("insert into ks.test1 (user, payload) values ('milou', textAsBlob('%s'));" % blob_b)
node.flush()
# Select
res = session.execute(
"""
SELECT * FROM ks.test1
WHERE user='tintin'
"""
)
assert len(list(res)) == 1
res = session.execute(
"""
SELECT * FROM ks.test1
WHERE user='milou'
"""
)
assert len(list(res)) == 1
session.execute("""DROP TABLE test1""")
def test_max_column_value_size(self):
cluster = self.prepare()
cluster.populate(1).start()
node = cluster.nodelist()[0]
session = self.patient_cql_connection(node)
create_ks(session, "ks", 1)
size = 1
for i in range(int(math.log(MAX_BLOB_SIZE, 2))):
size <<= 1
self._do_test_blob_size(session, node, size - 1)
def _do_test_max_tuples(self, session, node, count):
print("Testing max tuples for %i" % count)
t = ""
v = ""
for i in range(count):
t += "int, "
v += "1, "
t = t[:-2]
v = v[:-2]
c = (
"""
CREATE TABLE stuff (
k int PRIMARY KEY,
v frozen<tuple<%s>>
);
"""
% t
)
session.execute(c)
c = "INSERT INTO stuff (k, v) VALUES(0, (%s));" % v
session.execute(c)
c = "SELECT * FROM STUFF;"
res = session.execute(c)
assert len(res.current_rows) == 1
session.execute("""DROP TABLE stuff""")
def test_max_tuple(self):
cluster = self.prepare()
cluster.populate(1).start()
node = cluster.nodelist()[0]
session = self.patient_cql_connection(node)
create_ks(session, "ks", 1)
count = 1
for i in range(int(math.log(MAX_TUPLES, 2))):
count <<= 1
self._do_test_max_tuples(session, node, count - 1)
def _do_test_max_batch_size(self, session, node, size):
print("Testing max batch size for size=%i" % size)
c = """
CREATE TABLE stuff (
k int PRIMARY KEY,
v text
);
"""
session.execute(c)
c = "BEGIN UNLOGGED BATCH\n"
row_size = 1000
overhead = 100
blob = (row_size - overhead) * "x"
rows = size // row_size
for i in range(rows):
c += "INSERT INTO stuff (k, v) VALUES(%i, '%s')\n" % (i, blob)
c += "APPLY BATCH;\n"
session.execute(c)
c = "SELECT * FROM STUFF;"
res = session.execute(c)
assert len(list(res)) == rows
session.execute("""DROP TABLE STUFF""")
def test_max_batch_size(self):
cluster = self.prepare()
cluster.populate(1).start()
node = cluster.nodelist()[0]
session = self.patient_cql_connection(node)
create_ks(session, "ks", 1)
size = 1
for i in range(int(math.log(MAX_BATCH_SIZE, 2))):
size <<= 1
self._do_test_max_batch_size(session, node, size - 1)
def _do_test_max_cell_count(self, session, cells):
print("Testing max cells count for %i" % cells)
keys = ""
keys_create = ""
columns = MAX_CELLS_COLUMNS
for i in range(columns):
keys += "key" + str(i) + ", "
keys_create += "key" + str(i) + " int, "
values = "1, " * columns
c = """CREATE TABLE test1 (%s blub int PRIMARY KEY,)""" % keys_create
session.execute(c)
batch_size = MAX_CELLS_BATCH_SIZE
rows = cells // columns
c = "BEGIN UNLOGGED BATCH\n"
for i in range(rows):
c += "insert into ks.test1 (%s blub) values (%s %i);\n" % (keys, values, i)
if i == rows - 1 or (i + 1) % batch_size == 0:
c += "APPLY BATCH;\n"
session.execute(c)
c = "BEGIN UNLOGGED BATCH\n"
session.execute("""DROP TABLE test1""")
def test_max_cells(self):
if self.cluster.scylla_mode == "debug":
pytest.skip("client times out in debug mode")
cluster = self.prepare()
cluster.set_configuration_options(values={"query_tombstone_page_limit": 9999999, "batch_size_warn_threshold_in_kb": 1024 * 1024, "batch_size_fail_threshold_in_kb": 1024 * 1024, "commitlog_segment_size_in_mb": 64})
cluster.populate(1).start(jvm_args=["--smp", "1", "--memory", "2G", "--logger-log-level", "lsa-timing=debug"])
node = cluster.nodelist()[0]
session = self.patient_cql_connection(node)
create_ks(session, "ks", 1)
cells = 1
for i in range(int(math.log(MAX_CELLS, 2))):
cells <<= 1
self._do_test_max_cell_count(session, cells - 1)

View File

@@ -0,0 +1,120 @@
#
# Copyright (C) 2025-present ScyllaDB
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#
import asyncio
import logging
import pytest
from cassandra.cluster import Session as CassandraSession
from cassandra.protocol import InvalidRequest
from test.pylib.manager_client import ManagerClient
logger = logging.getLogger(__name__)
@pytest.mark.asyncio
@pytest.mark.parametrize("schema_kind", ["view", "index"])
# Views no longer depend on the experimental feature `views-with-tablets`,
# but let's keep these test cases to make sure it's really not needed anymore.
@pytest.mark.parametrize("views_with_tablets", [False, True])
@pytest.mark.parametrize("rf_rack_valid_keyspaces", [False, True])
async def test_mv_and_index_restrictions_in_tablet_keyspaces(manager: ManagerClient, schema_kind: str,
views_with_tablets: bool, rf_rack_valid_keyspaces: bool):
"""
Verify that creating a materialized view or a secondary index in a tablet-based keyspace
is only possible when the configuration option `rf_rack_valid_keyspaces` is enabled.
"""
async def create_mv_or_index(cql: CassandraSession):
if schema_kind == "view":
await cql.run_async("CREATE MATERIALIZED VIEW ks.mv "
"AS SELECT * FROM ks.t "
"WHERE p IS NOT NULL AND v IS NOT NULL "
"PRIMARY KEY (v, p)")
elif schema_kind == "index":
await cql.run_async("CREATE INDEX myindex ON ks.t(v)")
else:
assert False, "Unknown schema kind"
async def try_pass(cql: CassandraSession):
try:
await cql.run_async(f"CREATE KEYSPACE ks WITH replication = "
"{'class': 'NetworkTopologyStrategy', 'replication_factor': 1} "
"AND tablets = {'enabled': true}")
await cql.run_async(f"CREATE TABLE ks.t (p int PRIMARY KEY, v int)")
await create_mv_or_index(cql)
finally:
await cql.run_async(f"DROP KEYSPACE IF EXISTS ks")
async def try_fail(cql: CassandraSession):
err = "Materialized views and secondary indexes are not supported on base tables with tablets. " \
"To be able to use them, enable the configuration option `rf_rack_valid_keyspaces` and " \
"make sure that the cluster feature `VIEWS_WITH_TABLETS` is enabled."
with pytest.raises(InvalidRequest, match=err):
await try_pass(cql)
feature = ["views-with-tablets"] if views_with_tablets else []
config = {"experimental_features": feature, "rf_rack_valid_keyspaces": rf_rack_valid_keyspaces}
srv = await manager.server_add(config=config)
# Necessary because we're restarting the node multiple times.
cql, _ = await manager.get_ready_cql([srv])
logger.debug("Obtained CassandraSession object")
# We just want to validate the statements. We don't need to wait.
assert hasattr(cql.cluster, "max_schema_agreement_wait")
cql.cluster.max_schema_agreement_wait = 0
logger.debug("Set max_schema_agreement_wait to 0")
if rf_rack_valid_keyspaces:
await try_pass(cql)
logger.debug("try_pass finished successfully")
else:
await try_fail(cql)
logger.debug("try_fail finished successfully")
@pytest.mark.asyncio
@pytest.mark.parametrize("view_type", ["view", "index"])
async def test_view_startup(manager: ManagerClient, view_type: str):
"""
Verify that starting a node with materialized views in a tablet-based
keyspace when the configuration option `rf_rack_valid_keyspaces` is disabled
leads to a warning.
"""
srv = await manager.server_add(config={"rf_rack_valid_keyspaces": True})
cql = manager.get_cql()
await cql.run_async("CREATE KEYSPACE ks WITH replication = "
"{'class': 'NetworkTopologyStrategy', 'replication_factor': 1} "
"AND tablets = {'enabled': true}")
await cql.run_async("CREATE TABLE ks.t (p int PRIMARY KEY, v int)")
if view_type == "view":
await cql.run_async("CREATE MATERIALIZED VIEW ks.mv "
"AS SELECT * FROM ks.t "
"WHERE p IS NOT NULL AND v IS NOT NULL "
"PRIMARY KEY (v, p)")
elif view_type == "index":
await cql.run_async("CREATE INDEX i ON ks.t(v)")
else:
logger.error(f"Unexpected view type: {view_type}")
assert False
await manager.server_stop(srv.server_id)
await manager.server_update_config(srv.server_id, "rf_rack_valid_keyspaces", False)
log = await manager.server_open_log(srv.server_id)
mark = await log.mark()
start_task = asyncio.create_task(manager.server_start(srv.server_id))
err = "Some of the existing keyspaces violate the requirements for using materialized " \
"views or secondary indexes. Those features require enabling the configuration " \
"option `rf_rack_valid_keyspaces` and the cluster feature `VIEWS_WITH_TABLETS`. " \
"The keyspaces that violate that condition: ks"
await log.wait_for(err, from_mark=mark)
await start_task

View File

@@ -256,6 +256,11 @@ async def test_mv_pairing_during_replace(manager: ManagerClient):
@pytest.mark.asyncio
@pytest.mark.parametrize("delayed_replica", ["base", "mv"])
@pytest.mark.parametrize("altered_dc", ["dc1", "dc2"])
# FIXME: The test relies on cross-rack tablet migrations. They're forbidden when the configuration option
# `rf_rack_valid_keyspaces` is enabled. On the other hand, materialized views in tablet-based keyspaces
# require the configuration option to be used.
# Hence, we need to rewrite this test.
@pytest.mark.skip
@skip_mode('release', 'error injections are not supported in release mode')
async def test_mv_rf_change(manager: ManagerClient, delayed_replica: str, altered_dc: str):
servers = []
@@ -331,8 +336,8 @@ async def test_mv_first_replica_in_dc(manager: ManagerClient, delayed_replica: s
# If we run the test with more than 1 shard and the tablet for the view table gets allocated on the same shard as the tablet of the base table,
# we'll perform an intranode migration of one of these tablets to the other shard. This migration can be confused with the migration to the
# new dc in the "first_migration_done()" below. To avoid this, run servers with only 1 shard.
servers.append(await manager.server_add(cmdline=['--smp', '1'], config={'rf_rack_valid_keyspaces': False}, property_file={'dc': f'dc1', 'rack': 'myrack1'}))
servers.append(await manager.server_add(cmdline=['--smp', '1'], config={'rf_rack_valid_keyspaces': False}, property_file={'dc': f'dc2', 'rack': 'myrack1'}))
servers.append(await manager.server_add(cmdline=['--smp', '1'], property_file={'dc': f'dc1', 'rack': 'myrack1'}))
servers.append(await manager.server_add(cmdline=['--smp', '1'], property_file={'dc': f'dc2', 'rack': 'myrack1'}))
cql = manager.get_cql()
await cql.run_async("CREATE KEYSPACE IF NOT EXISTS ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 1} AND tablets = {'initial': 1}")

View File

@@ -614,8 +614,15 @@ CLUSTER_EVENTS: tuple[ClusterEventType, ...] = (
sleep_for_30_seconds,
add_new_table,
drop_table,
add_index,
drop_index,
# FIXME: We omit creating or dropping indexes because the random_failures
# tests still haven't been adjusted to work with `rf_rack_valid_keyspaces`.
# That option is a requirement for using materialized views
# in tablet-based keyspaces, so let's skip them.
#
# add_index,
# drop_index,
add_new_keyspace,
drop_keyspace,
add_cdc,

View File

@@ -25,6 +25,7 @@ run_first:
skip_in_release:
- test_raft_cluster_features
- test_cluster_features
- dtest/limits_test
skip_in_debug:
- test_shutdown_hang
- test_replace

View File

@@ -14,10 +14,16 @@ from typing import Any
from test.cluster.conftest import skip_mode
from test.pylib.internal_types import ServerInfo
from test.pylib.manager_client import ManagerClient
from test.pylib.rest_client import ScyllaMetrics
# main logger
logger = logging.getLogger(__name__)
async def get_metrics(manager: ManagerClient, servers: list[ServerInfo]) -> list[ScyllaMetrics]:
return await asyncio.gather(*[manager.metrics.query(s.ip_addr) for s in servers])
def get_io_read_ops(metrics: list[ScyllaMetrics]) -> int:
return int(sum([m.get("scylla_io_queue_total_read_ops") for m in metrics]))
async def live_update_config(manager: ManagerClient, servers: list[ServerInfo], key: str, value: Any):
cql, hosts = await manager.get_ready_cql(servers)
await asyncio.gather(*[manager.server_update_config(s.server_id, key, value) for s in servers])
@@ -39,7 +45,18 @@ async def test_bti_index_enable(manager: ManagerClient) -> None:
ks_name = "ks"
cf_name = "t"
servers = await manager.servers_add(n_servers, config = {
# We run with `--smp=1` because the test uses CQL tracing.
#
# Trace events are written to trace tables asynchronously w.r.t.
# the traced statements, and AFAIU there's currently no way
# (other than polling) to wait for them to appear.
#
# The Python driver has a polling mechanism for traces,
# but it is only reliable if the entire statement runs on a single shard.
# (Because it only waits for the coordinator, and not for replicas, to write their events).
# So let's just make this a single-shard test, we aren't testing
# any multi-shard mechanisms here anyway.
servers = await manager.servers_add(n_servers, cmdline=['--smp=1'], config = {
'error_injections_at_startup': [
{
'name': 'suppress_features',
@@ -98,7 +115,9 @@ async def test_bti_index_enable(manager: ManagerClient) -> None:
async def test_bti_usage_during_reads(should_use_bti: bool, use_cache: bool):
select = select_with_cache if use_cache else select_without_cache
metrics_before = await get_metrics(manager, servers)
select_result = cql.execute(select, (chosen_pk, chosen_ck), trace=True)
metrics_after = await get_metrics(manager, servers)
row = select_result.one()
assert row.pk == chosen_pk
assert row.ck == chosen_ck
@@ -113,14 +132,25 @@ async def test_bti_index_enable(manager: ManagerClient) -> None:
seen_partitions = seen_partitions or "Partitions.db" in event.description
seen_rows = seen_rows or "Rows.db" in event.description
seen_index = seen_index or "Index.db" in event.description
if should_use_bti:
assert not seen_index, "Index.db was used despite BTI preference"
assert seen_partitions, "Partitions.db was not used despite BTI preference"
assert seen_rows, "Rows.db was not used despite BTI preference"
else:
assert seen_index, "Index.db was not used despite BIG preference"
assert not seen_partitions, "Partitions.db was used despite BIG preference"
assert not seen_rows, "Rows.db was used despite BIG preference"
if not use_cache:
if should_use_bti:
assert not seen_index, "Index.db was used despite BTI preference"
assert seen_partitions, "Partitions.db was not used despite BTI preference"
assert seen_rows, "Rows.db was not used despite BTI preference"
else:
assert seen_index, "Index.db was not used despite BIG preference"
assert not seen_partitions, "Partitions.db was used despite BIG preference"
assert not seen_rows, "Rows.db was used despite BIG preference"
# Test that BYPASS CACHE does force disk reads.
io_read_ops = get_io_read_ops(metrics_after) - get_io_read_ops(metrics_before)
if should_use_bti:
# At least one read for Partitions.db, Rows.db, Data.db
assert io_read_ops >= 3
else:
# At least one read in Index.db (main index), Index.db (promoted index), Data.db
assert io_read_ops >= 3
logger.info("Step 3: Checking for BTI files (should not exist, because cluster feature is suppressed)")
await test_files_presence(bti_should_exist=False, big_should_exist=True)
@@ -143,7 +173,10 @@ async def test_bti_index_enable(manager: ManagerClient) -> None:
await asyncio.gather(*[manager.api.keyspace_upgrade_sstables(s.ip_addr, ks_name) for s in servers])
logger.info("Step 7: Checking for BTI files (should exist)")
await test_files_presence(bti_should_exist=True, big_should_exist=False)
await test_bti_usage_during_reads(should_use_bti=True, use_cache=False)
await test_bti_usage_during_reads(should_use_bti=True, use_cache=True)
# Test that BYPASS CACHE does its thing.
for _ in range(3):
await test_bti_usage_during_reads(should_use_bti=True, use_cache=False)
await test_bti_usage_during_reads(should_use_bti=True, use_cache=True)
manager.driver_close()
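Note (illustrative, not part of the change): the BYPASS CACHE check above is just a metric-delta assertion built from the get_metrics/get_io_read_ops helpers defined at the top of this file. A minimal sketch, assuming select_without_cache is a prepared statement ending in BYPASS CACHE and chosen_pk/chosen_ck are the keys used by the test:

metrics_before = await get_metrics(manager, servers)
cql.execute(select_without_cache, (chosen_pk, chosen_ck))
metrics_after = await get_metrics(manager, servers)
# get_io_read_ops sums scylla_io_queue_total_read_ops across all servers,
# so any positive delta shows the read reached the disk instead of the row cache.
assert get_io_read_ops(metrics_after) - get_io_read_ops(metrics_before) > 0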

View File

@@ -9,6 +9,8 @@ from test.pylib.manager_client import ManagerClient
import asyncio
import pytest
from test.pylib.util import wait_for_first_completed
@pytest.mark.asyncio
async def test_different_group0_ids(manager: ManagerClient):
@@ -40,13 +42,11 @@ async def test_different_group0_ids(manager: ManagerClient):
log_file_b = await manager.server_open_log(scylla_b.server_id)
# Wait for a gossip round to finish
_, pending = await asyncio.wait([
asyncio.create_task(log_file_b.wait_for(f'InetAddress {scylla_a.ip_addr} is now UP')), # The second node joins the cluster
asyncio.create_task(log_file_a.wait_for(f'Group0Id mismatch')) # The first node discards gossip from the second node
], return_when=asyncio.FIRST_COMPLETED)
await wait_for_first_completed([
log_file_b.wait_for(f'InetAddress {scylla_a.ip_addr} is now UP'), # The second node joins the cluster
log_file_a.wait_for(f'Group0Id mismatch') # The first node discards gossip from the second node
])
for task in pending:
task.cancel()
# Check if decommissioning the second node fails.
# Repair service throws a runtime exception "zero replica after the removal"

View File

@@ -408,6 +408,10 @@ async def test_hint_to_pending(manager: ManagerClient):
assert await_sync_point(servers[0], sync_point, 30)
await manager.api.message_injection(servers[0].ip_addr, "pause_after_streaming_tablet")
await asyncio.wait([tablet_migration])
done, pending = await asyncio.wait([tablet_migration])
for task in pending:
task.cancel()
for task in done:
task.result()
assert list(await cql.run_async(f"SELECT v FROM {table} WHERE pk = 0")) == [(0,)]
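Note (illustrative, not from the change): asyncio.wait never propagates task exceptions on its own; only calling result() on the finished tasks re-raises them, which is what the new lines above do. A minimal sketch with a hypothetical failing task:

async def might_fail():
    raise RuntimeError("tablet migration failed")

task = asyncio.create_task(might_fail())
await asyncio.wait([task])   # returns normally even though the task failed
done, pending = await asyncio.wait([task])
for t in pending:
    t.cancel()
for t in done:
    t.result()               # re-raises RuntimeError, so the failure is not silently swallowed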

View File

@@ -655,7 +655,7 @@ async def test_tablet_repair_with_incremental_option(manager: ManagerClient):
assert read1 == 0
assert skip2 == 0
assert read2 > 0
await do_repair_and_check(None, 1, rf'Starting tablet repair by API .* incremental_mode=regular.*', check1)
await do_repair_and_check(None, 1, rf'Starting tablet repair by API .* incremental_mode=incremental.*', check1)
def check2(skip1, read1, skip2, read2):
assert skip1 == skip2
@@ -665,7 +665,7 @@ async def test_tablet_repair_with_incremental_option(manager: ManagerClient):
def check3(skip1, read1, skip2, read2):
assert skip1 < skip2
assert read1 == read2
await do_repair_and_check('regular', 1, rf'Starting tablet repair by API .* incremental_mode=regular.*', check3)
await do_repair_and_check('incremental', 1, rf'Starting tablet repair by API .* incremental_mode=incremental.*', check3)
def check4(skip1, read1, skip2, read2):
assert skip1 == skip2

View File

@@ -6,7 +6,7 @@ from test.cluster.conftest import skip_mode
from test.pylib.manager_client import ManagerClient
from test.pylib.random_tables import RandomTables, Column, IntType
from test.pylib.rest_client import inject_error_one_shot
from test.pylib.util import wait_for_cql_and_get_hosts
from test.pylib.util import wait_for_first_completed
import asyncio
from datetime import datetime, timedelta
@@ -177,11 +177,7 @@ async def test_long_query_timeout_without_failure_erm(request, manager: ManagerC
server_log = await manager.server_open_log(server.server_id)
await server_log.wait_for("mapreduce_pause_parallel_dispatch: waiting for message")
async with asyncio.TaskGroup() as tg:
log_watch_tasks = [tg.create_task(wait_for_log_on_any_node(server)) for server in servers]
_, pending = await asyncio.wait(log_watch_tasks, return_when=asyncio.FIRST_COMPLETED)
for t in pending:
t.cancel()
await wait_for_first_completed([wait_for_log_on_any_node(server) for server in servers])
if enable_tablets:
logger.info("Add new node - ERM should not be blocked")

View File

@@ -199,3 +199,47 @@ async def test_shutdown_drain_during_compaction(manager: ManagerClient):
# For dropping the keyspace
await manager.server_start(server.server_id)
await reconnect_driver(manager)
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_alter_compaction_strategy_during_compaction(manager: ManagerClient):
"""
Test that ALTERing the compaction strategy during compaction doesn't crash the server
1. Create a single node cluster.
2. Create a table with compaction strategy = TWCS and populate it.
3. Inject error to make compaction wait when getting sstables for compaction.
4. Start compaction, wait for it to reach injection point
5. ALTER table to change compaction strategy to LCS
6. Let compaction proceed and finish
7. Verify no unexpected errors in logs
"""
node1 = await manager.server_add(cmdline=['--logger-log-level', 'compaction=debug'])
cql = manager.get_cql()
async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1};") as ks:
logger.info("Create table")
cf = "t1"
await cql.run_async(f"CREATE TABLE {ks}.{cf} (pk int, ck int, val int, PRIMARY KEY (pk, ck)) WITH compaction={{'class': 'TimeWindowCompactionStrategy'}}")
logger.info("Inject error to pause compaction midway")
injection_name="twcs_get_sstables_for_compaction"
await manager.api.enable_injection(node_ip=node1.ip_addr, injection=injection_name, one_shot=False)
server_log = await manager.server_open_log(node1.server_id)
logger.info("Populate table and start compaction")
insert_stmt = cql.prepare(f"INSERT INTO {ks}.{cf} (pk, ck, val) VALUES (?, ?, ?)")
for i in range(20):
for j in range(100):
await cql.run_async(insert_stmt, (i, j, i * j))
compaction_task = asyncio.create_task(manager.api.keyspace_compaction(node_ip=node1.ip_addr, keyspace=ks, table=cf))
logger.info("Waiting for compaction to be suspended")
await server_log.wait_for("twcs_get_sstables_for_compaction: waiting for message")
logger.info("Alter compaction strategy")
await cql.run_async(f"ALTER TABLE {ks}.{cf} WITH compaction = {{'class': 'LeveledCompactionStrategy'}};")
logger.info("Resume compaction and wait for it to finish")
await manager.api.message_injection(node_ip=node1.ip_addr, injection=injection_name)
await manager.api.disable_injection(node_ip=node1.ip_addr, injection=injection_name)
await compaction_task
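Aside (a sketch, not code from this change): the enable_injection -> wait for the "<injection>: waiting for message" log line -> message_injection -> disable_injection sequence used above recurs in several tests in this series. One possible way to factor it, assuming only the manager API calls already shown in this file:

import asyncio
from contextlib import asynccontextmanager

@asynccontextmanager
async def paused_at_injection(manager, server, log, injection, start_work):
    # Arm the injection, kick off the work that will hit it, and yield once the
    # server reports that it is parked at the injection point.
    await manager.api.enable_injection(node_ip=server.ip_addr, injection=injection, one_shot=False)
    work = asyncio.create_task(start_work())
    await log.wait_for(f"{injection}: waiting for message")
    try:
        yield work
    finally:
        # Release the paused fiber, disarm the injection, and wait for the work to finish.
        await manager.api.message_injection(node_ip=server.ip_addr, injection=injection)
        await manager.api.disable_injection(node_ip=server.ip_addr, injection=injection)
        await work

With such a helper the test body would reduce to starting the compaction inside the context manager and running the ALTER TABLE statement while the compaction is parked.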

View File

@@ -236,15 +236,20 @@ async def test_can_restart(manager: ManagerClient, raft_op_timeout: int) -> None
await asyncio.gather(*(manager.server_update_config(srv.server_id, 'group0_raft_op_timeout_in_ms', raft_op_timeout)
for srv in servers))
logger.info(f"Restarting {servers}")
for idx, srv in enumerate(servers):
logger.info(f"Restarting {servers[:2]} with no group 0 quorum")
for idx, srv in enumerate(servers[:2]):
await manager.server_start(srv.server_id)
# Make sure that the first two nodes restart without group 0 quorum.
if idx < 2:
with pytest.raises(Exception, match="raft operation \\[read_barrier\\] timed out, "
"there is no raft quorum, total voters count 5, "
f"alive voters count {idx + 1}"):
await read_barrier(manager.api, srv.ip_addr)
else:
with pytest.raises(Exception, match="raft operation \\[read_barrier\\] timed out, "
"there is no raft quorum, total voters count 5, "
f"alive voters count {idx + 1}"):
await read_barrier(manager.api, srv.ip_addr)
# Increase the timeout back to 300s to ensure the new group 0 leader is elected before the first read barrier below
# times out.
await asyncio.gather(*(manager.server_update_config(srv.server_id, 'group0_raft_op_timeout_in_ms', 300000)
for srv in servers))
logger.info(f"Restarting {servers[2:]} with group 0 quorum")
for srv in servers[2:]:
await manager.server_start(srv.server_id)
await read_barrier(manager.api, srv.ip_addr)

View File

@@ -19,7 +19,7 @@ from test.cluster.test_group0_schema_versioning import get_group0_schema_version
@pytest.mark.asyncio
async def test_raft_recovery_entry_lose(manager: ManagerClient):
async def test_raft_recovery_entry_loss(manager: ManagerClient):
"""
Test that the Raft-based recovery procedure works correctly if some committed group 0 entry has been permanently
lost (it has been committed only by dead nodes).
@@ -39,6 +39,9 @@ async def test_raft_recovery_entry_lose(manager: ManagerClient):
5. Check that node 1 has moved its group 0 state to v2.
6. Remove nodes 3-5 from topology using the standard removenode procedure.
7. Add a new node (a sanity check verifying that the cluster is functioning properly).
Additionally, verify that no schema pulls take place during the recovery procedure at the end of the test. This is
a regression test for https://github.com/scylladb/scylladb/issues/26569.
"""
logging.info('Adding initial servers')
servers = await manager.servers_add(5)
@@ -158,10 +161,17 @@ async def test_raft_recovery_entry_lose(manager: ManagerClient):
logging.info('Adding a new server')
new_server = await manager.server_add()
live_servers.append(new_server)
hosts = await wait_for_cql_and_get_hosts(cql, live_servers + [new_server], time.time() + 60)
hosts = await wait_for_cql_and_get_hosts(cql, live_servers, time.time() + 60)
logging.info(f'Performing consistency checks after adding {new_server}')
await wait_for_cdc_generations_publishing(cql, hosts, time.time() + 60)
await check_token_ring_and_group0_consistency(manager)
await check_system_topology_and_cdc_generations_v3_consistency(manager, hosts, ignored_hosts=dead_hosts)
logging.info(f'Checking that there were no schema pulls on {live_servers}')
log_files = await asyncio.gather(*[manager.server_open_log(srv.server_id) for srv in live_servers])
for log_file in log_files:
matches = await log_file.grep('Requesting schema pull') + await log_file.grep('Pulling schema')
assert not matches

View File

@@ -1124,9 +1124,7 @@ async def test_two_tablets_concurrent_repair_and_migration(manager: ManagerClien
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", repair_replicas.last_token)
async def migration_task():
done, pending = await asyncio.wait([asyncio.create_task(log.wait_for('Started to repair', from_mark=mark)) for log, mark in zip(logs, marks)], return_when=asyncio.FIRST_COMPLETED)
for task in pending:
task.cancel()
await wait_for_first_completed([log.wait_for('Started to repair', from_mark=mark) for log, mark in zip(logs, marks)])
await manager.api.move_tablet(servers[0].ip_addr, ks, "test", migration_replicas.replicas[0][0], migration_replicas.replicas[0][1], migration_replicas.replicas[0][0], 0 if migration_replicas.replicas[0][1] != 0 else 1, migration_replicas.last_token)
[await manager.api.message_injection(s.ip_addr, injection) for s in servers]
[await manager.api.disable_injection(s.ip_addr, injection) for s in servers]
@@ -1233,6 +1231,8 @@ async def test_two_tablets_concurrent_repair_and_migration_repair_writer_level(m
cql = await safe_rolling_restart(manager, [servers[0]], with_down=insert_with_down)
await wait_for_cql_and_get_hosts(manager.get_cql(), servers, time.time() + 30)
all_replicas = await get_all_tablet_replicas(manager, servers[1], ks, "test")
migration_replicas = all_replicas[0]

View File

@@ -1703,3 +1703,94 @@ async def test_split_correctness_on_tablet_count_change(manager: ManagerClient):
await manager.api.message_injection(server.ip_addr, "splitting_mutation_writer_switch_wait")
await asyncio.sleep(.1)
await manager.api.message_injection(server.ip_addr, "merge_completion_fiber")
# Reproducer for https://github.com/scylladb/scylladb/issues/26041.
@pytest.mark.parametrize("primary_replica_only", [False, True])
@skip_mode('release', 'error injections are not supported in release mode')
async def test_tablet_load_and_stream_and_split_synchronization(manager: ManagerClient, primary_replica_only):
logger.info("Bootstrapping cluster")
cmdline = [
'--logger-log-level', 'storage_service=debug',
'--logger-log-level', 'table=debug',
'--smp', '1',
]
servers = [await manager.server_add(config={
'tablet_load_stats_refresh_interval_in_seconds': 1
}, cmdline=cmdline)]
server = servers[0]
await manager.api.disable_tablet_balancing(servers[0].ip_addr)
cql = manager.get_cql()
initial_tablets = 1
async with new_test_keyspace(manager, f"WITH replication = {{'class': 'NetworkTopologyStrategy', 'replication_factor': 1}}") as ks:
await cql.run_async(f"CREATE TABLE {ks}.test (pk int PRIMARY KEY, c int) WITH tablets = {{'min_tablet_count': {initial_tablets}}};")
keys = range(100)
await asyncio.gather(*[cql.run_async(f"INSERT INTO {ks}.test (pk, c) VALUES ({k}, {k});") for k in keys])
async def check(ks_name: str):
logger.info("Checking table")
cql = manager.get_cql()
rows = await cql.run_async(f"SELECT * FROM {ks_name}.test BYPASS CACHE;")
assert len(rows) == len(keys)
for r in rows:
assert r.c == r.pk
await manager.api.flush_keyspace(servers[0].ip_addr, ks)
await check(ks)
node_workdir = await manager.server_get_workdir(servers[0].server_id)
cql = await safe_server_stop_gracefully(manager, servers[0].server_id)
table_dir = glob.glob(os.path.join(node_workdir, "data", ks, "test-*"))[0]
logger.info(f"Table dir: {table_dir}")
def move_sstables_to_upload(table_dir: str):
logger.info("Moving sstables to upload dir")
table_upload_dir = os.path.join(table_dir, "upload")
for sst in glob.glob(os.path.join(table_dir, "*-Data.db")):
for src_path in glob.glob(os.path.join(table_dir, sst.removesuffix("-Data.db") + "*")):
dst_path = os.path.join(table_upload_dir, os.path.basename(src_path))
logger.info(f"Moving sstable file {src_path} to {dst_path}")
os.rename(src_path, dst_path)
move_sstables_to_upload(table_dir)
await manager.server_start(servers[0].server_id)
cql = manager.get_cql()
await wait_for_cql_and_get_hosts(cql, servers, time.time() + 60)
rows = await cql.run_async(f"SELECT * FROM {ks}.test BYPASS CACHE;")
assert len(rows) == 0
await manager.api.disable_tablet_balancing(servers[0].ip_addr)
await manager.api.enable_injection(servers[0].ip_addr, "tablet_resize_finalization_post_barrier", one_shot=True)
s1_log = await manager.server_open_log(servers[0].server_id)
s1_mark = await s1_log.mark()
await manager.api.enable_tablet_balancing(servers[0].ip_addr)
await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': {initial_tablets * 2}}}")
await s1_log.wait_for(f"tablet_resize_finalization_post_barrier: waiting", from_mark=s1_mark)
await manager.api.enable_injection(servers[0].ip_addr, "stream_mutation_fragments", one_shot=True)
load_and_stream_task = asyncio.create_task(manager.api.load_new_sstables(servers[0].ip_addr, ks, "test", primary_replica_only))
await s1_log.wait_for(f"Loading new SSTables for keyspace", from_mark=s1_mark)
await manager.api.message_injection(server.ip_addr, "tablet_resize_finalization_post_barrier")
await s1_log.wait_for('Detected tablet split for table', from_mark=s1_mark)
await s1_log.wait_for(f"stream_mutation_fragments: waiting", from_mark=s1_mark)
await manager.api.message_injection(server.ip_addr, "stream_mutation_fragments")
await load_and_stream_task
await check(ks)

View File

@@ -14,7 +14,7 @@ from typing import List
from test.pylib.log_browsing import ScyllaLogFile
from test.pylib.manager_client import ManagerClient
from test.pylib.scylla_cluster import gather_safely
from test.pylib.util import wait_for_cql_and_get_hosts
from test.pylib.util import wait_for_cql_and_get_hosts, wait_for_first_completed
from test.cluster.conftest import skip_mode
from test.cluster.util import reconnect_driver, enter_recovery_state, \
delete_raft_data_and_upgrade_state, log_run_time, wait_until_upgrade_finishes as wait_until_schema_upgrade_finishes, \
@@ -26,11 +26,7 @@ async def wait_for_log_on_any_node(logs: List[ScyllaLogFile], marks: List[int],
Waits until a given line appears on any node in the cluster.
"""
assert len(logs) == len(marks)
async with asyncio.TaskGroup() as tg:
log_watch_tasks = [tg.create_task(l.wait_for(pattern)) for l, m in zip(logs, marks)]
_, pending = await asyncio.wait(log_watch_tasks, return_when=asyncio.FIRST_COMPLETED)
for t in pending:
t.cancel()
await wait_for_first_completed([l.wait_for(pattern) for l, m in zip(logs, marks)])
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
@@ -67,14 +63,14 @@ async def test_topology_upgrade_stuck(request, manager: ManagerClient):
logging.info("Waiting until upgrade gets stuck due to error injection")
logs = await gather_safely(*(manager.server_open_log(s.server_id) for s in servers))
marks = [l.mark() for l in logs]
marks = await gather_safely(*(l.mark() for l in logs))
await wait_for_log_on_any_node(logs, marks, "failed to build topology coordinator state due to error injection")
logging.info("Isolate one of the nodes via error injection")
await manager.api.enable_injection(to_be_isolated_node.ip_addr, "raft_drop_incoming_append_entries", one_shot=False)
logging.info("Disable the error injection that causes upgrade to get stuck")
marks = [l.mark() for l in logs]
marks = await gather_safely(*(l.mark() for l in logs))
await gather_safely(*(manager.api.disable_injection(s.ip_addr, "topology_coordinator_fail_to_build_state_during_upgrade") for s in servers))
logging.info("Wait for the topology coordinator to observe upgrade as finished")

View File

@@ -307,12 +307,23 @@ async def test_alter_base_schema_while_build_in_progress(manager: ManagerClient,
@pytest.mark.asyncio
@skip_mode("release", "error injections are not supported in release mode")
async def test_change_rf_while_build_in_progress(manager: ManagerClient, change: str):
node_count = 4
servers = await manager.servers_add(node_count, config={"rf_rack_valid_keyspaces": "false", "enable_tablets": "true"}, cmdline=cmdline_loggers)
if change == "increase":
node_count = 2
rack_layout = ["rack1", "rack2"]
elif change == "decrease":
node_count = 3
rack_layout = ["rack1", "rack1", "rack2"]
else:
assert False
property_file = [{"dc": "dc1", "rack": rack} for rack in rack_layout]
servers = await manager.servers_add(node_count, config={"enable_tablets": "true"}, cmdline=cmdline_loggers,
property_file=property_file)
cql, _ = await manager.get_ready_cql(servers)
await disable_tablet_load_balancing_on_all_servers(manager)
rf = 3
rf = node_count - 1
async with new_test_keyspace(manager, f"WITH replication = {{'class': 'NetworkTopologyStrategy', 'replication_factor': {rf}}} AND tablets = {{'enabled': true}}") as ks:
await cql.run_async(f"CREATE TABLE {ks}.tab (key int, c int, v text, PRIMARY KEY (key, c))")
await populate_base_table(cql, ks, "tab")
@@ -326,7 +337,7 @@ async def test_change_rf_while_build_in_progress(manager: ManagerClient, change:
await wait_for_some_view_build_tasks_to_get_stuck(manager, marks)
new_rf = rf + 1 if change == "increase" else rf - 1
await cql.run_async(f"ALTER KEYSPACE {ks} WITH replication = {{'class': 'NetworkTopologyStrategy', 'datacenter1': {new_rf}}}")
await cql.run_async(f"ALTER KEYSPACE {ks} WITH replication = {{'class': 'NetworkTopologyStrategy', 'dc1': {new_rf}}}")
await unpause_view_building_tasks(manager)
@@ -337,8 +348,18 @@ async def test_change_rf_while_build_in_progress(manager: ManagerClient, change:
@pytest.mark.asyncio
@skip_mode("release", "error injections are not supported in release mode")
async def test_node_operation_during_view_building(manager: ManagerClient, operation: str):
node_count = 4 if operation == "remove" or operation == "decommission" else 3
servers = await manager.servers_add(node_count, config={"rf_rack_valid_keyspaces": "false", "enable_tablets": "true"}, cmdline=cmdline_loggers)
if operation == "remove" or operation == "decommission":
node_count = 4
rack_layout = ["rack1", "rack2", "rack3", "rack3"]
else:
node_count = 3
rack_layout = ["rack1", "rack2", "rack3"]
property_file = [{"dc": "dc1", "rack": rack} for rack in rack_layout]
servers = await manager.servers_add(node_count, config={"enable_tablets": "true"},
cmdline=cmdline_loggers,
property_file=property_file)
cql, _ = await manager.get_ready_cql(servers)
await disable_tablet_load_balancing_on_all_servers(manager)
@@ -354,7 +375,8 @@ async def test_node_operation_during_view_building(manager: ManagerClient, opera
await wait_for_some_view_build_tasks_to_get_stuck(manager, marks)
if operation == "add":
await manager.server_add(config={"rf_rack_valid_keyspaces": "false", "enable_tablets": "true"}, cmdline=cmdline_loggers)
property_file = servers[-1].property_file()
await manager.server_add(config={"enable_tablets": "true"}, cmdline=cmdline_loggers, property_file=property_file)
node_count = node_count + 1
elif operation == "remove":
await manager.server_stop_gracefully(servers[-1].server_id)
@@ -364,9 +386,11 @@ async def test_node_operation_during_view_building(manager: ManagerClient, opera
await manager.decommission_node(servers[-1].server_id)
node_count = node_count - 1
elif operation == "replace":
property_file = servers[-1].property_file()
await manager.server_stop_gracefully(servers[-1].server_id)
replace_cfg = ReplaceConfig(replaced_id = servers[-1].server_id, reuse_ip_addr = False, use_host_id = True)
await manager.server_add(replace_cfg, config={"rf_rack_valid_keyspaces": "false", "enable_tablets": "true"}, cmdline=cmdline_loggers)
await manager.server_add(replace_cfg, config={"enable_tablets": "true"}, cmdline=cmdline_loggers,
property_file=property_file)
await unpause_view_building_tasks(manager)
await wait_for_view(cql, 'mv_cf_view', node_count)
@@ -770,7 +794,7 @@ async def test_file_streaming(manager: ManagerClient):
# because last token after tablet merge = last token of tablet2 before merge
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
@pytest.mark.xfail(reason="#26244")
@pytest.mark.skip(reason="#26244")
async def test_staging_sstables_with_tablet_merge(manager: ManagerClient):
node_count = 2
servers = await manager.servers_add(node_count, cmdline=cmdline_loggers, property_file=[

View File

@@ -6,6 +6,7 @@
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "auth/common.hh"
#include "auth/standard_role_manager.hh"
#include "auth/ldap_role_manager.hh"
#include "auth/password_authenticator.hh"
@@ -407,7 +408,7 @@ SEASTAR_TEST_CASE(ldap_delegates_query_all) {
auto m = make_ldap_manager(env);
m->start().get();
create_ldap_roles(env, *m);
const auto roles = m->query_all().get();
const auto roles = m->query_all(auth::internal_distributed_query_state()).get();
BOOST_REQUIRE_EQUAL(1, roles.count("role1"));
BOOST_REQUIRE_EQUAL(1, roles.count("role2"));
BOOST_REQUIRE_EQUAL(1, roles.count("jsmith"));
@@ -442,7 +443,7 @@ SEASTAR_TEST_CASE(ldap_delegates_attributes) {
do_with_mc(env, [&] (service::group0_batch& b) {
m->create("r", auth::role_config{}, b).get();
});
BOOST_REQUIRE(!m->get_attribute("r", "a").get());
BOOST_REQUIRE(!m->get_attribute("r", "a", auth::internal_distributed_query_state()).get());
do_with_mc(env, [&] (service::group0_batch& b) {
m->set_attribute("r", "a", "3", b).get();
});
@@ -451,7 +452,7 @@ SEASTAR_TEST_CASE(ldap_delegates_attributes) {
do_with_mc(env, [&] (service::group0_batch& b) {
m->remove_attribute("r", "a", b).get();
});
BOOST_REQUIRE(!m->get_attribute("r", "a").get());
BOOST_REQUIRE(!m->get_attribute("r", "a", auth::internal_distributed_query_state()).get());
});
}

View File

@@ -1098,6 +1098,11 @@ private:
startlog.info("Verifying that all of the keyspaces are RF-rack-valid");
_db.local().check_rf_rack_validity(cfg->rf_rack_valid_keyspaces(), _token_metadata.local().get());
// Materialized views and secondary indexes are still restricted and require specific configuration
// options to work. Make sure that if there are existing views or indexes, they don't violate
// the requirements imposed on them.
_db.local().validate_tablet_views_indexes();
utils::loading_cache_config perm_cache_config;
perm_cache_config.max_size = cfg->permissions_cache_max_entries();
perm_cache_config.expiry = std::chrono::milliseconds(cfg->permissions_validity_in_ms());

View File

@@ -176,8 +176,8 @@ def test_repair_keyspace_tablets_with_colocated_table(nodetool):
"table": "mv",
"tokens": "all"},
response={"message": """
Cannot set repair request on table ac77e950-9303-11f0-a718-c65bb2c481c4 because it is colocated with the
base table a6044cd0-9303-11f0-a718-c65bb2c481c4. Repair requests can be made only on the base table.
Cannot set repair request on table 'ks'.'mv' because it is colocated with the
base table 'ks'.'table1'. Repair requests can be made only on the base table.
Repairing the base table will also repair all tables colocated with it."""
, "code": 400}, response_status=400),
expected_request(
@@ -424,7 +424,7 @@ def test_repair_keyspace(nodetool):
]},
["error processing arguments: nodetool cluster repair repairs only tablet keyspaces. To repair vnode keyspaces use nodetool repair."])
@pytest.mark.parametrize("mode", ["disabled", "regular", "full"])
@pytest.mark.parametrize("mode", ["disabled", "incremental", "full"])
def test_repair_incremenatal_repair(nodetool, mode):
id1 = "ef1b7a61-66c8-494c-bb03-6f65724e6eee"
res = nodetool("cluster", "repair", "--incremental-mode", mode, "ks", "table1", expected_requests=[

View File

@@ -269,6 +269,7 @@ def setup_cgroup(is_required: bool) -> None:
if _is_cgroup_rw() and is_docker:
subprocess.run(
[
"sudo",
"mount",
"-o",
"remount,rw",
@@ -278,8 +279,8 @@ def setup_cgroup(is_required: bool) -> None:
)
if is_podman or is_docker:
subprocess.run(['chown', '-R', f"{getpass.getuser()}:{getpass.getuser()}", '/sys/fs/cgroup'],
check=True)
cmd = ["chown", "-R", f"{getpass.getuser()}:{getpass.getuser()}", '/sys/fs/cgroup']
subprocess.run(["sudo"] + cmd if is_docker else cmd,check=True)
configured = False
for directory in [CGROUP_INITIAL, CGROUP_TESTS]:

View File

@@ -261,12 +261,31 @@ async def wait_for_view(cql: Session, name: str, node_count: int, timeout: int =
await wait_for(view_is_built, deadline)
async def wait_for_first_completed(coros: list[Coroutine]):
done, pending = await asyncio.wait([asyncio.create_task(c) for c in coros], return_when=asyncio.FIRST_COMPLETED)
for t in pending:
t.cancel()
for t in done:
await t
async def wait_for_first_completed(coros: list[Coroutine], timeout: int|None = None):
tasks = [asyncio.create_task(c) for c in coros]
done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED, timeout=timeout)
if not done:
# Timeout occurred, cancel all
for task in tasks:
task.cancel()
await asyncio.gather(*tasks, return_exceptions=True)
raise asyncio.TimeoutError("No task completed within timeout")
# Cancel pending tasks
for task in pending:
task.cancel()
# Get first result
list_done = list(done)
first_task = list_done.pop(0)
result = await first_task
# Clean up
cleanup = list(pending) + list_done
if cleanup:
await asyncio.gather(*cleanup, return_exceptions=True)
return result
def ninja(target: str) -> str:
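A hedged usage sketch of the reworked helper above (illustrative only), combined with the log.wait_for calls seen earlier in this change. The helper now cancels the still-pending tasks and gathers everything with return_exceptions=True, so no task is left un-awaited, and the optional timeout turns "nothing completed" into an asyncio.TimeoutError:

try:
    await wait_for_first_completed(
        [log.wait_for('Started to repair', from_mark=mark) for log, mark in zip(logs, marks)],
        timeout=60)
except asyncio.TimeoutError:
    pytest.fail("no node reported 'Started to repair' within 60 seconds")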

View File

@@ -30,6 +30,7 @@
#include <seastar/util/short_streams.hh>
#include <seastar/net/tcp.hh>
#include <variant>
#include <vector>
namespace {
@@ -87,12 +88,18 @@ auto repeat_until(milliseconds timeout, std::function<future<bool>()> func) -> f
co_return true;
}
constexpr auto STANDARD_WAIT = std::chrono::seconds(5);
constexpr auto STANDARD_WAIT = std::chrono::seconds(10);
struct abort_source_timeout {
auto repeat_until(std::function<future<bool>()> func) -> future<bool> {
return repeat_until(STANDARD_WAIT, std::move(func));
}
class abort_source_timeout {
abort_source as;
timer<> t;
public:
explicit abort_source_timeout(milliseconds timeout = STANDARD_WAIT)
: t(timer([&]() {
as.request_abort();
@@ -100,10 +107,11 @@ struct abort_source_timeout {
t.arm(timeout);
}
void reset(milliseconds timeout = STANDARD_WAIT) {
abort_source& reset(milliseconds timeout = STANDARD_WAIT) {
t.cancel();
as = abort_source();
t.arm(timeout);
return as;
}
};
@@ -283,12 +291,22 @@ auto make_unavailable_server(uint16_t port = 0) -> future<std::unique_ptr<unavai
class vs_mock_server {
public:
struct ann_req {
sstring path;
sstring body;
};
struct ann_resp {
status_type status;
sstring body;
};
explicit vs_mock_server(uint16_t port)
: _port(port) {
}
explicit vs_mock_server(status_type status)
: _status(status) {
explicit vs_mock_server(ann_resp next_ann_response)
: _next_ann_response(std::move(next_ann_response)) {
}
vs_mock_server() = default;
@@ -309,8 +327,12 @@ public:
return _port;
}
size_t requests() const {
return _requests;
const std::vector<ann_req>& requests() const {
return _ann_requests;
}
void next_ann_response(ann_resp response) {
_next_ann_response = std::move(response);
}
private:
@@ -321,7 +343,7 @@ private:
auto ann = [this](std::unique_ptr<request> req, std::unique_ptr<reply> rep) -> future<std::unique_ptr<reply>> {
return handle_request(std::move(req), std::move(rep));
};
r.add(operation_type::POST, url("/api/v1/indexes/ks/idx").remainder("ann"), new function_handler(ann, "json"));
r.add(operation_type::POST, url(INDEXES_PATH).remainder("path"), new function_handler(ann, "json"));
},
host.c_str(), _port);
_http_server = std::move(s);
@@ -331,17 +353,19 @@ private:
}
future<std::unique_ptr<reply>> handle_request(std::unique_ptr<request> req, std::unique_ptr<reply> rep) {
_requests++;
rep->write_body("json", CORRECT_RESPONSE_FOR_TEST_TABLE);
rep->set_status(_status);
ann_req r{.path = INDEXES_PATH + "/" + req->get_path_param("path"), .body = co_await util::read_entire_stream_contiguous(*req->content_stream)};
_ann_requests.push_back(std::move(r));
rep->set_status(_next_ann_response.status);
rep->write_body("json", _next_ann_response.body);
co_return rep;
}
uint16_t _port = 0;
sstring _host;
size_t _requests = 0;
std::unique_ptr<http_server> _http_server;
status_type _status = status_type::ok;
std::vector<ann_req> _ann_requests;
ann_resp _next_ann_response{status_type::ok, CORRECT_RESPONSE_FOR_TEST_TABLE};
const sstring INDEXES_PATH = "/api/v1/indexes";
};
template <typename... Args>
@@ -375,11 +399,11 @@ SEASTAR_TEST_CASE(vector_store_client_test_dns_started) {
cfg.vector_store_primary_uri.set("http://good.authority.here:6080");
auto vs = vector_store_client{cfg};
configure(vs).with_dns({{"good.authority.here", "127.0.0.1"}});
auto as = abort_source_timeout();
vs.start_background_tasks();
auto as = abort_source_timeout();
auto addr = co_await vector_store_client_tester::resolve_hostname(vs, as.as);
auto addr = co_await vector_store_client_tester::resolve_hostname(vs, as.reset());
BOOST_REQUIRE(!addr.empty());
BOOST_CHECK_EQUAL(print_addr(addr[0]), "127.0.0.1");
@@ -390,13 +414,13 @@ SEASTAR_TEST_CASE(vector_store_client_test_dns_started) {
SEASTAR_TEST_CASE(vector_store_client_test_dns_resolve_failure) {
auto cfg = config();
cfg.vector_store_primary_uri.set("http://good.authority.here:6080");
auto as = abort_source_timeout();
auto vs = vector_store_client{cfg};
configure(vs).with_dns({{"good.authority.here", std::nullopt}});
vs.start_background_tasks();
auto as = abort_source_timeout();
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.as);
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.reset());
BOOST_CHECK(addrs.empty());
@@ -409,6 +433,7 @@ SEASTAR_TEST_CASE(vector_store_client_test_dns_resolving_repeated) {
cfg.vector_store_primary_uri.set("http://good.authority.here:6080");
auto vs = vector_store_client{cfg};
auto count = 0;
auto as = abort_source_timeout();
configure(vs)
.with_dns_refresh_interval(milliseconds(10))
.with_wait_for_client_timeout(milliseconds(20))
@@ -423,33 +448,28 @@ SEASTAR_TEST_CASE(vector_store_client_test_dns_resolving_repeated) {
vs.start_background_tasks();
auto as = abort_source_timeout();
BOOST_CHECK(co_await repeat_until(seconds(1), [&vs, &as]() -> future<bool> {
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.as);
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.reset());
co_return addrs.size() == 1;
}));
BOOST_CHECK_EQUAL(count, 3);
as.reset();
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.as);
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.reset());
BOOST_REQUIRE_EQUAL(addrs.size(), 1);
BOOST_CHECK_EQUAL(print_addr(addrs[0]), "127.0.0.3");
vector_store_client_tester::trigger_dns_resolver(vs);
BOOST_CHECK(co_await repeat_until(seconds(1), [&vs, &as]() -> future<bool> {
as.reset();
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.as);
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.reset());
co_return addrs.empty();
}));
BOOST_CHECK(co_await repeat_until(seconds(1), [&vs, &as]() -> future<bool> {
as.reset();
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.as);
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.reset());
co_return addrs.size() == 1;
}));
BOOST_CHECK_EQUAL(count, 6);
as.reset();
addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.as);
addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.reset());
BOOST_REQUIRE_EQUAL(addrs.size(), 1);
BOOST_CHECK_EQUAL(print_addr(addrs[0]), "127.0.0.6");
@@ -463,6 +483,7 @@ SEASTAR_TEST_CASE(vector_store_client_test_dns_refresh_respects_interval) {
cfg.vector_store_primary_uri.set("http://good.authority.here:6080");
auto vs = vector_store_client{cfg};
auto count = 0;
auto as = abort_source_timeout();
configure(vs).with_dns_refresh_interval(milliseconds(10)).with_dns_resolver([&count](auto const& host) -> future<std::optional<inet_address>> {
BOOST_CHECK_EQUAL(host, "good.authority.here");
count++;
@@ -472,8 +493,7 @@ SEASTAR_TEST_CASE(vector_store_client_test_dns_refresh_respects_interval) {
vs.start_background_tasks();
co_await sleep(milliseconds(20)); // wait for the first DNS refresh
auto as = abort_source_timeout();
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.as);
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.reset());
BOOST_REQUIRE_EQUAL(addrs.size(), 1);
BOOST_CHECK_EQUAL(print_addr(addrs[0]), "127.0.0.1");
BOOST_CHECK_EQUAL(count, 1);
@@ -485,8 +505,7 @@ SEASTAR_TEST_CASE(vector_store_client_test_dns_refresh_respects_interval) {
vector_store_client_tester::trigger_dns_resolver(vs);
co_await sleep(milliseconds(100)); // wait for the next DNS refresh
as.reset();
addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.as);
addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.reset());
BOOST_REQUIRE_EQUAL(addrs.size(), 1);
BOOST_CHECK_EQUAL(print_addr(addrs[0]), "127.0.0.1");
BOOST_CHECK_GE(count, 1);
@@ -499,6 +518,7 @@ SEASTAR_TEST_CASE(vector_store_client_test_dns_refresh_respects_interval) {
SEASTAR_TEST_CASE(vector_store_client_test_dns_refresh_aborted) {
auto cfg = config();
cfg.vector_store_primary_uri.set("http://good.authority.here:6080");
auto as = abort_source_timeout(milliseconds(10));
auto vs = vector_store_client{cfg};
configure(vs).with_dns_refresh_interval(milliseconds(10)).with_dns_resolver([&](auto const& host) -> future<std::optional<inet_address>> {
BOOST_CHECK_EQUAL(host, "good.authority.here");
@@ -508,8 +528,7 @@ SEASTAR_TEST_CASE(vector_store_client_test_dns_refresh_aborted) {
vs.start_background_tasks();
auto as = abort_source_timeout(milliseconds(10));
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.as);
auto addrs = co_await vector_store_client_tester::resolve_hostname(vs, as.reset());
BOOST_CHECK(addrs.empty());
co_await vs.stop();
@@ -517,11 +536,11 @@ SEASTAR_TEST_CASE(vector_store_client_test_dns_refresh_aborted) {
SEASTAR_TEST_CASE(vector_store_client_ann_test_disabled) {
co_await do_with_cql_env([](cql_test_env& env) -> future<> {
auto as = abort_source_timeout();
auto schema = co_await create_test_table(env, "ks", "vs");
auto& vs = env.local_qp().vector_store_client();
auto as = abort_source_timeout();
auto keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
auto keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
BOOST_REQUIRE(!keys);
BOOST_CHECK(std::holds_alternative<vector_store_client::disabled>(keys.error()));
});
@@ -533,13 +552,13 @@ SEASTAR_TEST_CASE(vector_store_client_test_ann_addr_unavailable) {
co_await do_with_cql_env(
[](cql_test_env& env) -> future<> {
auto schema = co_await create_test_table(env, "ks", "vs");
auto as = abort_source_timeout();
auto& vs = env.local_qp().vector_store_client();
configure(vs).with_dns_refresh_interval(seconds(1)).with_dns({{"bad.authority.here", std::nullopt}});
vs.start_background_tasks();
auto as = abort_source_timeout();
auto keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
auto keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
BOOST_REQUIRE(!keys);
BOOST_CHECK(std::holds_alternative<vector_store_client::addr_unavailable>(keys.error()));
},
@@ -553,13 +572,13 @@ SEASTAR_TEST_CASE(vector_store_client_test_ann_service_unavailable) {
co_await do_with_cql_env(
[&server](cql_test_env& env) -> future<> {
auto schema = co_await create_test_table(env, "ks", "vs");
auto as = abort_source_timeout();
auto& vs = env.local_qp().vector_store_client();
configure(vs).with_dns_refresh_interval(seconds(1)).with_dns({{"good.authority.here", server->host()}});
vs.start_background_tasks();
auto as = abort_source_timeout();
auto keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
auto keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
BOOST_REQUIRE(!keys);
BOOST_CHECK(std::holds_alternative<vector_store_client::service_unavailable>(keys.error()));
},
@@ -576,6 +595,7 @@ SEASTAR_TEST_CASE(vector_store_client_test_ann_service_aborted) {
co_await do_with_cql_env(
[&server](cql_test_env& env) -> future<> {
auto schema = co_await create_test_table(env, "ks", "vs");
auto as = abort_source_timeout();
auto& vs = env.local_qp().vector_store_client();
configure(vs).with_dns_refresh_interval(milliseconds(10)).with_dns_resolver([&server](auto const& host) -> future<std::optional<inet_address>> {
BOOST_CHECK_EQUAL(host, "good.authority.here");
@@ -585,8 +605,8 @@ SEASTAR_TEST_CASE(vector_store_client_test_ann_service_aborted) {
vs.start_background_tasks();
auto as = abort_source_timeout(milliseconds(10));
auto keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
auto keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset(milliseconds(10)));
BOOST_REQUIRE(!keys);
BOOST_CHECK(std::holds_alternative<vector_store_client::aborted>(keys.error()));
},
@@ -596,106 +616,72 @@ SEASTAR_TEST_CASE(vector_store_client_test_ann_service_aborted) {
});
}
SEASTAR_TEST_CASE(vector_store_client_test_ann_request) {
auto ann_replies = make_lw_shared<std::queue<std::tuple<sstring, sstring>>>();
auto [server, addr] = co_await new_http_server([ann_replies](routes& r) {
auto ann = [ann_replies](std::unique_ptr<request> req, std::unique_ptr<reply> rep) -> future<std::unique_ptr<reply>> {
BOOST_REQUIRE(!ann_replies->empty());
auto [req_exp, rep_inp] = ann_replies->front();
auto const req_inp = co_await util::read_entire_stream_contiguous(*req->content_stream);
BOOST_CHECK_EQUAL(req_inp, req_exp);
ann_replies->pop();
rep->set_status(status_type::ok);
rep->write_body("json", rep_inp);
co_return rep;
};
r.add(operation_type::POST, url("/api/v1/indexes/ks/idx").remainder("ann"), new function_handler(ann, "json"));
});
auto server = co_await make_vs_mock_server();
auto cfg = cql_test_config();
cfg.db_config->vector_store_primary_uri.set(format("http://good.authority.here:{}", addr.port()));
cfg.db_config->vector_store_primary_uri.set(format("http://good.authority.here:{}", server->port()));
co_await do_with_cql_env(
[&ann_replies](cql_test_env& env) -> future<> {
[&server](cql_test_env& env) -> future<> {
auto schema = co_await create_test_table(env, "ks", "idx");
auto as = abort_source_timeout();
auto& vs = env.local_qp().vector_store_client();
configure(vs).with_dns_refresh_interval(seconds(1)).with_dns({{"good.authority.here", "127.0.0.1"}});
vs.start_background_tasks();
// set the wrong idx (wrong endpoint) - service should return 404
auto as = abort_source_timeout();
auto keys = co_await vs.ann("ks", "idx2", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
// server responds with 404 - client should return service_error
server->next_ann_response({status_type::not_found, "idx2 not found"});
auto keys = co_await vs.ann("ks", "idx2", schema, std::vector<float>{0.3, 0.2, 0.1}, 1, as.reset());
BOOST_REQUIRE(!server->requests().empty());
BOOST_REQUIRE_EQUAL(server->requests().back().body, R"({"vector":[0.3,0.2,0.1],"limit":1})");
BOOST_REQUIRE_EQUAL(server->requests().back().path, "/api/v1/indexes/ks/idx2/ann");
BOOST_REQUIRE(!keys);
auto* err = std::get_if<vector_store_client::service_error>(&keys.error());
BOOST_CHECK(err != nullptr);
BOOST_CHECK_EQUAL(err->status, status_type::not_found);
// missing primary_keys in the reply - service should return format error
ann_replies->emplace(std::make_tuple(R"({"vector":[0.1,0.2,0.3],"limit":2})",
R"({"primary_keys1":{"pk1":[5,6],"pk2":[7,8],"ck1":[9,1],"ck2":[2,3]},"distances":[0.1,0.2]})"));
auto const now = lowres_clock::now();
for (;;) {
as.reset();
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
BOOST_REQUIRE(lowres_clock::now() - now < STANDARD_WAIT);
BOOST_REQUIRE(!keys);
// if the service is unavailable or 400, retry, seems http server is not ready yet
auto* const unavailable = std::get_if<vector_store_client::service_unavailable>(&keys.error());
auto* const service_error = std::get_if<vector_store_client::service_error>(&keys.error());
if ((unavailable == nullptr && service_error == nullptr) ||
(service_error != nullptr && service_error->status != status_type::bad_request)) {
break;
}
}
server->next_ann_response({status_type::ok, R"({"primary_keys1":{"pk1":[5,6],"pk2":[7,8],"ck1":[9,1],"ck2":[2,3]},"distances":[0.1,0.2]})"});
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
BOOST_REQUIRE(!server->requests().empty());
BOOST_REQUIRE_EQUAL(server->requests().back().body, R"({"vector":[0.1,0.2,0.3],"limit":2})");
BOOST_REQUIRE_EQUAL(server->requests().back().path, "/api/v1/indexes/ks/idx/ann");
BOOST_REQUIRE(!keys);
BOOST_CHECK(std::holds_alternative<vector_store_client::service_reply_format_error>(keys.error()));
// missing distances in the reply - service should return format error
ann_replies->emplace(std::make_tuple(R"({"vector":[0.1,0.2,0.3],"limit":2})",
R"({"primary_keys":{"pk1":[5,6],"pk2":[7,8],"ck1":[9,1],"ck2":[2,3]},"distances1":[0.1,0.2]})"));
as.reset();
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
server->next_ann_response({status_type::ok, R"({"primary_keys":{"pk1":[5,6],"pk2":[7,8],"ck1":[9,1],"ck2":[2,3]},"distances1":[0.1,0.2]})"});
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
BOOST_REQUIRE(!keys);
BOOST_CHECK(std::holds_alternative<vector_store_client::service_reply_format_error>(keys.error()));
// missing pk1 key in the reply - service should return format error
ann_replies->emplace(std::make_tuple(R"({"vector":[0.1,0.2,0.3],"limit":2})",
R"({"primary_keys":{"pk11":[5,6],"pk2":[7,8],"ck1":[9,1],"ck2":[2,3]},"distances":[0.1,0.2]})"));
as.reset();
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
server->next_ann_response({status_type::ok, R"({"primary_keys":{"pk11":[5,6],"pk2":[7,8],"ck1":[9,1],"ck2":[2,3]},"distances":[0.1,0.2]})"});
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
BOOST_REQUIRE(!keys);
BOOST_CHECK(std::holds_alternative<vector_store_client::service_reply_format_error>(keys.error()));
// missing ck1 key in the reply - service should return format error
ann_replies->emplace(std::make_tuple(R"({"vector":[0.1,0.2,0.3],"limit":2})",
R"({"primary_keys":{"pk1":[5,6],"pk2":[7,8],"ck11":[9,1],"ck2":[2,3]},"distances":[0.1,0.2]})"));
as.reset();
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
server->next_ann_response({status_type::ok, R"({"primary_keys":{"pk1":[5,6],"pk2":[7,8],"ck11":[9,1],"ck2":[2,3]},"distances":[0.1,0.2]})"});
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
BOOST_REQUIRE(!keys);
BOOST_CHECK(std::holds_alternative<vector_store_client::service_reply_format_error>(keys.error()));
// wrong size of pk2 key in the reply - service should return format error
ann_replies->emplace(std::make_tuple(
R"({"vector":[0.1,0.2,0.3],"limit":2})", R"({"primary_keys":{"pk1":[5,6],"pk2":[78],"ck1":[9,1],"ck2":[2,3]},"distances":[0.1,0.2]})"));
as.reset();
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
server->next_ann_response({status_type::ok, R"({"primary_keys":{"pk1":[5,6],"pk2":[7],"ck1":[9,1],"ck2":[2,3]},"distances":[0.1,0.2]})"});
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
BOOST_REQUIRE(!keys);
BOOST_CHECK(std::holds_alternative<vector_store_client::service_reply_format_error>(keys.error()));
// wrong size of ck2 key in the reply - service should return format error
ann_replies->emplace(std::make_tuple(
R"({"vector":[0.1,0.2,0.3],"limit":2})", R"({"primary_keys":{"pk1":[5,6],"pk2":[7,8],"ck1":[9,1],"ck2":[23]},"distances":[0.1,0.2]})"));
as.reset();
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
server->next_ann_response({status_type::ok, R"({"primary_keys":{"pk1":[5,6],"pk2":[7,8],"ck1":[9,1],"ck2":[2]},"distances":[0.1,0.2]})"});
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
BOOST_REQUIRE(!keys);
BOOST_CHECK(std::holds_alternative<vector_store_client::service_reply_format_error>(keys.error()));
// correct reply - service should return keys
ann_replies->emplace(std::make_tuple(R"({"vector":[0.1,0.2,0.3],"limit":2})",
R"({"primary_keys":{"pk1":[5,6],"pk2":[7,8],"ck1":[9,1],"ck2":[2,3]},"distances":[0.1,0.2]})"));
as.reset();
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
server->next_ann_response({status_type::ok, CORRECT_RESPONSE_FOR_TEST_TABLE});
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
BOOST_REQUIRE(keys);
BOOST_REQUIRE_EQUAL(keys->size(), 2);
BOOST_CHECK_EQUAL(seastar::format("{}", keys->at(0).partition.key().explode()), "[05, 07]");
@@ -722,7 +708,7 @@ SEASTAR_TEST_CASE(vector_store_client_uri_update_to_empty) {
vs.start_background_tasks();
// Wait for initial DNS resolution
BOOST_CHECK(co_await repeat_until(std::chrono::seconds(5), [&]() -> future<bool> {
BOOST_CHECK(co_await repeat_until([&]() -> future<bool> {
co_return count > 0;
}));
@@ -763,6 +749,7 @@ SEASTAR_TEST_CASE(vector_store_client_uri_update_to_invalid) {
auto cfg = config();
cfg.vector_store_primary_uri.set("http://good.authority.here:6080");
auto vs = vector_store_client{cfg};
configure{vs};
vs.start_background_tasks();
@@ -777,8 +764,8 @@ SEASTAR_TEST_CASE(vector_store_client_uri_update) {
// Test verifies that when the vector store URI is updated, the client
// will switch to the new URI within the DNS refresh interval.
// To avoid a race condition we wait twice as long as the DNS refresh interval before checking the result.
auto s1 = co_await make_vs_mock_server(status_type::not_found);
auto s2 = co_await make_vs_mock_server(status_type::service_unavailable);
auto s1 = co_await make_vs_mock_server(vs_mock_server::ann_resp(status_type::not_found, "Not found"));
auto s2 = co_await make_vs_mock_server(vs_mock_server::ann_resp(status_type::service_unavailable, "Service unavailable"));
constexpr auto is_s2_response = [](const auto& keys) -> bool {
return !keys && std::holds_alternative<vector_store_client::service_error>(keys.error()) &&
@@ -801,8 +788,7 @@ SEASTAR_TEST_CASE(vector_store_client_uri_update) {
// Wait until requests are handled by s2
BOOST_CHECK(co_await repeat_until(DNS_REFRESH_INTERVAL * 2, [&]() -> future<bool> {
as.reset();
co_return is_s2_response(co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as));
co_return is_s2_response(co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset()));
}));
},
cfg)
@@ -830,8 +816,8 @@ SEASTAR_TEST_CASE(vector_store_client_multiple_ips_high_availability) {
// Because requests are distributed in random order due to load balancing,
// repeat the ANN query until the unavailable server is queried.
BOOST_CHECK(co_await repeat_until(std::chrono::seconds(10), [&]() -> future<bool> {
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
BOOST_CHECK(co_await repeat_until([&]() -> future<bool> {
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
co_return unavail_s->connections() > 1;
}));
@@ -864,9 +850,9 @@ SEASTAR_TEST_CASE(vector_store_client_multiple_ips_load_balancing) {
// Wait until requests are handled by both servers.
// The load balancing algorithm is random, so we send requests in a loop
// until both servers have received at least one, verifying that load is distributed.
BOOST_CHECK(co_await repeat_until(std::chrono::seconds(10), [&]() -> future<bool> {
co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
co_return s1->requests() > 0 && s2->requests() > 0;
BOOST_CHECK(co_await repeat_until([&]() -> future<bool> {
co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
co_return !s1->requests().empty() && !s2->requests().empty();
}));
},
cfg)
@@ -894,8 +880,8 @@ SEASTAR_TEST_CASE(vector_store_client_multiple_uris_high_availability) {
// Because requests are distributed in random order due to load balancing,
// repeat the ANN query until the unavailable server is queried.
BOOST_CHECK(co_await repeat_until(std::chrono::seconds(10), [&]() -> future<bool> {
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
BOOST_CHECK(co_await repeat_until([&]() -> future<bool> {
keys = co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
co_return unavail_s->connections() > 1;
}));
@@ -928,9 +914,9 @@ SEASTAR_TEST_CASE(vector_store_client_multiple_uris_load_balancing) {
// Wait until requests are handled by both servers.
// The load balancing algorithm is random, so we send requests in a loop
// until both servers have received at least one, verifying that load is distributed.
BOOST_CHECK(co_await repeat_until(std::chrono::seconds(10), [&]() -> future<bool> {
co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.as);
co_return s1->requests() > 0 && s2->requests() > 0;
BOOST_CHECK(co_await repeat_until([&]() -> future<bool> {
co_await vs.ann("ks", "idx", schema, std::vector<float>{0.1, 0.2, 0.3}, 2, as.reset());
co_return !s1->requests().empty() && !s2->requests().empty();
}));
},
cfg)
@@ -941,28 +927,23 @@ SEASTAR_TEST_CASE(vector_store_client_multiple_uris_load_balancing) {
}
SEASTAR_TEST_CASE(vector_search_metrics_test) {
auto cfg = cql_test_config();
cfg.db_config->vector_store_primary_uri.set("http://good.authority.here:6080");
co_await do_with_cql_env(
[](cql_test_env& env) -> future<> {
auto as = abort_source();
auto as = abort_source_timeout();
auto schema = co_await create_test_table(env, "ks", "test");
auto result = co_await env.execute_cql("CREATE CUSTOM INDEX idx ON ks.test (embedding) USING 'vector_index'");
result.get()->throw_if_exception();
auto& vs = env.local_qp().vector_store_client();
configure{vs};
vs.start_background_tasks();
auto all_metrics = seastar::metrics::impl::get_values();
auto dns = get_metrics_value("vector_store_dns_refreshes", all_metrics)->i();
dns++;
vector_store_client_tester::trigger_dns_resolver(vs);
BOOST_CHECK(co_await repeat_until(seconds(1), [&dns]() -> future<bool> {
co_await sleep(milliseconds(10));
auto all_metrics = seastar::metrics::impl::get_values();
auto new_dns = get_metrics_value("vector_store_dns_refreshes", all_metrics)->i();
co_return dns == new_dns;
}));
co_await vector_store_client_tester::resolve_hostname(vs, as.reset());
auto metrics = seastar::metrics::impl::get_values();
BOOST_CHECK_EQUAL(get_metrics_value("vector_store_dns_refreshes", metrics)->i(), 1);
},
cfg);
}

View File

@@ -563,9 +563,9 @@ void cluster_repair_operation(scylla_rest_client& client, const bpo::variables_m
if (vm.contains("incremental-mode")) {
auto mode = vm["incremental-mode"].as<sstring>();
const std::unordered_set<sstring> supported_mode{"disabled", "regular", "full"};
const std::unordered_set<sstring> supported_mode{"disabled", "incremental", "full"};
if (!supported_mode.contains(mode)) {
throw std::invalid_argument("nodetool cluster repair --incremental-mode only supports: disabled, regular, full");
throw std::invalid_argument("nodetool cluster repair --incremental-mode only supports: disabled, incremental, full");
}
repair_params["incremental_mode"] = mode;
}
@@ -3624,7 +3624,7 @@ const std::map<operation, operation_action>& get_operations_with_func() {
"backup",
"copy SSTables from a specified keyspace's snapshot to a designated bucket in object storage",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/backup.html")),
{
typed_option<sstring>("keyspace", "Name of a keyspace to copy SSTables from"),
@@ -3723,13 +3723,13 @@ nodetool cluster repair on one node only.
Note that nodetool cluster repair repairs only tablet keyspaces.
To repair vnode keyspaces use nodetool repair.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/cluster/repair.html")),
{
typed_option<std::vector<sstring>>("in-dc", "Constrain repair to specific datacenter(s)"),
typed_option<std::vector<sstring>>("in-hosts", "Constrain repair to the specific host(s)"),
typed_option<std::vector<sstring>>("tablet-tokens", "Tokens owned by the tablets to repair."),
typed_option<sstring>("incremental-mode", "Specify the incremental repair mode: disabled, regular, full"),
typed_option<sstring>("incremental-mode", "Specify the incremental repair mode: disabled, incremental, full"),
},
{
typed_option<sstring>("keyspace", "The keyspace to repair, if missing all keyspaces are repaired", 1),
@@ -3887,7 +3887,7 @@ For more information, see: {}
"disablebinary",
"Disable the CQL native protocol",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/disablebinary.html")),
},
{
@@ -3899,7 +3899,7 @@ For more information, see: {}"
"disablegossip",
"Disable the gossip protocol",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/disablegossip.html")),
},
{
@@ -3918,7 +3918,7 @@ before upgrading a node to a new version or before any maintenance action is
performed. When you want to simply flush memtables to disk, use the nodetool
flush command.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/drain.html")),
},
{
@@ -3930,7 +3930,7 @@ For more information, see: {}"
"enableautocompaction",
"Enables automatic compaction for the given keyspace and table(s)",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/enableautocompaction.html")),
{ },
{
@@ -3947,7 +3947,7 @@ For more information, see: {}"
"enablebackup",
"Enables incremental backup",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/enablebackup.html")),
},
{
@@ -3961,7 +3961,7 @@ For more information, see: {}"
fmt::format(R"(
The native protocol is enabled by default.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/enablebinary.html")),
},
{
@@ -3975,7 +3975,7 @@ For more information, see: {}"
fmt::format(R"(
The gossip protocol is enabled by default.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/enablegossip.html")),
},
{
@@ -3991,7 +3991,7 @@ Flush memtables to on-disk SSTables in the specified keyspace and table(s).
If no keyspace is specified, all keyspaces are flushed.
If no table(s) are specified, all tables in the specified keyspace are flushed.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/flush.html")),
{ },
{
@@ -4008,7 +4008,7 @@ For more information, see: {}"
"getendpoints",
"Print the end points that owns the key",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/getendpoints.html")),
{ },
{
@@ -4038,7 +4038,7 @@ Prints a table with the name and current logging level for each logger in Scylla
"getsstables",
"Get the sstables that contain the given key",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/getsstables.html")),
{
typed_option<>("hex-format", "The key is given in hex dump format"),
@@ -4060,7 +4060,7 @@ For more information, see: {}"
fmt::format(R"(
This value is the probability for tracing a request. To change this value see settraceprobability.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/gettraceprobability.html")),
},
{
@@ -4072,7 +4072,7 @@ For more information, see: {}"
"gossipinfo",
"Shows the gossip information for the cluster",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/gossipinfo.html")),
},
{
@@ -4098,7 +4098,7 @@ For more information, see: {}"
"info",
"Print node information (uptime, load, ...)",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/info.html")),
{
typed_option<>("tokens,T", "Display all tokens"),
@@ -4115,7 +4115,7 @@ For more information, see: {}"
fmt::format(R"(
Dropped tables (column family) will not be part of the listsnapshots.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/listsnapshots.html")),
{ },
{ },
@@ -4145,7 +4145,7 @@ This operation is not supported.
"netstats",
"Print network information on provided host (connecting node by default)",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/netstats.html")),
{
typed_option<>("human-readable,H", "Display bytes in human readable form, i.e. KiB, MiB, GiB, TiB"),
@@ -4164,7 +4164,7 @@ fmt::format(R"(
Provide the latency request that is recorded by the coordinator.
This command is helpful if you encounter slow node operations.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/proxyhistograms.html")),
{ },
{
@@ -4186,7 +4186,7 @@ Scylla first figures out which ranges the local node (the one we want to rebuild
is responsible for. Then which node in the cluster contains the same ranges.
Finally, Scylla streams the data to the local node.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/rebuild.html")),
{
typed_option<>("force", "Enforce the source_dc option, even if it unsafe to use for rebuild"),
@@ -4210,7 +4210,7 @@ Materialized Views (MV) and Secondary Indexes (SI) of the upload table, and if
they exist, they are automatically updated. Uploading MV or SI SSTables is not
required and will fail.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/refresh.html")),
{
typed_option<>("load-and-stream", "Allows loading sstables that do not belong to this node, in which case they are automatically streamed to the owning nodes"),
@@ -4238,7 +4238,7 @@ Provide the Host ID of the node to specify which node you want to remove.
Important: use this command *only* on nodes that are not reachable by other nodes
by any means!
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/removenode.html")),
{
typed_option<sstring>("ignore-dead-nodes", "Comma-separated list of dead node host IDs to ignore during removenode"),
@@ -4268,7 +4268,7 @@ all of the nodes in the cluster, or let ScyllaDB Manager do it for you.
Note that nodetool repair repairs only vnode keyspaces. To repair tablet
keyspaces use nodetool cluster repair.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/repair.html")),
{
typed_option<>("dc-parallel", "Repair datacenters in parallel"),
@@ -4313,7 +4313,7 @@ For more information, see: https://opensource.docs.scylladb.com/stable/operating
"restore",
"Copy SSTables from a designated bucket in object store to a specified keyspace or table",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/restore.html")),
{
typed_option<sstring>("endpoint", "ID of the configured object storage endpoint to copy SSTables from"),
@@ -4336,7 +4336,7 @@ For more information, see: {}"
"ring",
"Print information about the token ring",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/ring.html")),
{
typed_option<>("resolve-ip,r", "Show node domain names instead of IPs")
@@ -4355,7 +4355,7 @@ For more information, see: {}"
"scrub",
"Scrub the SSTable files in the specified keyspace or table(s)",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/scrub.html")),
{
typed_option<>("no-snapshot", "Do not take a snapshot of scrubbed tables before starting scrub (default false)"),
@@ -4382,7 +4382,7 @@ For more information, see: {}"
fmt::format(R"(
Resetting the log level of one or all loggers is not supported yet.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/setlogginglevel.html")),
{ },
{
@@ -4403,7 +4403,7 @@ Value is trace probability between 0 and 1. 0 the trace will never happen and 1
the trace will always happen. Anything in between is a percentage of the time,
converted into a decimal. For example, 60% would be 0.6.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/settraceprobability.html")),
{ },
{
@@ -4419,7 +4419,7 @@ For more information, see: {}"
"snapshot",
"Take a snapshot of specified keyspaces or a snapshot of the specified table",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/snapshot.html")),
{
typed_option<sstring>("table", "The table(s) to snapshot, multiple ones can be joined with ','"),
@@ -4440,7 +4440,7 @@ For more information, see: {}"
"sstableinfo",
"Information about sstables per keyspace/table",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/sstableinfo.html")),
{ },
{
@@ -4457,7 +4457,7 @@ For more information, see: {}"
"status",
"Displays cluster information for a table in a keyspace, a single keyspace or all keyspaces",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/status.html")),
{
typed_option<>("resolve-ip,r", "Show node domain names instead of IPs"),
@@ -4480,7 +4480,7 @@ Results can be one of the following: `running` or `not running`.
By default, the incremental backup status is `not running`.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/statusbackup.html")),
},
{
@@ -4499,7 +4499,7 @@ Results can be one of the following: `running` or `not running`.
By default, the native transport is `running`.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/statusbinary.html")),
},
{
@@ -4516,7 +4516,7 @@ Results can be one of the following: `running` or `not running`.
By default, the gossip protocol is `running`.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/statusgossip.html")),
},
{
@@ -4530,7 +4530,7 @@ For more information, see: {}"
fmt::format(R"(
This command is usually used to stop compaction that has a negative impact on the performance of a node.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/stop.html")),
{
typed_option<int>("id", "The id of the compaction operation to stop (not implemented)"),
@@ -4555,7 +4555,7 @@ since the last time you ran the nodetool cfhistograms command.
Also invokable as "cfhistograms".
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/cfhistograms.html")),
{ },
{
@@ -4572,7 +4572,7 @@ For more information, see: {}"
{"cfstats"},
"Print statistics on tables",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/tablestats.html")),
{
typed_option<bool>("ignore,i", false, "Ignore the list of tables and display the remaining tables"),
@@ -4604,7 +4604,7 @@ fmt::format(R"(
Aborts a task with given id. If the task is not abortable, appropriate message
will be printed, depending on why the abort failed.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/tasks/abort.html")),
{ },
{
@@ -4618,7 +4618,7 @@ fmt::format(R"(
Unregisters all finished local tasks from the specified module. If a module is not specified,
all modules are drained.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/tasks/drain.html")),
{
typed_option<sstring>("module", "The module name; if specified, only the tasks from this module are unregistered"),
@@ -4632,7 +4632,7 @@ fmt::format(R"(
Gets or sets the time in seconds for which tasks started by user will be kept in task manager after
they are finished.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/tasks/user-ttl.html")),
{
typed_option<uint32_t>("set", "New user_task_ttl value", -1),
@@ -4648,7 +4648,7 @@ keyspace, table, entity, shard, start_time, and end_time) of tasks in a specifie
Allows to monitor tasks for extended time.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/tasks/list.html")),
{
typed_option<>("internal", "Show internal tasks"),
@@ -4665,7 +4665,7 @@ For more information, see: {}"
"modules",
"Gets a list of modules supported by task manager",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/tasks/modules.html")),
{ },
{ },
@@ -4674,7 +4674,7 @@ For more information, see: {}"
"status",
"Gets a status of the task",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/tasks/status.html")),
{ },
{
@@ -4688,7 +4688,7 @@ fmt::format(R"(
Lists statuses of a specified task and all its descendants in BFS order.
If id param isn't specified, trees of all non-internal tasks are listed.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/tasks/tree.html")),
{ },
{
@@ -4702,7 +4702,7 @@ fmt::format(R"(
Gets or sets the time in seconds for which tasks will be kept in task manager after
they are finished.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/tasks/ttl.html")),
{
typed_option<uint32_t>("set", "New task_ttl value", -1),
@@ -4722,7 +4722,7 @@ If quiet flag is set, nothing is printed. Instead the right exit code is returne
- 124, if the operation timed out;
- 125, if there was an error.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/tasks/wait.html")),
{
typed_option<>("quiet,q", "If set, status isn't printed"),
@@ -4771,7 +4771,7 @@ For more information, see: {}"
"toppartitions",
"Sample and print the most active partitions for a given column family",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/toppartitions.html")),
{
typed_option<int>("duration,d", 5000, "Duration in milliseconds"),
@@ -4802,7 +4802,7 @@ Can also be used to upgrade all sstables to the latest sstable version.
Note that this command is not needed for changes described above to take effect. They take effect gradually as new sstables are written and old ones are compacted.
This command should be used when it is desired that such changes take effect right away.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/upgradesstables.html")),
{
typed_option<>("include-all-sstables,a", "Include all sstables, even those already on the current version"),
@@ -4822,7 +4822,7 @@ For more information, see: {}"
"viewbuildstatus",
"Show progress of a materialized view build",
fmt::format(R"(
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/viewbuildstatus.html")),
{},
{
@@ -4842,7 +4842,7 @@ Displays the Apache Cassandra version which your version of Scylla is most
compatible with, not your current Scylla version. To display the Scylla version,
run `scylla --version`.
For more information, see: {}"
For more information, see: {}
)", doc_link("operating-scylla/nodetool-commands/version.html")),
},
{

View File

@@ -36,6 +36,7 @@
#include "sstables/sstables_manager.hh"
#include "sstables/sstable_directory.hh"
#include "sstables/open_info.hh"
#include "release.hh"
#include "replica/schema_describe_helper.hh"
#include "test/lib/cql_test_env.hh"
#include "tools/json_writer.hh"
@@ -2110,7 +2111,7 @@ const std::map<operation, operation_func> operations_with_func{
/* dump-data */
{{"dump-data",
"Dump content of sstable(s)",
R"(
fmt::format(R"(
Dump the content of the data component. This component contains the data-proper
of the sstable. This might produce a huge amount of output. In general the
human-readable output will be larger than the binary file.
@@ -2122,9 +2123,8 @@ format.
Supports both a text and JSON output. The text output uses the built-in scylla
printers, which are also used when logging mutation-related data structures.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#dump-data
for more information on this operation, including the schema of the JSON output.
)",
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#dump-data")),
{
typed_option<std::vector<sstring>>("partition", "partition(s) to filter for, partitions are expected to be in the hex format"),
typed_option<sstring>("partitions-file", "file containing partition(s) to filter for, partitions are expected to be in the hex format"),
@@ -2136,7 +2136,7 @@ for more information on this operation, including the schema of the JSON output.
/* dump-index */
{{"dump-index",
"Dump content of sstable index(es)",
R"(
fmt::format(R"(
Dump the content of the index component. Contains the partition-index of the data
component. This is effectively a list of all the partitions in the sstable, with
their starting position in the data component and optionally a promoted index,
@@ -2144,80 +2144,73 @@ which contains a sampled index of the clustering rows in the partition.
Positions (both that of partition and that of rows) is valid for uncompressed
data.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#dump-index
for more information on this operation, including the schema of the JSON output.
)"},
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#dump-index"))},
dump_index_operation},
/* dump-compression-info */
{{"dump-compression-info",
"Dump content of sstable compression info(s)",
R"(
fmt::format(R"(
Dump the content of the compression-info component. Contains compression
parameters and maps positions into the uncompressed data to that into compressed
data. Note that compression happens over chunks with configurable size, so to
get data at a position in the middle of a compressed chunk, the entire chunk has
to be decompressed.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#dump-compression-info
for more information on this operation, including the schema of the JSON output.
)"},
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#dump-compression-info"))},
dump_compression_info_operation},
/* dump-summary */
{{"dump-summary",
"Dump content of sstable summary(es)",
R"(
fmt::format(R"(
Dump the content of the summary component. The summary is a sampled index of the
content of the index-component. An index of the index. Sampling rate is chosen
such that this file is small enough to be kept in memory even for very large
sstables.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#dump-summary
for more information on this operation, including the schema of the JSON output.
)"},
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#dump-summary"))},
dump_summary_operation},
/* dump-statistics */
{{"dump-statistics",
"Dump content of sstable statistics(s)",
R"(
fmt::format(R"(
Dump the content of the statistics component. Contains various metadata about the
data component. In the sstable 3 format, this component is critical for parsing
the data component.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#dump-statistics
for more information on this operation, including the schema of the JSON output.
)"},
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#dump-statistics"))},
dump_statistics_operation},
/* dump-scylla-metadata */
{{"dump-scylla-metadata",
"Dump content of sstable scylla metadata(s)",
R"(
fmt::format(R"(
Dump the content of the scylla-metadata component. Contains scylla-specific
metadata about the data component. This component won't be present in sstables
produced by Apache Cassandra.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#dump-scylla-metadata
for more information on this operation, including the schema of the JSON output.
)"},
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#dump-scylla-metadata"))},
dump_scylla_metadata_operation},
/* validate */
{{"validate",
"Validate the sstable(s), same as scrub in validate mode",
R"(
fmt::format(R"(
Validates the content of the sstable on the mutation-fragment level, see
https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#sstable-content
for more details.
Any parsing errors will also be detected, but after successful parsing the
validation will happen on the fragment level.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#validate
for more information on this operation.
)"},
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#validate"))},
validate_operation},
/* scrub */
{{"scrub",
"Scrub the sstable(s), in the specified mode",
R"(
fmt::format(R"(
Read and re-write the sstable, getting rid of or fixing broken parts, depending
on the selected mode.
Output sstables are written to the directory specified via `--output-directory`.
@@ -2230,9 +2223,8 @@ abort the scrub. This can be overridden by the
be aborted if an sstable cannot be written because its generation clashes with
pre-existing sstables in the directory.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#scrub
for more information on this operation, including what the different modes do.
)",
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#scrub")),
{
typed_option<std::string>("scrub-mode", "scrub mode to use, one of (abort, skip, segregate, validate)"),
typed_option<std::string>("output-dir", ".", "directory to place the scrubbed sstables to"),
@@ -2242,12 +2234,11 @@ for more information on this operation, including what the different modes do.
/* validate-checksums */
{{"validate-checksums",
"Validate the checksums of the sstable(s)",
R"(
fmt::format(R"(
Validate both the whole-file and the per-chunk checksums of the data component.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#validate-checksums
for more information on this operation.
)"},
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#validate-checksums"))},
validate_checksums_operation},
/* decompress */
{{"decompress",
@@ -2266,7 +2257,7 @@ the output will be:
/* write */
{{"write",
"Write an sstable",
R"(
fmt::format(R"(
Write an sstable based on a JSON representation of the content. The JSON
representation has to have the same schema as that of a single sstable
from the output of the dump-data operation (corresponding to the $SSTABLE
@@ -2286,9 +2277,8 @@ format and random UUID generation (printed to stdout). By default it is
placed in the local directory, can be changed with --output-dir. If the
output sstable clashes with an existing sstable, the write will fail.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#write
for more information on this operation, including the schema of the JSON input.
)",
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#write")),
{
typed_option<std::string>("input-file", "the file containing the input"),
typed_option<std::string>("output-dir", ".", "directory to place the output sstable(s) to"),
@@ -2299,13 +2289,12 @@ for more information on this operation, including the schema of the JSON input.
/* script */
{{"script",
"Run a script on content of an sstable",
R"(
fmt::format(R"(
Read the sstable(s) and pass the resulting fragment stream to the script
specified by `--script-file`. Currently only Lua scripts are supported.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#script
for more information on this operation, including the API documentation.
)",
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#script")),
{
typed_option<>("merge", "merge all sstables into a single mutation fragment stream (use a combining reader over all sstable readers)"),
typed_option<std::string>("script-file", "script file to load and execute"),
@@ -2326,7 +2315,7 @@ for more information on this operation, including the API documentation.
/* query */
{{"query",
"Run a query on the content of the sstable(s)",
R"(
fmt::format(R"(
The query is run on the combined content of all input sstables.
By default, the following query is run: SELECT * FROM $table.
@@ -2350,9 +2339,8 @@ cql_test_env. This temporary directory will have a size of a couple of megabytes
By default it will create this in /tmp, this can be changed with the `TEMPDIR`
environment variable. This temporary directory is removed on exit.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#query
for more information on this operation, including usage examples.
)",
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#query")),
{
typed_option<std::string>("query,q", "execute the query provided on the command-line"),
typed_option<std::string>("query-file", "execute the query from the file, the file is expected to contain a single query"),
@@ -2362,7 +2350,7 @@ for more information on this operation, including usage examples.
/* upgrade */
{{"upgrade",
"Upgrade sstable(s) to the highest supported version and apply the latest schema",
R"(
fmt::format(R"(
This command is an offline version of nodetool upgradesstables.
Applies the latest sstable version and the latest schema to the sstables.
@@ -2379,9 +2367,8 @@ versions which are supported for writing: mc, md, me, ms.
Mapping of input sstables to output sstables is printed to stdout.
See https://docs.scylladb.com/operating-scylla/admin-tools/scylla-sstable#upgrade
for more information on this operation, including usage examples.
)",
For more information, see: {}
)", doc_link("operating-scylla/admin-tools/scylla-sstable#upgrade")),
{
typed_option<std::string>("output-dir", ".", "directory to place the output sstable(s) to"),
typed_option<std::string>("sstable-version", "sstable version to use, defaults to the same version as ScyllaDB would"),
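
All of the scylla-sstable help strings above switch from hard-coded documentation URLs to the same doc_link() helper the nodetool descriptions use. Its implementation is not part of this diff; a minimal sketch of what such a helper plausibly does, assuming it merely prefixes the documentation site's base URL (the URL below is an assumption, not taken from the source):

    #include <string>
    #include <string_view>

    // Hypothetical stand-in for doc_link(): prepend the docs base URL to a page
    // path such as "operating-scylla/admin-tools/scylla-sstable#dump-data".
    std::string doc_link(std::string_view url_tail) {
        return std::string("https://docs.scylladb.com/manual/stable/") + std::string(url_tail);
    }
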

View File

@@ -57,7 +57,6 @@ target_sources(utils
s3/client.cc
s3/retryable_http_client.cc
s3/retry_strategy.cc
s3/s3_retry_strategy.cc
s3/credentials_providers/aws_credentials_provider.cc
s3/credentials_providers/environment_aws_credentials_provider.cc
s3/credentials_providers/instance_profile_credentials_provider.cc

View File

@@ -205,7 +205,7 @@ public:
}
named_value(config_file* file, std::string_view name, liveness liveness_, value_status vs, const T& t = T(), std::string_view desc = {},
std::initializer_list<T> allowed_values = {})
: named_value(file, name, {}, liveness_, vs, t, desc) {
: named_value(file, name, {}, liveness_, vs, t, desc, std::move(allowed_values)) {
}
named_value(config_file* file, std::string_view name, std::string_view alias, value_status vs, const T& t = T(), std::string_view desc = {},
std::initializer_list<T> allowed_values = {})
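
This one-line change fixes a delegating constructor that accepted allowed_values but never forwarded it, so the restriction was silently dropped. A self-contained illustration of the same bug shape (the option type below is invented for the example, not ScyllaDB's named_value):

    #include <cassert>
    #include <initializer_list>
    #include <string>
    #include <vector>

    struct option {
        std::string name;
        std::vector<std::string> allowed;

        option(std::string n, std::string alias, std::vector<std::string> allowed_values)
            : name(std::move(n)), allowed(std::move(allowed_values)) { (void)alias; }

        // Before the fix, the delegating constructor passed an empty list and the
        // caller's allowed_values vanished; the fixed version forwards them.
        option(std::string n, std::initializer_list<std::string> allowed_values)
            : option(std::move(n), "", std::vector<std::string>(allowed_values)) {}
    };

    int main() {
        option o("incremental_mode", {"disabled", "incremental", "full"});
        assert(o.allowed.size() == 3); // with the dropped argument this would be 0
    }
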

View File

@@ -34,7 +34,6 @@
#include <seastar/util/lazy.hh>
#include <seastar/http/request.hh>
#include <seastar/http/exception.hh>
#include "s3_retry_strategy.hh"
#include "db/config.hh"
#include "utils/assert.hh"
#include "utils/s3/aws_error.hh"
@@ -121,10 +120,7 @@ client::client(std::string host, endpoint_config_ptr cfg, semaphore& mem, global
_creds_update_timer.arm(lowres_clock::now());
if (!_retry_strategy) {
_retry_strategy = std::make_unique<aws::s3_retry_strategy>([this]() -> future<> {
auto units = co_await get_units(_creds_sem, 1);
co_await update_credentials_and_rearm();
});
_retry_strategy = std::make_unique<aws::default_retry_strategy>();
}
}
@@ -235,6 +231,11 @@ void client::group_client::register_metrics(std::string class_name, std::string
sm::description("Total time spend writing data to objects"), {ep_label, sg_label}),
sm::make_counter("total_read_prefetch_bytes", [this] { return prefetch_bytes; },
sm::description("Total number of bytes requested from object"), {ep_label, sg_label}),
sm::make_counter("downloads_blocked_on_memory",
[this] { return downloads_blocked_on_memory; },
sm::description("Counts the number of times S3 client downloads were delayed due to insufficient memory availability"),
{ep_label, sg_label})
});
}
@@ -300,19 +301,50 @@ client::group_client& client::find_or_create_client() {
}
}
future<> client::make_request(http::request req, http::experimental::client::reply_handler handle, std::optional<http::reply::status_type> expected, seastar::abort_source* as) {
co_await authorize(req);
auto& gc = find_or_create_client();
co_await gc.retryable_client.make_request(std::move(req), std::move(handle), expected, as);
future<> client::make_request(http::request req,
http::experimental::client::reply_handler handle,
std::optional<http::reply::status_type> expected,
seastar::abort_source* as) {
auto request = std::move(req);
constexpr size_t max_attempts = 3;
size_t attempts = 0;
while (true) {
co_await authorize(request);
auto& gc = find_or_create_client();
try {
co_return co_await gc.retryable_client.make_request(
request, [&handle](const http::reply& reply, input_stream<char>&& body) { return handle(reply, std::move(body)); }, expected, as);
} catch (const aws::aws_exception& ex) {
if (++attempts <= max_attempts) {
if (ex.error().get_error_type() == aws::aws_error_type::REQUEST_TIME_TOO_SKEWED) {
s3l.warn("Request failed with REQUEST_TIME_TOO_SKEWED. Machine time: {}, request timestamp: {}",
utils::aws::format_time_point(db_clock::now()),
request.get_header("x-amz-date"));
continue;
}
if (ex.error().get_error_type() == aws::aws_error_type::EXPIRED_TOKEN) {
s3l.warn("Request failed with EXPIRED_TOKEN. Resetting credentials");
_credentials = {};
continue;
}
}
map_s3_client_exception(std::current_exception());
} catch (const storage_io_error&) {
throw;
} catch (const abort_requested_exception&) {
throw;
} catch (...) {
map_s3_client_exception(std::current_exception());
}
}
}
future<> client::make_request(http::request req, reply_handler_ext handle_ex, std::optional<http::reply::status_type> expected, seastar::abort_source* as) {
co_await authorize(req);
auto& gc = find_or_create_client();
auto handle = [&gc, handle = std::move(handle_ex)] (const http::reply& rep, input_stream<char>&& in) {
return handle(gc, rep, std::move(in));
};
co_await gc.retryable_client.make_request(std::move(req), std::move(handle), expected, as);
co_await make_request(std::move(req), std::move(handle), expected, as);
}
future<> client::get_object_header(sstring object_name, http::experimental::client::reply_handler handler, seastar::abort_source* as) {
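
The rewritten make_request keeps the http::request alive across attempts so it can be re-authorized before every send, and it only loops on failures that a refreshed request can actually fix (a skewed timestamp or an expired token, bounded by max_attempts); everything else is mapped and surfaced immediately. A simplified, synchronous sketch of that control flow (the names and types below are illustrative, not Seastar or S3 client APIs):

    #include <cstddef>
    #include <stdexcept>
    #include <string>

    enum class error_kind { clock_skew, expired_token, other };

    struct request_error : std::runtime_error {
        error_kind kind;
        request_error(error_kind k, const std::string& what) : std::runtime_error(what), kind(k) {}
    };

    // Re-sign and resend the same request; only clock-skew and expired-token
    // failures are worth another attempt, and even those are bounded.
    template <typename Request, typename Authorize, typename Send>
    void send_with_refresh(Request req, Authorize authorize, Send send, std::size_t max_attempts = 3) {
        for (std::size_t attempt = 0; ; ++attempt) {
            authorize(req); // regenerate signature, x-amz-date, credentials
            try {
                send(req);
                return;
            } catch (const request_error& e) {
                bool refreshable = e.kind == error_kind::clock_skew || e.kind == error_kind::expired_token;
                if (!refreshable || attempt + 1 > max_attempts) {
                    throw; // a rebuilt request would not help, propagate to the caller
                }
                // otherwise fall through and retry with a freshly authorized request
            }
        }
    }
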
@@ -1142,13 +1174,15 @@ class client::chunked_download_source final : public seastar::data_source_impl {
s3l.trace("Fiber starts cycle for object '{}'", _object_name);
while (!_is_finished) {
try {
if (_buffers_size >= _max_buffers_size * _buffers_low_watermark) {
co_await _bg_fiber_cv.when([this] { return _buffers_size < _max_buffers_size * _buffers_low_watermark; });
if (!_is_finished && _buffers_size >= _max_buffers_size * _buffers_low_watermark) {
co_await _bg_fiber_cv.when([this] { return _is_finished || (_buffers_size < _max_buffers_size * _buffers_low_watermark); });
}
if (auto units = try_get_units(_client->_memory, _socket_buff_size); !_buffers.empty() && !units) {
if (auto units = try_get_units(_client->_memory, _socket_buff_size); !_is_finished && !_buffers.empty() && !units) {
auto& gc = _client->find_or_create_client();
++gc.downloads_blocked_on_memory;
co_await _bg_fiber_cv.when([this] {
return _buffers.empty() || try_get_units(_client->_memory, _socket_buff_size);
return _is_finished || _buffers.empty() || try_get_units(_client->_memory, _socket_buff_size);
});
}
@@ -1203,7 +1237,7 @@ class client::chunked_download_source final : public seastar::data_source_impl {
while (_buffers_size < _max_buffers_size && !_is_finished) {
utils::get_local_injector().inject("kill_s3_inflight_req", [] {
// Inject non-retryable error to emulate source failure
throw aws::aws_exception(aws::aws_error::get_errors().at("ResourceNotFound"));
throw aws::aws_exception(aws::aws_error(aws::aws_error_type::RESOURCE_NOT_FOUND, "Injected ResourceNotFound", aws::retryable::no));
});
s3l.trace("Fiber for object '{}' will try to read within range {}", _object_name, _range);
temporary_buffer<char> buf;
@@ -1247,20 +1281,23 @@ class client::chunked_download_source final : public seastar::data_source_impl {
co_await in.close();
if (ex) {
auto aws_ex = aws::aws_error::from_exception_ptr(ex);
if (aws_ex.is_retryable()) {
s3l.debug("Fiber for object '{}' rethrowing filler aws_exception {}", _object_name, ex);
throw filler_exception(format("{}", ex).c_str());
}
std::rethrow_exception(ex);
s3l.debug("Fiber for object '{}' rethrowing filler aws_exception {}", _object_name, ex);
throw filler_exception(
ex, aws_ex.is_retryable() == aws::retryable::no && aws_ex.get_error_type() != aws::aws_error_type::EXPIRED_TOKEN);
}
},
{},
_as);
_is_contiguous_mode = _buffers_size < _max_buffers_size * _buffers_high_watermark;
} catch (const filler_exception& ex) {
s3l.warn("Fiber for object '{}' experienced an error in buffer filling loop. Reason: {}. Re-issuing the request", _object_name, ex);
if (ex._should_abort) {
s3l.info("Fiber for object '{}' experienced a non-retryable error in buffer filling loop. Reason: {}. Exiting", _object_name, ex._original_exception);
_get_cv.broken(ex._original_exception);
co_return;
}
s3l.info("Fiber for object '{}' experienced an error in buffer filling loop. Reason: {}. Re-issuing the request", _object_name, ex._original_exception);
} catch (...) {
s3l.trace("Fiber for object '{}' failed: {}, exiting", _object_name, std::current_exception());
s3l.info("Fiber for object '{}' failed: {}, exiting", _object_name, std::current_exception());
_get_cv.broken(std::current_exception());
co_return;
}
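
Two of the hunks above add an _is_finished check to every wait predicate in the background fiber, so that closing the source wakes the fiber instead of leaving it parked on a watermark condition that can no longer change. The same idea expressed with the standard library (a deliberately simplified analogue, not the Seastar code):

    #include <condition_variable>
    #include <cstddef>
    #include <mutex>

    struct download_buffers {
        std::mutex m;
        std::condition_variable cv;
        std::size_t size = 0;
        std::size_t low_watermark = 4;
        bool finished = false;

        // Mirrors the fixed predicate: wake when there is room for more buffers
        // or when the whole download has been finished/closed.
        void wait_for_room() {
            std::unique_lock<std::mutex> lk(m);
            cv.wait(lk, [&] { return finished || size < low_watermark; });
        }

        void close() {
            { std::lock_guard<std::mutex> lk(m); finished = true; }
            cv.notify_all(); // without the finished check this wake-up would be lost
        }
    };
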

View File

@@ -90,8 +90,10 @@ struct stats {
std::time_t last_modified;
};
struct filler_exception final : std::runtime_error {
explicit filler_exception(const char* msg) : std::runtime_error(msg) {}
struct filler_exception final : std::exception {
filler_exception(std::exception_ptr original_exception, bool should_abort) : _original_exception(std::move(original_exception)), _should_abort(should_abort) {}
std::exception_ptr _original_exception;
bool _should_abort{false};
};
future<> ignore_reply(const http::reply& rep, input_stream<char>&& in_);
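
filler_exception now carries the original exception_ptr together with a should_abort flag, so the download fiber can either re-issue the request or surface the real cause unchanged. A small, self-contained illustration of that contract (simplified, not the actual client code):

    #include <cstdio>
    #include <exception>
    #include <stdexcept>
    #include <utility>

    struct filler_exception final : std::exception {
        std::exception_ptr original;
        bool should_abort;
        filler_exception(std::exception_ptr e, bool abort) : original(std::move(e)), should_abort(abort) {}
    };

    int main() {
        try {
            try {
                throw std::runtime_error("ResourceNotFound"); // the real, non-retryable failure
            } catch (...) {
                throw filler_exception(std::current_exception(), true); // wrapped by the fiber
            }
        } catch (const filler_exception& fe) {
            if (fe.should_abort) {
                try {
                    std::rethrow_exception(fe.original); // hand the original cause to the consumer
                } catch (const std::exception& e) {
                    std::puts(e.what()); // prints "ResourceNotFound"
                }
            } else {
                // retryable case: the fiber would log and re-issue the request here
            }
        }
    }
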
@@ -131,6 +133,7 @@ class client : public enable_shared_from_this<client> {
io_stats read_stats;
io_stats write_stats;
uint64_t prefetch_bytes = 0;
uint64_t downloads_blocked_on_memory = 0;
seastar::metrics::metric_groups metrics;
group_client(std::unique_ptr<http::experimental::connection_factory> f, unsigned max_conn, const aws::retry_strategy& retry_strategy);
void register_metrics(std::string class_name, std::string host);

View File

@@ -28,7 +28,7 @@ retryable_http_client::retryable_http_client(std::unique_ptr<http::experimental:
assert(_error_handler);
}
future<> retryable_http_client::do_retryable_request(http::request req, http::experimental::client::reply_handler handler, seastar::abort_source* as) {
future<> retryable_http_client::do_retryable_request(const seastar::http::request& req, http::experimental::client::reply_handler handler, seastar::abort_source* as) {
// TODO: the http client does not check abort status on entry, and if
// we're already aborted when we get here we will paradoxally not be
// interrupted, because no registration etc will be done. So do a quick
@@ -52,7 +52,10 @@ future<> retryable_http_client::do_retryable_request(http::request req, http::ex
e = std::current_exception();
request_ex = aws_exception(aws_error::from_exception_ptr(e));
}
if (request_ex.error().get_error_type() == aws::aws_error_type::REQUEST_TIME_TOO_SKEWED ||
request_ex.error().get_error_type() == aws::aws_error_type::EXPIRED_TOKEN) {
co_await coroutine::return_exception_ptr(std::move(e));
}
if (!co_await _retry_strategy.should_retry(request_ex.error(), retries)) {
break;
}
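
With this check the low-level retry loop gives up immediately on the two error types that only a rebuilt request can fix, instead of asking the retry strategy to resend the same, already-invalid request; the higher-level client::make_request loop shown earlier then refreshes and retries. A compact sketch of that decision (the enum and names are abbreviations for the example, not the real aws_error_type):

    #include <stdexcept>

    enum class error_type { request_time_too_skewed, expired_token, throttled, internal_error };

    // Stand-in for the configured retry strategy.
    bool strategy_should_retry(error_type e, unsigned retries) {
        return retries < 10 && (e == error_type::throttled || e == error_type::internal_error);
    }

    // Mirrors the early-out: propagate errors that need a refreshed request,
    // defer everything else to the retry strategy.
    bool should_keep_retrying(error_type e, unsigned retries) {
        if (e == error_type::request_time_too_skewed || e == error_type::expired_token) {
            throw std::runtime_error("request must be re-signed and resent by the caller");
        }
        return strategy_should_retry(e, retries);
    }
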
@@ -65,12 +68,12 @@ future<> retryable_http_client::do_retryable_request(http::request req, http::ex
}
}
future<> retryable_http_client::make_request(http::request req,
future<> retryable_http_client::make_request(const seastar::http::request& req,
http::experimental::client::reply_handler handle,
std::optional<http::reply::status_type> expected,
seastar::abort_source* as) {
co_await do_retryable_request(
std::move(req),
req,
[handler = std::move(handle), expected](const http::reply& rep, input_stream<char>&& in) mutable -> future<> {
auto payload = std::move(in);
auto status_class = http::reply::classify_status(rep._status);

View File

@@ -21,7 +21,7 @@ public:
error_handler error_func,
seastar::http::experimental::client::retry_requests should_retry,
const aws::retry_strategy& retry_strategy);
seastar::future<> make_request(seastar::http::request req,
seastar::future<> make_request(const seastar::http::request& req,
seastar::http::experimental::client::reply_handler handle,
std::optional<seastar::http::reply::status_type> expected = std::nullopt,
seastar::abort_source* = nullptr);
@@ -31,7 +31,7 @@ public:
private:
seastar::future<>
do_retryable_request(seastar::http::request req, seastar::http::experimental::client::reply_handler handler, seastar::abort_source* as = nullptr);
do_retryable_request(const seastar::http::request& req, seastar::http::experimental::client::reply_handler handler, seastar::abort_source* as = nullptr);
seastar::http::experimental::client http;
const aws::retry_strategy& _retry_strategy;

View File

@@ -1,32 +0,0 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#include "s3_retry_strategy.hh"
#include "aws_error.hh"
#include "utils/log.hh"
using namespace std::chrono_literals;
namespace aws {
static logging::logger s3_retry_logger("s3_retry_strategy");
s3_retry_strategy::s3_retry_strategy(credentials_refresher creds_refresher, unsigned max_retries, unsigned scale_factor)
: default_retry_strategy(max_retries, scale_factor), _creds_refresher(std::move(creds_refresher)) {
}
seastar::future<bool> s3_retry_strategy::should_retry(const aws_error& error, unsigned attempted_retries) const {
if (attempted_retries < _max_retries && error.get_error_type() == aws_error_type::EXPIRED_TOKEN) {
s3_retry_logger.info("Credentials are expired, renewing");
co_await _creds_refresher();
co_return true;
}
co_return co_await default_retry_strategy::should_retry(error, attempted_retries);
}
} // namespace aws

View File

@@ -1,27 +0,0 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#pragma once
#include "retry_strategy.hh"
namespace aws {
class aws_error;
class s3_retry_strategy : public default_retry_strategy {
public:
using credentials_refresher = std::function<seastar::future<>()>;
s3_retry_strategy(credentials_refresher creds_refresher, unsigned max_retries = 10, unsigned scale_factor = 25);
seastar::future<bool> should_retry(const aws_error& error, unsigned attempted_retries) const override;
private:
credentials_refresher _creds_refresher;
};
} // namespace aws

View File

@@ -45,10 +45,8 @@ auto wait_for_timeout(lowres_clock::duration timeout, abort_source& as) -> futur
dns::dns(logging::logger& logger, std::vector<seastar::sstring> hosts, listener_type listener, uint64_t& refreshes_counter)
: vslogger(logger)
, _refresh_interval(DNS_REFRESH_INTERVAL)
, _resolver([this, &refreshes_counter](auto const& host) -> future<address_type> {
, _resolver([this](auto const& host) -> future<address_type> {
auto f = co_await coroutine::as_future(net::dns::get_host_by_name(host));
refreshes_counter++;
if (f.failed()) {
auto err = f.get_exception();
if (try_catch<std::system_error>(err) != nullptr) {
@@ -61,7 +59,8 @@ dns::dns(logging::logger& logger, std::vector<seastar::sstring> hosts, listener_
co_return addr.addr_list;
})
, _hosts(std::move(hosts))
, _listener(std::move(listener)) {
, _listener(std::move(listener))
, _refreshes_counter(refreshes_counter) {
}
void dns::start_background_tasks() {
@@ -120,6 +119,7 @@ seastar::future<> dns::refresh_addr() {
host_address_map new_addrs;
auto copy = _hosts;
co_await coroutine::parallel_for_each(std::move(copy), [this, &new_addrs](const sstring& host) -> future<> {
++_refreshes_counter;
new_addrs[host] = co_await _resolver(host);
});
if (new_addrs != _addresses) {

View File

@@ -73,6 +73,7 @@ private:
std::vector<seastar::sstring> _hosts;
host_address_map _addresses;
listener_type _listener;
uint64_t& _refreshes_counter;
};
} // namespace vector_search