Compare commits

..

91 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
9c401e260a Initial setup: Fix configure.py for Ubuntu/Debian platform
Temporarily add Ubuntu/Debian support in kmiplib() function to allow
configuration to proceed on Ubuntu systems.

Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
2025-12-16 11:18:56 +00:00
copilot-swe-agent[bot]
1824b04e2a Initial plan 2025-12-16 10:58:17 +00:00
Anna Stuchlik
ea6f2a21c6 doc: remove references to ScyllaDB versions 4.3 and 4.4
We should never refer to the no longer supported OSS versions.
This is a leftover - other mentions were removed long time ago.

Fixes https://github.com/scylladb/scylladb/issues/19569

Closes scylladb/scylladb#27656
2025-12-16 06:58:53 +02:00
Yaniv Kaul
30c4bc3f96 Fix for __iter__ method returns a non-iterator
To fix this issue, the std_list_iterator class defined within
std_list.__iter__ should implement the full iterator protocol by defining
an __iter__() method that returns self. This change ensures any instance
of std_list_iterator can be used as an iterator in Python for loops and
other iteration contexts, as required. The fix is to add a small method
definition inside the std_list_iterator class, ideally after the __init__
or in a logical place with the other dunder methods.

Only the code inside the std_list class's __iter__ function (lines around
the definition of the inner class and its methods) needs to be edited.

Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

Closes scylladb/scylladb#27642
2025-12-16 06:57:19 +02:00
Piotr Smaron
77fa936edc doc: audit: update to present how to enable both syslog and table
Supporting both sinks have been introduced in
https://github.com/scylladb/scylladb/pull/26613, but it missed the docs
changes, so here they are.

Closes scylladb/scylladb#27607
2025-12-16 06:56:39 +02:00
Piotr Smaron
0ec485845b Clarify documentation build instructions
Closes scylladb/scylladb#27606
2025-12-16 06:56:00 +02:00
Botond Dénes
dace39fd6c Merge 'Make commitlog replay handle files with corrupt file header (non-zero) as data loss, not startup failure' from Calle Wilund
Fixes #26744

If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category.

However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up.

The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed test due to scylla not starting up, instead of losing data as expected.

Closes scylladb/scylladb#27556

* github.com:scylladb/scylladb:
  test::cluster::dtest::tools::files: Remove file
  commitlog_replay: Handle fully corrupt files same as partial corruption.
  test::pylib::suite::base: Split options.name test specifier only once
2025-12-16 06:55:42 +02:00
Calle Wilund
5f8f724d78 repair: Don't use off-strategy as repair destination with tablet tables
Fixes #17384

Bypasses enabling off-strategy storage/placement for repair streams
when table repaired is using tablets. Instead, the resulting sstable(s)
will be placed in the "normal" set of sstables, and bypass a post-repair
off-strategy compaction.

v2:
Bypass off-strat for whatever reason iff dest is tablets.

Closes scylladb/scylladb#27500
2025-12-16 06:54:07 +02:00
Michał Chojnowski
df93ea626b test/scylla_gdb: use gcore instead of signal SIGSEGV to generate a coredump on failure
The test fails in CI sometimes, and we want a coredump from a failure
to debug that. We made the test send a `signal SIGSEGV` to Scylla
on failure, but apparently that doesn't work as intended on our CI
hosts. (The CI runner seemingly can't find any coredump afterwards).

We can use gdb's `gcore` command to produce a coredump in a more
predictable way.

Refs scylladb/scylladb#22501

Closes scylladb/scylladb#27498
2025-12-16 06:53:43 +02:00
Botond Dénes
74347625f9 Merge 'test/alternator: add reproducers for more issues' from Nadav Har'El
This series adds an xfailing reproducers for two issue: #8070 and #27037:

27037 is about where even with alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON representation - a Alternator Streams
event is unduly produced.

8070 is about the ability to write malformed values into the database and then fail during read - instead of failing, as expected, during the write. This issue was known for years, but we never really had a reproducer for it - it's not possible to reproduce it using clean boto3 code and we need to build a request manually.

The first two patches are two small cleanups (including fixes #27372)  that I did while preparing the real tests - which are in the final two patches.

Closes scylladb/scylladb#27376

* github.com:scylladb/scylladb:
  test/alternator: add reproducer for bug with storing invalid values
  test/alternator: reproducer for issue 27375
  utils/rjson: fix error messages from rjson::parse()
  test/alternator: extract get_signed_request() to util.py
2025-12-16 06:53:14 +02:00
Patryk Jędrzejczak
844545bb74 Merge 'treewide: fix cases of improper re-throwing of std::exception_ptr' from Emil Maskovsky
Fix multiple cases where the captured `std::exception_ptr` has been re-thrown via simple `throw eptr;`, which results in losing the original exception type and details.

Resolved at various places found by clang-tidy:

1. db::schema_applier

When applying schema changes, the previous implementation attempted to handle exceptions by catching and rethrowing them, but did so incorrectly: using `throw ex` with a `std::exception_ptr` loses the original exception type and details.

However, in this case, explicit exception handling is unnecessary. The only reason for catching was to ensure `ap.destroy()` is called before propagating the exception. This can be more cleanly and safely achieved using Seastar's `.finally()` continuation, which guarantees cleanup regardless of success or failure.

2. directories

The `std::exception_ptr()` has been captured for logging and then again re-thrown incorrectly via `throw ex;`. We could use `std::rethrow_exception()` here instead, but it seems to be simpler to just use regular `throw;` to rethrow the original exception, and only use the `std::current_exception()` for logging (which is a pattern used in other places as well).

3. storage_service

Here the exception has been re-thrown incorrectly in a coroutine. There it is best to use the `co_await coroutine::return_exception_ptr` to propagate exception more efficiently in a coroutine-friendly manner.

Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501

No backport: This fixes an error logging issue, that isn't a production problem by itself (only found in test), therefore not backporting to older branches.

Closes scylladb/scylladb#27613

* https://github.com/scylladb/scylladb:
  db: schema_applier: improve exception-safe cleanup
  directories: fix exception rethrowing
  storage_service: use coroutine-friendly exception propagation in join_node_response_handler
2025-12-15 13:56:45 +01:00
Nadav Har'El
ccacea621f test/cqlpy: fix flaky test test_view_in_system_tables
The cqlpy test test_materialized_view.py::test_view_in_system_tables
checks that the system table "system.built_views" can inform us that
a view has been built. This test was flaky, starting to fail quite
often recently, and this patch fixes the problem in the test.

For historic reasons  this test began by calling a utility function
wait_for_view_built() - which uses a different system table,
system_distributed.view_build_status, to wait until the view was built.
The test then immediately tries to verify that also system.built_views
lists this view.

But there is no real reason why we could assume - or want to assume -
that these two tables are updated in this order, or how much time
passed between the two tables being changed. The authors of this
test already acknowledged there is a problem - they included a hack
purporting to be a "read barrier" that claimed to solve this exact
problem - but it seems it doesn't, or at least no longer does after
recent changes to the view builder's implementation.

The solution is simple - just remove the call to wait_for_view_built()
and the "hack" after it. We should just wait in a loop (until a timeout)
for the system table that we really wanted to check - system.built_views.
It's as simple as that. No need for any other assumptions or hacks.

Fixes #27296

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27626
2025-12-15 15:29:08 +03:00
Benny Halevy
f4a4671ad6 table: seal_snapshot: avoid oversized allocation when dumping manifest.json
Currently, we first print the json contents into a stringstream buffer
and then we write it as a whole to the manifest.json file output stream.

This is not scalable and may cause large allocation for large enough number
of files.

Fixes #24216

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27542
2025-12-15 15:19:24 +03:00
Pavel Emelyanov
3f7ee3ce5d Merge 'batchlog: make replay (flush) faster' from Botond Dénes
The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table.

Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed).  Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones.  Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range.

When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC.
The second approach, represented by this PR, is to not rely in tombstone GC to reduce the tombstone amount. Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.

Fixes: https://github.com/scylladb/scylladb/issues/23358

This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.

Closes scylladb/scylladb#26671

* github.com:scylladb/scylladb:
  db/config: change batchlog_replay_cleanup_after_replays default to 1
  test/boost/batchlog_manager_test: add test for batchlog cleanup
  replica/mutation_dump: always set position weight for clustering positions
  service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
  test/lib: introduce error_injection.hh
  utils/error_injection: add debug log to disable() and disable_all()
  test/lib/cql_test_env: forward config to batchlog
  test/lib/cql_test_env: add batch type to execute_batch()
  test/lib/cql_assertions: add with_size(predicate) overload
  test/lib/cql_assertions: add source location to fail messages
  test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
  test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
  test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
  db/batchlog_manager: config: s/write_timeout/reply_timeot/
  db,service: switch to system.batchlog_v2
  db/system_keyspace: introduce system.batchlog_v2
  service,db: extract generation of batchlog delete mutation
  service,db: extract get_batchlog_mutation_for() from storage-proxy
  db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
  db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
  data_dictionary: table: add get_truncation_time()
  db/batchlog_manager: batch(): replace map_reduce() with simple loop
  db/batchlog_manager: finish coroutinizing replay_all_failed_batches
  db/batchlog_manager: improve replayAllFailedBatches logs
2025-12-15 15:05:19 +03:00
Dawid Mędrek
1e14c08eee locator/token_metadata: Remove get_host_id()
The function is declared, but it's not defined or used anywhere.

Closes scylladb/scylladb#27374
2025-12-15 10:36:52 +01:00
Michael Litvak
b9ec1180f5 alternator: require rf_rack_valid_keyspaces when creating index
When creating an alternator table with tablets, if it has an index, LSI
or GSI, require the config option rf_rack_valid_keyspaces to be enabled.

The option is required for materialized views in tablets keyspaces to
function properly and avoid consistency issues that could happen due to
cross-rack migrations and pairing switches when RF-rack validity is not
enforced.

Currently the option is validated when creating a materialized view via
the CQL interface, but it's missing from the alternator interface. Since
alternator indexes are based on materialized views, the same check
should be added there as well.

Fixes scylladb/scylladb#27612

Closes scylladb/scylladb#27622
2025-12-15 10:36:57 +02:00
Michał Hudobski
12483d8c3c vector_search: throw an error when we restrict primary in vector search
We currently allow restrictions on single column primary key,
but we ignore the restriction and return all results.
This can confuse the users. We change it so such a restriction
will throw an error and add a test to validate it.

Fixes: VECTOR-331

Closes scylladb/scylladb#27143
2025-12-15 09:45:56 +02:00
Jenkins Promoter
d5641398f5 Update pgo profiles - aarch64 2025-12-15 05:16:31 +02:00
Nadav Har'El
c06e63daed Merge 'auth: start using SHA 512 hashing originated from musl with added yielding' from Andrzej Jackowski
This patch series contains the following changes:
 - Incorporation of `crypt_sha512.c` from musl to out codebase
 - Conversion of `crypt_sha512.c` to C++ and coroutinization
 - Coroutinization of `auth::passwords::check`
 - Enabling use of `__crypt_sha512` orignated from `crypt_sha512.c` for
   computing SHA 512 passwords of length <=255
 - Addition of yielding in the aforementioned hashing implementation.

The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (https://github.com/scylladb/scylladb/issues/24524).
However, because there is only one alien thread, overall hashing throughput was reduced
(see, e.g., https://github.com/scylladb/scylla-enterprise/issues/5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding is introduced in this patch series.

Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.

The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
 - Alien thread: 41.5 new connections/s per shard
 - Reverted alien thread: 244.1 new connections/s per shard
 - This commit (yielding in hashing): 198.4 new connections/s per shard

The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.

On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
 - Alien thread: 109 stalls/shard total
 - Reverted alien thread: 13186 stalls/shard total
 - This commit (yielding in hashing): 149 stalls/shard total

Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
 - Alien thread: 1087 ms/shard total
 - Reverted alien thread: 72839 ms/shard total
 - This commit (yielding in hashing): 1623 ms/shard total

To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.

Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711

No automatic backport. After this PR is completed, the alien thread should be rather reverted from older branches (2025.2-2025.4 because on 2025.1 it's already removed). Backporting of the other commits needs further discussion.

Closes scylladb/scylladb#26860

* github.com:scylladb/scylladb:
  test/boost: add too_long_password to auth_passwords_test
  test/boost: add same_hashes_as_crypt_r to auth_passwords_test
  auth: utils: add yielding to crypt_sha512
  auth: change return type of passwords::check to future
  auth: remove code duplication in verify_scheme
  test/boost: coroutinize auth_passwords_test
  utils: coroutinize crypt_sha512
  utils: make crypt_sha512.cc to compile
  utils: license: import crypt_sha512.c from musl to the project
  Revert "auth: move passwords::check call to alien thread"
2025-12-14 14:01:01 +02:00
David Garcia
c1c3b2c5bb docs: fix local build
prevents early exits in metrics docs generation to break the local build.

Fixes #27497

Closes scylladb/scylladb#27615
2025-12-14 11:48:48 +02:00
Emil Maskovsky
5e7456936e db: schema_applier: improve exception-safe cleanup
When applying schema changes, the previous implementation attempted to
handle exceptions by catching and rethrowing them, but did so
incorrectly: using `throw ex` with a `std::exception_ptr` loses the
original exception type and details. The correct approach is to use
`std::rethrow_exception()`.

However, in this case, explicit exception handling is unnecessary. The
only reason for catching was to ensure `ap.destroy()` is called before
propagating the exception. This can be more cleanly and safely achieved
using Seastar's `.finally()` continuation, which guarantees cleanup
regardless of success or failure.

This change removes the manual try/catch/rethrow and uses `.finally()`
to ensure proper cleanup, letting exceptions propagate naturally and
preserving their type and information.

Fixes: SCYLLADB-94
Refs: scylladb/scylladb#27501
2025-12-12 18:18:31 +01:00
Emil Maskovsky
e6f5f2537e directories: fix exception rethrowing
Fix location identified by clang-tidy where `std::exception_ptr` was
incorrectly rethrown using `throw ep;`. The correct approach is to use
`std::rethrow_exception(ep)`, which preserves the original exception
type and stack trace.

But this can be even further simplified by logging the current exception
with `std::current_exception()` and rethrowing using `throw;` instead of
capturing and rethrowing a `std::exception_ptr`. This matches the
idiomatic pattern used elsewhere in the codebase and improves clarity.

This change ensures proper exception propagation and avoids type slicing
or loss of diagnostic information.
2025-12-12 18:10:20 +01:00
Emil Maskovsky
76aacc00f2 storage_service: use coroutine-friendly exception propagation in join_node_response_handler
Improve exception handling in join_node_response_handler by using
`co_await coroutine::return_exception_ptr` to propagate exceptions.
This replaces the incorrect direct throw of `std::exception_ptr` and
ensures proper coroutine-friendly exception propagation.
2025-12-12 17:59:54 +01:00
Botond Dénes
7e7e378a4b Merge 'Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"' from null
Reverts commit 8192f45e84.

The merge exposed a critical bug where truncate operations during table drop with auto-snapshot fail, causing Raft applier fiber to stop with unhandled exceptions. This leads to schema inconsistencies across nodes and test failures with "Keyspace does not exist" errors.

**Root Cause**

Commit 19b6207f modified `truncate_table_on_all_shards` to set `use_sstable_identifier = true`:

```cpp
// Before (working)
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name);

// After (broken)
auto opts = db::snapshot_options{.use_sstable_identifier = true};
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name, opts);
```

This triggers exceptions during snapshot that propagate through Raft state machine, causing:
- Raft applier stops: `raft::state_machine_error` at `raft/server.cc:1369`
- Schema changes fail to propagate
- Nodes report non-existent keyspaces for valid schemas

**Changes**

Reverts 15 files (200 deletions, 74 insertions):
- Removes `use_sstable_identifier` from truncate/snapshot code paths
- Reverts `snapshot_options` struct back to simple `skip_flush` boolean
- Removes REST API and nodetool `--use-sstable-identifier` parameter
- Removes feature tests from `test/boost/database_test.cc`

No backport required - the original feature was merged to master only and never released.

<!-- START COPILOT ORIGINAL PROMPT -->

<details>

<summary>Original prompt</summary>

----

*This section details on the original issue you should resolve*

<issue_title>test_table_drop_with_auto_snapshot failed with InvalidRequest</issue_title>
<issue_description>Seen in: https://jenkins.scylladb.com/job/scylla-master/job/next/9968//testReport
Logs: [download](https://downloads.scylladb.com/unstable/scylla/master/testLogs/2025-12-08T15:05:42Z/)

Error message:
```
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist"
```

Stacktrace:
```
manager = <test.pylib.manager_client.ManagerClient object at 0xffff734e6c10>

    @pytest.mark.asyncio
    async def test_table_drop_with_auto_snapshot(manager: ManagerClient):
        logger.info("Bootstrapping cluster")
        cfg = { 'auto_snapshot': True }
        servers = await manager.servers_add(3, config = cfg)

        cql = manager.get_cql()

        # Increases the chance of tablet migration concurrent with schema change
        await inject_error_on(manager, "tablet_allocator_shuffle", servers)

        for i in range(3):
            await cql.run_async("DROP KEYSPACE IF EXISTS test;")
            await cql.run_async("CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 8 };")
            await cql.run_async("CREATE TABLE IF NOT EXISTS test.tbl_sample_kv (id int, value text, PRIMARY KEY (id));")
>           await cql.run_async("INSERT INTO test.tbl_sample_kv (id, value) VALUES (1, 'ala');")
E           cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist"

test/cluster/test_tablets2.py:173: InvalidRequest
```
</issue_description>

<agent_instructions>this issue was exposed by commit 8192f45e84, please send a pull request reverting that merge commit and mark it as fixing this github issue.</agent_instructions>

<comments>
<comment_new><author>@yaronkaikov</author><body>
@denesb is this something in your team area? if not , please feel free to delegate it or un-assign yourself :-)</body></comment_new>
<comment_new><author>@nyh</author><body>
This is very strange. Clearly the keyspace `test` does exist at this point, because we created it two lines above and also we ran `CREATE TABLE .. test.tbl_sample_kv` which would have failed if the keyspace `test` didn't exist - so it must exit, no?

In the past, we had a bug where the running `CREATE KEYSPACE IF NOT EXISTS` forgot to set the "schema modified" event in the response so it failed to wait for schema agreement, but 1. we fixed this bug (https://github.com/scylladb/scylladb/pull/18819 by @nuivall ) and 2. this bug didn't happen in this case, where CREATE TABLE deed had work to do.

But I just realized something... Our fix in https://github.com/scylladb/scylladb/pull/18819 only applies to CREATE KEYSPACE / TABLE / VIEW / TYPE statements. It wasn't applied to `DROP KEYSPACE` - and it should have been....

But I don't have a good theory how a bug like https://github.com/scylladb/scylladb/pull/18819 can explain this specific test failure. Different schema operations are already linearized, so if a `CREATE TABLE test.tbl_sample_kv` succeeded, I don't see how there could possibly be any earlier `DROP KEYSPACE test` that suddenly springs to life. Unless we have a serious bug in our raft-based schema operations.</body></comment_new>
<comment_new><author>@nyh</author><body>
Another bug we could have in theory is that the Python driver's async `cql.run_async` might have a bug where it is not waiting for the schema agreement despite being told to wait. If it doesn't wait for schema agreement, this can easily explain this bug:
1. the CREATE KEYSPACE, CREATE TABLE both are sent to node A, but
2. the last INSERT INTO is sent to node B which is not yet aware of this new keyspace and table, and fails.

Copilot claims that **execute_async() does have this bug!**

> For schema-altering statements, schema agreement (meaning all nodes agree on the new schema) is important before running follow-up operations, but this is enforced only by synchronous helpers like Session.execute(), not the asynchronous version.
> If you use execute_async() for schema operations, you are responsible for checking schema agreement yourself, using [Session.check_schema_agreement()](https://docs.datastax.com/en/developer/python-driver/latest/api/cassandra/cluster/#cassandra.cluster.Session.check_schema_agreement) or (in newer code) ResponseFuture.check_schema_agreement.
> According to [a discussion on the DataStax support forum](https://support.datastax.com/s/article/Does-the-Python-Driver-for-Cassandra-Wait-for-Schema-Agreement-after-a-Schema-Change?language=en_US) and the [driver’s source code](7f12a5e1c6/cassandra/cluster.py (L487)), schema agreement is not ch...

</details>

<!-- START COPILOT CODING AGENT SUFFIX -->

- Fixes scylladb/scylladb#27501

<!-- START COPILOT CODING AGENT TIPS -->
---

Closes scylladb/scylladb#27604

* github.com:scylladb/scylladb:
  Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
  Initial plan
2025-12-12 13:20:49 +02:00
copilot-swe-agent[bot]
77ee7f3417 Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
This reverts commit 8192f45e84.

The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.

The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
2025-12-12 03:55:13 +00:00
copilot-swe-agent[bot]
0ff89a58be Initial plan 2025-12-12 03:48:12 +00:00
Yaron Kaikov
f7ffa395a8 workflows: trigger CI automatically when conflicts label is removed
Add pull_request_target event with unlabeled type to trigger-scylla-ci
workflow. This allows automatic CI triggering when the 'conflicts' label
is removed from a PR, in addition to the existing manual trigger via
comment.

The workflow now runs when:
- A user posts a comment with '@scylladbbot trigger-ci' (existing)
- The 'conflicts' label is removed from a PR (new)

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-84

Closes scylladb/scylladb#27521
2025-12-11 16:48:06 +02:00
Piotr Smaron
3fa3b920de Update CODEOWNERS to remove redundant entries
Removing myself as I have no maintainer's permissions to review the code

Closes scylladb/scylladb#27576
2025-12-11 16:47:08 +02:00
Botond Dénes
e7ca52ee79 Merge 'api: storage_service/tablets/repair: disable incremental repair by default' from Benny Halevy
Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414

** Backport to 2025.4 where 611918056a was introduced **

Closes scylladb/scylladb#27530

* github.com:scylladb/scylladb:
  api: storage_service/tablets/repair: disable incremental repair by default
  docs: nodetool-commands: cluster: repair: fix incremental-mode example
2025-12-11 15:23:09 +02:00
Botond Dénes
730eca5dac Merge 'Remove noexcept from storage_group and table functions to allow exception propagation' from null
Fixed a critical bug where `storage_group::for_each_compaction_group()` was incorrectly marked `noexcept`, causing `std::terminate` when actions threw exceptions (e.g., `utils::memory_limit_reached` during memory-constrained reader creation).

**Changes made:**
1. Removed `noexcept` from `storage_group::for_each_compaction_group()` declaration and implementation
2. Removed `noexcept` from `storage_group::compaction_groups()` overloads (they call for_each_compaction_group)
3. Removed `noexcept` from `storage_group::live_disk_space_used()` and `memtable_count()` (they call compaction_groups())
4. Kept `noexcept` on `storage_group::flush()` - it's a coroutine that automatically captures exceptions and returns them as exceptional futures
5. Removed `noexcept` from `table_load_stats()` functions in base class, table, and storage group managers

**Rationale:**

As noted by reviewers, there's no reason to kill the server if these functions throw. For coroutines returning futures, `noexcept` is appropriate because Seastar automatically captures exceptions and returns them as exceptional futures. For other functions, proper exception handling allows the system to recover gracefully instead of terminating.

Fixes #27475

Closes scylladb/scylladb#27476

* github.com:scylladb/scylladb:
  replica: Remove unnecessary noexcept
  replica: Remove noexcept from compaction_groups() functions
  replica: Remove noexcept from storage_group::for_each_compaction_group
2025-12-11 15:17:35 +02:00
Benny Halevy
c8cff94a5a api: storage_service/tablets/repair: disable incremental repair by default
Change the default incremental_mode to `disabled` due to
https://github.com/scylladb/scylladb/issues/26041 and
https://github.com/scylladb/scylladb/issues/27414

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-11 14:25:21 +02:00
Benny Halevy
5fae4cdf80 docs: nodetool-commands: cluster: repair: fix incremental-mode example
There is no 'regular' incremental mode anymore.
The example seems have meant 'disabled'.

Fixes #27587

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-11 14:25:11 +02:00
Marcin Maliszkiewicz
8bbcaacba1 auth: always catch by const reference
This is best practice.

Closes scylladb/scylladb#27525
2025-12-11 12:42:30 +01:00
Yaron Kaikov
3dfa5ebd7f Add JIRA issue validation to backport PR fixes check
Extend the Fixes validation pattern to also accept JIRA issue references
(format: [A-Z]+-\d+) in addition to GitHub issue references. This allows
backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'.

Fixes: https://github.com/scylladb/scylladb/issues/27571

Closes scylladb/scylladb#27572
2025-12-11 12:23:16 +02:00
Avi Kivity
24264e24bb Revert "repair: Add tablet repair progress report support"
This reverts commit faad0167d7. It causes
a regression in

test_two_tablets_concurrent_repair_and_migration_repair_writer_level

in debug mode (with ~5%-10% probability).

Fixes #27510.

Closes scylladb/scylladb#27560
2025-12-11 12:18:11 +02:00
Nadav Har'El
0c64e3be9a Merge 'Unify and fix rjson string and string_view conversions' from Marcin Maliszkiewicz
This patch-set consolidates and corrects rjson string conversion handling.
It removes unnecessary string copies, ensures proper length usage and
replaces ad-hoc conversions with consistent helper functions.

Overall, the changes make rjson string handling safer, faster, and more uniform across the codebase.

Backport: no, it's a refactor

Closes scylladb/scylladb#27394

* github.com:scylladb/scylladb:
  fix rjson::value to bytes conversion with missing GetStringLength call
  alternator: change type from string to string_view in should_add_capacity
  fix rjson::value to string_view conversion with missing GetStringLength call
  use rjson::to_string_view when rjson::value gets converted using GetStringLength
  use rjson::to_sstring and rjson::to_string for various string conversions
  utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds
  utils: move rjson::to_string_view func to string related place
  utils: add to_sstring and to_string rjson helper
2025-12-11 12:05:41 +02:00
Nadav Har'El
b3b0860e7c test/alternator: add reproducer for bug with storing invalid values
This patch adds a reproducer for a long-known bug, #8070, where
Alternator can store invalid values which are just blindly stored as
JSON, and we will only see the failure when reading the item back -
and either the client will fail to parse it, or sometimes even Alternator's
own code (e.g., FilterExpression) will fail to parse it. The right
behavior is to fail the write - not the read.

The included test checks writing different kinds of invalid values using
PutItem, UpdateItem, and BatchWriteItem. The new tests pass on DynamoDB,
but fail on Alternator so marked as "xfail".

Refs #8070.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:58:22 +02:00
Nadav Har'El
db15c212a6 test/alternator: reproducer for issue 27375
This patch adds a reproducer for issue #27375, where even with
alternator_streams_increased_compatibility set to true, if an attribute
is set to the same value it had but using a different JSON
representation - a Alternator Streams event is unduly produced.
For example, if a map {'dog': 1, 'cat': 2} is changed to
{'cat': 2, 'dog': 1}, this non-change should not be reported.

The new test added in this patch passes on DynamoDB (an event
is not generated) but fails on Alternator (an event is generated),
so the new test is marked with xfail.

Refs #27375.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:34:19 +02:00
Nadav Har'El
3595941020 utils/rjson: fix error messages from rjson::parse()
rjson::parse() when parsing JSON stored in a chunked_content (a vector
of temporary buffers) failed to initialize its byte counter to 0,
resulting in garbage positions in error messages like:

  Parsing JSON failed: Missing a name for object member. at 1452254

These error messages were most noticable in Alternator, which parses
JSON requests using a chunked_content, and reports these errors back
to the user.

The fix is trivial: add the missing initialization of the counter.

The patch also adds a regression test for this bug - it sends a JSON
corrupt at position 1, and expect to see "at 1" and not some large
random number.

Fixes #27372

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:17:01 +02:00
Nadav Har'El
102516a787 test/alternator: extract get_signed_request() to util.py
get_signed_request() started in test_manual_requests.py as a way to sign
a manually-created DynamoDB-API request - for sending requests that boto3
can't.

Over time, we started to use this function in additional test files, and
it's about time to move it to util.py - which is more natural to import
from multiple files.

This patch also adds a new function, manual_request(), which combines
get_signed_request() and actually sending the request via
requests.post(). New tests should prefer it, because it's easier to use.
We'll use the new function in tests that we add in the next patches.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-12-11 11:16:42 +02:00
Marcin Maliszkiewicz
d5b63df46e transport: remove redundant futurize_invoke from counted data sink and source
Closes scylladb/scylladb#27526
2025-12-11 10:32:16 +03:00
Dario Mirovic
f545ed37bc test: dtest: audit_test.py: fix audit error log detection
`test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py`
has an insert statement that is expected to fail. Dtest environment uses
`FlakyRetryPolicy`, which has `max_retries = 5`. 1 initial fail and 5 retry fails
means we expect 6 error audit logs.

The test failed because `create keyspace ks` failed once, then succeeded on retry.
It allowed the test to proceed properly, but the last part of the test that expects
exactly 6 failed queries actually had 7.

The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed
queries, counting only the query expected to fail. If other queries fail with
successful retry, it's fine. If other queries fail without successful retry, the test
will fail, as it should in such situations. They are not related to this expected
failed insert statement.

Fixes #27322

Closes scylladb/scylladb#27378
2025-12-11 10:17:07 +03:00
Benny Halevy
5f13880a91 utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout
When waiting for the condition variable times out
we call on_internal_error, but unfortunately, the backtrace
it generates is obfuscated by
`coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`.

To make the log more useful, print the error injection name
and the caller's source_location in the timeout error message.

Fixes #27531

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27532
2025-12-10 23:25:54 +01:00
Calle Wilund
8c4ac457af test::cluster::dtest::tools::files: Remove file
This contained only one routine; `corrupt_file`, which is
highly problematic, and not used. If you want to "corrupt" a
file, it should be done controlled, not at random.
2025-12-10 15:37:04 +01:00
Calle Wilund
e48170ca8e commitlog_replay: Handle fully corrupt files same as partial corruption.
Fixes #26744

If a segment to replay is broken such that the main header is not zero,
but still broken, we throw header_checksum_error. This was not handled in
replayer, which grouped this into the "user error/fundamental problem"
category. However, assuming we allow for "real" disk corruption, this should
really be treated same as data corruption, i.e. reported data loss, not
failure to start up.

The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes
provoked this issue, by doing random file wrecking, which on rare occasions
provoked this, and thus failed test due to scylla not starting up, instead
of loosing data as expected.

Changed test to consistently cause this exact error instead.
2025-12-10 15:37:04 +01:00
Andrzej Jackowski
11ad32c85e test/boost: add too_long_password to auth_passwords_test
The test documents the current behavior of hashing algorithms that
fail if the passphrase has 512 bytes or more.

Moreover, it documents the behavior of the current bcrypt
implementation that compares only the first 72 bytes of the password.
Although we don't typically use bcrypt for password hashing, it is
possible to insert such a hash using
`CREATE ROLE ... WITH HASHED PASSWORD ...`.

Refs: scylladb/scylladb#26842
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
4c8c9cd548 test/boost: add same_hashes_as_crypt_r to auth_passwords_test
The test verifies that the old and new implementation of SHA-512
hashing returns exactly the same values.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
98f431dd81 auth: utils: add yielding to crypt_sha512
This change allows yielding during hashing computations to prevent
stalls.

The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
 - Alien thread: 41.5 new connections/s per shard
 - Reverted alien thread: 244.1 new connections/s per shard
 - This commit (yielding in hashing): 198.4 new connections/s per shard

The alien thread is limited by a single-core hashing throughput,
which is roughly 400-500 hashes/s in the test environment. Therefore,
with smp=10, the throughput is below 50 hashes/s, and the difference
between the alien thread and other solutions further increases with
higer smp.

The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.

On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
 - Alien thread: 109 stalls/shard total
 - Reverted alien thread: 13186 stalls/shard total
 - This commit (yielding in hashing): 149 stalls/shard total

Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
 - Alien thread: 1087 ms/shard total
 - Reverted alien thread: 72839 ms/shard total
 - This commit (yielding in hashing): 1623 ms/shard total

To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.

Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
4ffdb0721f auth: change return type of passwords::check to future
Introduce a new `passwords::hash_with_salt_async` and change the return
type of `passwords::check` to `future<bool>`. This enables yielding
during password computations later in this patch series.

The old method, `hash_with_salt`, is marked as deprecated because
new code should use the new `hash_with_salt_async` function.
We are not removing `hash_with_salt` now to reduce the regression risk
of changing the hashing implementation—at least the methods that change
persistent hashes (CREATE, ALTER) will continue to use the old hashing
method. However, in the future, `hash_with_salt` should be entirely
removed.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
775906d749 auth: remove code duplication in verify_scheme
Refactoring: create a new function `verify_hashing_output` to reuse
code in `hash_with_salt` and `verify_scheme`. The change is introduced
to facilitate verification of hashing output when the implementation
is extended later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
11eca621b0 test/boost: coroutinize auth_passwords_test
This commit prepares `auth_passwords_test` for using coroutines,
because later in this patch series `auth::passwords::check` and other
similar functions will return Seastar futures.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
d7818b56df utils: coroutinize crypt_sha512
Change `sha512crypt` and `__crypt_sha512` to coroutines to allow
yielding during hash computations later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
033fed5734 utils: make crypt_sha512.cc to compile
The purpose of this change is to allow the usage of Seastar futures in
crypt_sha512 later in this patch series.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
c6c30b7d0a utils: license: import crypt_sha512.c from musl to the project
This patch imports the `crypt_sha512.c` file from the musl library.
We need it to incorporate yielding in the `crypt_r` function to avoid
reactor stalls during long hashing computations.

Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.

Both `crypt_sha512.c` and musl license are obtained from
git.musl-libc.org:
 - https://git.musl-libc.org/cgit/musl/tree/src/crypt/crypt_sha512.c
 - https://git.musl-libc.org/cgit/musl/tree/COPYRIGHT

Import commit:
  commit 1b76ff0767d01df72f692806ee5adee13c67ef88
  Author: Alex Rønne Petersen <alex@alexrp.com>
  Date:   Sun Oct 12 05:35:19 2025 +0200

  s390x: shuffle register usage in __tls_get_offset to avoid r0 as address

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
5afcec4a3d Revert "auth: move passwords::check call to alien thread"
The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (scylladb/scylladb#24524). However, because
there is only one alien thread, overall hashing throughput was reduced
(see, e.g., scylladb/scylla-enterprise#5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding will be introduced later in this patch series.

This reverts commit 9574513ec1.
2025-12-10 15:36:09 +01:00
Calle Wilund
9b5f3d12a3 test::pylib::suite::base: Split options.name test specifier only once
For some arcane reason, we split optional the test pattern given to
test.py twice across '::' to get the file + case specifiers later given
to pytest etc. This means that for a test with a class group (such as some
migrated dtests), we cannot really specify the exact test to run
(pattern <file>::<class>::test).

Simply splitting only on first '::' fixes this. Should not affect any
other tests.
2025-12-10 15:35:12 +01:00
Tomasz Grabiec
0e51a1f812 replica: Remove unnecessary noexcept
Can potentially lead to unnecessary abort.

compaction_groups() and for_each_compaction_group() can throw.

Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>
2025-12-10 14:51:35 +01:00
Tomasz Grabiec
8b807b299e replica: Remove noexcept from compaction_groups() functions
They can throw during merge, when the number of compaction groups
is higher than 3.

Callers can deal with that, so we shouldn't abort.
2025-12-10 14:48:23 +01:00
Tomasz Grabiec
07ff659849 replica: Remove noexcept from storage_group::for_each_compaction_group
They don't really have to be noexcept.

And "action" may actually throw, leading to abort.

It was observed to throw when creating memtable readers:

terminate called after throwing an instance of 'utils::memory_limit_reached'
   what():  kill limit triggered on semaphore sl:users by permit xxx
Aborting on shard 4, in scheduling group sl:users.

std::terminate() at ??:0
__clang_call_terminate at main.cc:0
replica::storage_group::for_each_compaction_group(std::function<void (seastar::lw_shared_ptr<replica::compaction_group> const&)>) const at ./replica/table.cc:920
 (inlined by) replica::table::add_memtables_to_reader_list(std::vector<mutation_reader, std::allocator<mutation_reader>>&, seastar::lw_shared_ptr<schema const> const&, reader_permit const&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr const&, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>, std::function<void (unsigned long)>) const at ./replica/table.cc:196
 (inlined by) replica::table::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:243
 (inlined by) replica::table::as_mutation_source() const::$_0::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:3673
 (inlined by) mutation_reader std::__invoke_impl<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(std::__invoke_other, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>, mutation_reader>::type std::__invoke_r<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:114
 (inlined by) std::_Function_handler<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>), replica::table::as_mutation_source() const::$_0>::_M_invoke(std::_Any_data const&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290
 (inlined by) std::function<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>)>::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591
 (inlined by) mutation_source::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ././readers/mutation_source.hh:143
query::querier_base::querier_base(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, mutation_source const&, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:91
 (inlined by) query::querier::querier(mutation_source const&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:164
 (inlined by) replica::table::query(seastar::lw_shared_ptr<schema const>, reader_permit, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, query::result_memory_limiter&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::optional<query::querier>*) at ./replica/table.cc:3583
replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0::operator()(reader_permit) const at ./replica/database.cc:1533
 (inlined by) seastar::noncopyable_function<seastar::future<void> (reader_permit)>::indirect_vtable_for<replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0>::call(seastar::noncopyable_function<seastar::future<void> (reader_permit)> const*, reader_permit) (.llvm.13537529942037499926) at ././seastar/include/seastar/util/noncopyable_function.hh:158
seastar::noncopyable_function<seastar::future<void> (reader_permit)>::operator()(reader_permit) const at ././seastar/include/seastar/util/noncopyable_function.hh:215
 (inlined by) reader_concurrency_semaphore::execution_loop() (.resume) at ./reader_concurrency_semaphore.cc:980
std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242
 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ./build/release/seastar/./seastar/include/seastar/core/coroutine.hh:122
 (inlined by) seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2627
 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3099
seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3267
seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0::operator()() const at ./build/release/seastar/./seastar/src/core/reactor.cc:4591
 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61
 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>, void>::type std::__invoke_r<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111
 (inlined by) std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290
std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591

Fixes #27475

Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>
2025-12-10 14:48:11 +01:00
Marcin Maliszkiewicz
be9992cfb3 fix rjson::value to bytes conversion with missing GetStringLength call 2025-12-09 19:27:22 +01:00
Marcin Maliszkiewicz
daf00a7f24 alternator: change type from string to string_view in should_add_capacity
It avoids allocation.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
62962f33bb fix rjson::value to string_view conversion with missing GetStringLength call
In some cases we unnecessarily convert to string which
causes a copy. In other we convert without calling
GetStringLength which causes iteration to dermine length
which is already known. In some cases we do even both.
This commit fixes that.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
060c2f7c0d use rjson::to_string_view when rjson::value gets converted using GetStringLength
This commit is only cosmetics, changes calls to GetStringLength
into rjson::to_string_view with the same underlying implementation.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
64149b57c3 use rjson::to_sstring and rjson::to_string for various string conversions
In some cases we ommit size checking which is wrong
as according to rapid json documentation strings may
contain \0 byte in the middle.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
4b004fcdfc utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds
So that we can use our common utility functions.
2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
5e38b3071b utils: move rjson::to_string_view func to string related place 2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz
225b3351fc utils: add to_sstring and to_string rjson helper
So that conversion code is common and it's easier
to avoid accidental type conversions. Additionally
according to rapid json library size must be checked
explicitly, this also avoids extra iteration in char*
to (s)string conversion.
2025-12-09 19:27:21 +01:00
Botond Dénes
e762027943 db/config: change batchlog_replay_cleanup_after_replays default to 1
Now that batchlog cleanup is cheap, on account of memtable flush on the
system.batchlog table garbage-collecting tombstones (previous patch), we
can afford to do cleanup on each replay, keeping the memtable size small
and more importantly -- the amount of tombstones in the memtable small.
2025-12-02 14:21:26 +02:00
Botond Dénes
8edd5b80ab test/boost/batchlog_manager_test: add test for batchlog cleanup
Add more tests covering different aspects of batchlog replay, cleanup,
replay timeout and finally v1 -> v2 migration.
2025-12-02 14:21:26 +02:00
Botond Dénes
fb84b30f88 replica/mutation_dump: always set position weight for clustering positions
SELECT * FROM MUTATION_FRAGMENTS() queries have a transformed schema
(mutation-fragment schema), which is a superset of that of the queried
table's. The mutation fragment schema represents position_in_partition
of mutation fragments expressed as clustering columns. This presents
some challenges, as some position_in_partition fields are null for some
positions. This was solved by setting these clustering keys components
to bytes{}. In the process, a mistake was made: when the clustering key
is missing in the position_in_partition, the position_weight is also set
to bytes{}. This is not correct, it is possible for some positions to
have no key but to still have a position_weight. An example is
position_in_partition::before_all_clustered_rows().
Fix this by always filling in the position_weight for positions which
have region() == clustered, instead of the earlier condition on the key
presence.

This is a minor bug affecting range tombstone changes at the two
extremes: position_in_partition::{before,after}_all_clustered_rows(). In
both cases, the position_weight can be deduced by a human looking at the
results, based on the position of the range tombstone change, relative
to other fragments.
2025-12-02 14:21:26 +02:00
Botond Dénes
8545f7eedd service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
Rename to make it more explicit where the error injection happens.
Also change how the error is injected, use the lambda overload instead
of is_enabled(), the former leaves better trace in logs, which helps
when debugging tests.
2025-12-02 14:21:26 +02:00
Botond Dénes
e52e1f842e test/lib: introduce error_injection.hh
Test-specific helpers for working with error injection. Provides an
RAII object to enable/disable error injection points, on all shards.
2025-12-02 14:21:26 +02:00
Botond Dénes
0a7df4b8ac utils/error_injection: add debug log to disable() and disable_all()
enable() and friends already has debug logs.
2025-12-02 14:21:26 +02:00
Botond Dénes
9bb8156f02 test/lib/cql_test_env: forward config to batchlog
Currently all batchlog config items are hardcoded. Make the two
important ones configurable: replay_timeout and
replay_cleanup_after_replays.
2025-12-02 14:21:26 +02:00
Botond Dénes
d1b796bc43 test/lib/cql_test_env: add batch type to execute_batch()
Allow executing logged batches too. Also add a trace log to the method.
2025-12-02 14:21:26 +02:00
Botond Dénes
1ad64731bc test/lib/cql_assertions: add with_size(predicate) overload
Sometimes, expected size is not a single number. Add predicate overload
to allow expressing more complicated expectations.
2025-12-02 14:21:26 +02:00
Botond Dénes
abadb8ebfb test/lib/cql_assertions: add source location to fail messages
For tests that contain multiple assert_that() invokations, identifying
the one that failed is very challenging. Add source location to fail
messages to allow convenient identification of the call-site.
2025-12-02 14:21:26 +02:00
Botond Dénes
54f16f9019 test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
Invoke the passed in function with a columns_assertions instance for the
current row, allowing for sweeping checks across columns of all rows.
2025-12-02 14:21:26 +02:00
Botond Dénes
b584e1e18e test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check 2025-12-02 14:21:26 +02:00
Botond Dénes
aa1d3f1170 test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
To enable assertions on columns which are sometimes null.
One existing user of with_typed_column() needs adjustment, because the
previous version of with_typed_column() covered up silently for null
value, but after this patch this caused a failure.
2025-12-02 14:21:26 +02:00
Botond Dénes
e309b5dbe1 db/batchlog_manager: config: s/write_timeout/reply_timeot/
Although the value of this item is indeed derived from the write timeout
config, the name doesn't reflect what it is used for. Change it to
reflect it better.
2025-12-02 14:21:26 +02:00
Botond Dénes
846b656610 db,service: switch to system.batchlog_v2
New batchlogs are written to the batchlog_v2 table and replay also uses
the v2 table.
The content of system.batchlog is attempted to be migrated to
system.batchlog_v2 after each start of the batchlog_manager service.
The migration is retried on each replay if it fails. This is reduntant
but simple.

Batchlog cleanup now doesn't involve flushing memtables, the only
remaining user of replica/database.hh is gone, so the include is
dropped.
2025-12-02 14:21:26 +02:00
Botond Dénes
ee851266be db/system_keyspace: introduce system.batchlog_v2
Rearranges the system.batchlog schema as follows:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

With the following goals:
1) Make post-replay batchlog cleanup possible with a simple
   range-tombstone. This allows dropping the individual dead batchlog
   entries, as they are shadowed by a higher level tombstone. This
   enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component:
   batchlog entries that fail the first replay attempt, are moved to the
   failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard
   column.
4) Make batchlog entries ordered by the batchlog create time (id). This
   allows for selecting batchlogs to replay, without post-filtering of
   batchlogs that are too young to be replayed.
2025-12-02 14:21:25 +02:00
Botond Dénes
9434ec2fd1 service,db: extract generation of batchlog delete mutation
Don't build batchlog delete mutations in storage-proxy code. Move this
code into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
   db/batchlog_manager.cc
2) Reduce current and future code duplication.
3) Make future changes to this logic easier.
2025-12-02 14:21:25 +02:00
Botond Dénes
f54602daf0 service,db: extract get_batchlog_mutation_for() from storage-proxy
Don't build batchlog mutations in storage-proxy code. Move this code
into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
   db/batchlog_manager.cc
2) Reduce current and future code duplication.
2) Make future changes to this logic easier.
2025-12-02 14:21:25 +02:00
Botond Dénes
097c2cd676 db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
The propagation delay has no effect for other tombstone gc strategies,
so ignore it when tombstone-gc != repair.
2025-12-02 14:21:25 +02:00
Botond Dénes
4f30807f01 db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
Just skip the mutation(s) whose tables were dropped instead.
Use the newly introduced data_dictionary::table::get_truncation_time()
to avoid looking up real table object.
2025-12-02 14:21:25 +02:00
Botond Dénes
55704908a0 data_dictionary: table: add get_truncation_time()
So the batchlog manager can avoid looking up the real table and instead
just work with data dictionary.
2025-12-02 14:21:25 +02:00
Botond Dénes
337f417b13 db/batchlog_manager: batch(): replace map_reduce() with simple loop
The map_reduce achieves no concurrency, both map and reduce are
synchronous. It only achieves two redundant lookups for the table and
hard-to-read code. Convert it into a simple loop. Preserve the
stall-protection by adding a maybe_yield() to the loop.
2025-12-02 12:05:10 +02:00
Botond Dénes
705af2bc16 db/batchlog_manager: finish coroutinizing replay_all_failed_batches
It was coroutinized already but strangely, some continuations also
remained. The `batch` lambda is still left in continuation style.
2025-12-02 10:42:28 +02:00
Botond Dénes
5b5f9120d0 db/batchlog_manager: improve replayAllFailedBatches logs
Add cleanup flag value to start message and drop cpu, it is redundant as
Scylla already adds the shard number to the logs.
Add all_replayed to finish message.
2025-12-02 10:42:28 +02:00
127 changed files with 2660 additions and 1136 deletions

8
.github/CODEOWNERS vendored
View File

@@ -1,5 +1,5 @@
# AUTH
auth/* @nuivall @ptrsmrn
auth/* @nuivall
# CACHE
row_cache* @tgrabiec
@@ -25,11 +25,11 @@ compaction/* @raphaelsc
transport/*
# CQL QUERY LANGUAGE
cql3/* @tgrabiec @nuivall @ptrsmrn
cql3/* @tgrabiec @nuivall
# COUNTERS
counters* @nuivall @ptrsmrn
tests/counter_test* @nuivall @ptrsmrn
counters* @nuivall
tests/counter_test* @nuivall
# DOCS
docs/* @annastuchlik @tzach

View File

@@ -18,7 +18,7 @@ jobs:
// Regular expression pattern to check for "Fixes" prefix
// Adjusted to dynamically insert the repository full name
const pattern = `Fixes:? (?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)`;
const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|([A-Z]+-\\d+))`;
const regex = new RegExp(pattern);
if (!regex.test(body)) {

View File

@@ -7,7 +7,7 @@ on:
- enterprise
paths:
- '**/*.cc'
- 'scripts/metrics-config.yml'
- 'scripts/metrics-config.yml'
- 'scripts/get_description.py'
- 'docs/_ext/scylladb_metrics.py'
@@ -15,20 +15,20 @@ jobs:
validate-metrics:
runs-on: ubuntu-latest
name: Check metrics documentation coverage
steps:
- name: Checkout code
uses: actions/checkout@v4
with:
submodules: true
- name: Set up Python
uses: actions/setup-python@v6
with:
python-version: '3.10'
- name: Install dependencies
run: pip install PyYAML
- name: Validate metrics
run: python3 scripts/get_description.py --validate -c scripts/metrics-config.yml

View File

@@ -3,10 +3,13 @@ name: Trigger Scylla CI Route
on:
issue_comment:
types: [created]
pull_request_target:
types:
- unlabeled
jobs:
trigger-jenkins:
if: github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')
if: (github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')) || github.event.label.name == 'conflicts'
runs-on: ubuntu-latest
steps:
- name: Trigger Scylla-CI-Route Jenkins Job

View File

@@ -42,7 +42,7 @@ comparison_operator_type get_comparison_operator(const rjson::value& comparison_
if (!comparison_operator.IsString()) {
throw api_error::validation(fmt::format("Invalid comparison operator definition {}", rjson::print(comparison_operator)));
}
std::string op = comparison_operator.GetString();
std::string op = rjson::to_string(comparison_operator);
auto it = ops.find(op);
if (it == ops.end()) {
throw api_error::validation(fmt::format("Unsupported comparison operator {}", op));
@@ -377,8 +377,8 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara
return cmp(unwrap_number(*v1, cmp.diagnostic), unwrap_number(v2, cmp.diagnostic));
}
if (kv1.name == "S") {
return cmp(std::string_view(kv1.value.GetString(), kv1.value.GetStringLength()),
std::string_view(kv2.value.GetString(), kv2.value.GetStringLength()));
return cmp(rjson::to_string_view(kv1.value),
rjson::to_string_view(kv2.value));
}
if (kv1.name == "B") {
auto d_kv1 = unwrap_bytes(kv1.value, v1_from_query);
@@ -470,9 +470,9 @@ static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const r
return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);
}
if (kv_v.name == "S") {
return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),
std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),
std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),
return check_BETWEEN(rjson::to_string_view(kv_v.value),
rjson::to_string_view(kv_lb.value),
rjson::to_string_view(kv_ub.value),
bounds_from_query);
}
if (kv_v.name == "B") {

View File

@@ -8,6 +8,8 @@
#include "consumed_capacity.hh"
#include "error.hh"
#include "utils/rjson.hh"
#include <fmt/format.h>
namespace alternator {
@@ -32,12 +34,12 @@ bool consumed_capacity_counter::should_add_capacity(const rjson::value& request)
if (!return_consumed->IsString()) {
throw api_error::validation("Non-string ReturnConsumedCapacity field in request");
}
std::string consumed = return_consumed->GetString();
std::string_view consumed = rjson::to_string_view(*return_consumed);
if (consumed == "INDEXES") {
throw api_error::validation("INDEXES consumed capacity is not supported");
}
if (consumed != "TOTAL") {
throw api_error::validation("Unknown consumed capacity "+ consumed);
throw api_error::validation(fmt::format("Unknown consumed capacity {}", consumed));
}
return true;
}

View File

@@ -419,7 +419,7 @@ static std::optional<std::string> find_table_name(const rjson::value& request) {
if (!table_name_value->IsString()) {
throw api_error::validation("Non-string TableName field in request");
}
std::string table_name = table_name_value->GetString();
std::string table_name = rjson::to_string(*table_name_value);
return table_name;
}
@@ -546,7 +546,7 @@ get_table_or_view(service::storage_proxy& proxy, const rjson::value& request) {
// does exist but the index does not (ValidationException).
if (proxy.data_dictionary().has_schema(keyspace_name, orig_table_name)) {
throw api_error::validation(
fmt::format("Requested resource not found: Index '{}' for table '{}'", index_name->GetString(), orig_table_name));
fmt::format("Requested resource not found: Index '{}' for table '{}'", rjson::to_string_view(*index_name), orig_table_name));
} else {
throw api_error::resource_not_found(
fmt::format("Requested resource not found: Table: {} not found", orig_table_name));
@@ -587,7 +587,7 @@ static std::string get_string_attribute(const rjson::value& value, std::string_v
throw api_error::validation(fmt::format("Expected string value for attribute {}, got: {}",
attribute_name, value));
}
return std::string(attribute_value->GetString(), attribute_value->GetStringLength());
return rjson::to_string(*attribute_value);
}
// Convenience function for getting the value of a boolean attribute, or a
@@ -1080,8 +1080,8 @@ static void add_column(schema_builder& builder, const std::string& name, const r
}
for (auto it = attribute_definitions.Begin(); it != attribute_definitions.End(); ++it) {
const rjson::value& attribute_info = *it;
if (attribute_info["AttributeName"].GetString() == name) {
auto type = attribute_info["AttributeType"].GetString();
if (rjson::to_string_view(attribute_info["AttributeName"]) == name) {
std::string_view type = rjson::to_string_view(attribute_info["AttributeType"]);
data_type dt = parse_key_type(type);
if (computed_column) {
// Computed column for GSI (doesn't choose a real column as-is
@@ -1116,7 +1116,7 @@ static std::pair<std::string, std::string> parse_key_schema(const rjson::value&
throw api_error::validation("First element of KeySchema must be an object");
}
const rjson::value *v = rjson::find((*key_schema)[0], "KeyType");
if (!v || !v->IsString() || v->GetString() != std::string("HASH")) {
if (!v || !v->IsString() || rjson::to_string_view(*v) != "HASH") {
throw api_error::validation("First key in KeySchema must be a HASH key");
}
v = rjson::find((*key_schema)[0], "AttributeName");
@@ -1124,14 +1124,14 @@ static std::pair<std::string, std::string> parse_key_schema(const rjson::value&
throw api_error::validation("First key in KeySchema must have string AttributeName");
}
validate_attr_name_length(supplementary_context, v->GetStringLength(), true, "HASH key in KeySchema - ");
std::string hash_key = v->GetString();
std::string hash_key = rjson::to_string(*v);
std::string range_key;
if (key_schema->Size() == 2) {
if (!(*key_schema)[1].IsObject()) {
throw api_error::validation("Second element of KeySchema must be an object");
}
v = rjson::find((*key_schema)[1], "KeyType");
if (!v || !v->IsString() || v->GetString() != std::string("RANGE")) {
if (!v || !v->IsString() || rjson::to_string_view(*v) != "RANGE") {
throw api_error::validation("Second key in KeySchema must be a RANGE key");
}
v = rjson::find((*key_schema)[1], "AttributeName");
@@ -1799,6 +1799,11 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
}
}
}
// Creating an index in tablets mode requires the rf_rack_valid_keyspaces option to be enabled.
// GSI and LSI indexes are based on materialized views which require this option to avoid consistency issues.
if (!view_builders.empty() && ksm->uses_tablets() && !sp.data_dictionary().get_config().rf_rack_valid_keyspaces()) {
co_return api_error::validation("GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
}
try {
schema_mutations = service::prepare_new_keyspace_announcement(sp.local_db(), ksm, ts);
} catch (exceptions::already_exists_exception&) {
@@ -1887,8 +1892,8 @@ future<executor::request_return_type> executor::create_table(client_state& clien
std::string def_type = type_to_string(def.type);
for (auto it = attribute_definitions.Begin(); it != attribute_definitions.End(); ++it) {
const rjson::value& attribute_info = *it;
if (attribute_info["AttributeName"].GetString() == def.name_as_text()) {
auto type = attribute_info["AttributeType"].GetString();
if (rjson::to_string_view(attribute_info["AttributeName"]) == def.name_as_text()) {
std::string_view type = rjson::to_string_view(attribute_info["AttributeType"]);
if (type != def_type) {
throw api_error::validation(fmt::format("AttributeDefinitions redefined {} to {} already a key attribute of type {} in this table", def.name_as_text(), type, def_type));
}
@@ -2019,6 +2024,10 @@ future<executor::request_return_type> executor::update_table(client_state& clien
co_return api_error::validation(fmt::format(
"LSI {} already exists in table {}, can't use same name for GSI", index_name, table_name));
}
if (p.local().local_db().find_keyspace(keyspace_name).get_replication_strategy().uses_tablets() &&
!p.local().data_dictionary().get_config().rf_rack_valid_keyspaces()) {
co_return api_error::validation("GlobalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
}
elogger.trace("Adding GSI {}", index_name);
// FIXME: read and handle "Projection" parameter. This will
@@ -2362,7 +2371,7 @@ put_or_delete_item::put_or_delete_item(const rjson::value& item, schema_ptr sche
_cells = std::vector<cell>();
_cells->reserve(item.MemberCount());
for (auto it = item.MemberBegin(); it != item.MemberEnd(); ++it) {
bytes column_name = to_bytes(it->name.GetString());
bytes column_name = to_bytes(rjson::to_string_view(it->name));
validate_value(it->value, "PutItem");
const column_definition* cdef = find_attribute(*schema, column_name);
validate_attr_name_length("", column_name.size(), cdef && cdef->is_primary_key());
@@ -2783,10 +2792,10 @@ static void verify_all_are_used(const rjson::value* field,
return;
}
for (auto it = field->MemberBegin(); it != field->MemberEnd(); ++it) {
if (!used.contains(it->name.GetString())) {
if (!used.contains(rjson::to_string(it->name))) {
throw api_error::validation(
format("{} has spurious '{}', not used in {}",
field_name, it->name.GetString(), operation));
field_name, rjson::to_string_view(it->name), operation));
}
}
}
@@ -3000,7 +3009,7 @@ future<executor::request_return_type> executor::delete_item(client_state& client
}
static schema_ptr get_table_from_batch_request(const service::storage_proxy& proxy, const rjson::value::ConstMemberIterator& batch_request) {
sstring table_name = batch_request->name.GetString(); // JSON keys are always strings
sstring table_name = rjson::to_sstring(batch_request->name); // JSON keys are always strings
try {
return proxy.data_dictionary().find_schema(sstring(executor::KEYSPACE_NAME_PREFIX) + table_name, table_name);
} catch(data_dictionary::no_such_column_family&) {
@@ -3386,7 +3395,7 @@ static bool hierarchy_filter(rjson::value& val, const attribute_path_map_node<T>
}
rjson::value newv = rjson::empty_object();
for (auto it = v.MemberBegin(); it != v.MemberEnd(); ++it) {
std::string attr = it->name.GetString();
std::string attr = rjson::to_string(it->name);
auto x = members.find(attr);
if (x != members.end()) {
if (x->second) {
@@ -3606,7 +3615,7 @@ static std::optional<attrs_to_get> calculate_attrs_to_get(const rjson::value& re
const rjson::value& attributes_to_get = req["AttributesToGet"];
attrs_to_get ret;
for (auto it = attributes_to_get.Begin(); it != attributes_to_get.End(); ++it) {
attribute_path_map_add("AttributesToGet", ret, it->GetString());
attribute_path_map_add("AttributesToGet", ret, rjson::to_string(*it));
validate_attr_name_length("AttributesToGet", it->GetStringLength(), false);
}
if (ret.empty()) {
@@ -4272,12 +4281,12 @@ inline void update_item_operation::apply_attribute_updates(const std::unique_ptr
attribute_collector& modified_attrs, bool& any_updates, bool& any_deletes) const {
for (auto it = _attribute_updates->MemberBegin(); it != _attribute_updates->MemberEnd(); ++it) {
// Note that it.key() is the name of the column, *it is the operation
bytes column_name = to_bytes(it->name.GetString());
bytes column_name = to_bytes(rjson::to_string_view(it->name));
const column_definition* cdef = _schema->get_column_definition(column_name);
if (cdef && cdef->is_primary_key()) {
throw api_error::validation(format("UpdateItem cannot update key column {}", it->name.GetString()));
throw api_error::validation(format("UpdateItem cannot update key column {}", rjson::to_string_view(it->name)));
}
std::string action = (it->value)["Action"].GetString();
std::string action = rjson::to_string((it->value)["Action"]);
if (action == "DELETE") {
// The DELETE operation can do two unrelated tasks. Without a
// "Value" option, it is used to delete an attribute. With a
@@ -5474,7 +5483,7 @@ calculate_bounds_conditions(schema_ptr schema, const rjson::value& conditions) {
std::vector<query::clustering_range> ck_bounds;
for (auto it = conditions.MemberBegin(); it != conditions.MemberEnd(); ++it) {
std::string key = it->name.GetString();
sstring key = rjson::to_sstring(it->name);
const rjson::value& condition = it->value;
const rjson::value& comp_definition = rjson::get(condition, "ComparisonOperator");
@@ -5482,13 +5491,13 @@ calculate_bounds_conditions(schema_ptr schema, const rjson::value& conditions) {
const column_definition& pk_cdef = schema->partition_key_columns().front();
const column_definition* ck_cdef = schema->clustering_key_size() > 0 ? &schema->clustering_key_columns().front() : nullptr;
if (sstring(key) == pk_cdef.name_as_text()) {
if (key == pk_cdef.name_as_text()) {
if (!partition_ranges.empty()) {
throw api_error::validation("Currently only a single restriction per key is allowed");
}
partition_ranges.push_back(calculate_pk_bound(schema, pk_cdef, comp_definition, attr_list));
}
if (ck_cdef && sstring(key) == ck_cdef->name_as_text()) {
if (ck_cdef && key == ck_cdef->name_as_text()) {
if (!ck_bounds.empty()) {
throw api_error::validation("Currently only a single restriction per key is allowed");
}
@@ -5889,7 +5898,7 @@ future<executor::request_return_type> executor::list_tables(client_state& client
rjson::value* exclusive_start_json = rjson::find(request, "ExclusiveStartTableName");
rjson::value* limit_json = rjson::find(request, "Limit");
std::string exclusive_start = exclusive_start_json ? exclusive_start_json->GetString() : "";
std::string exclusive_start = exclusive_start_json ? rjson::to_string(*exclusive_start_json) : "";
int limit = limit_json ? limit_json->GetInt() : 100;
if (limit < 1 || limit > 100) {
co_return api_error::validation("Limit must be greater than 0 and no greater than 100");

View File

@@ -496,7 +496,7 @@ const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value&
return {"", nullptr};
}
auto it = v.MemberBegin();
const std::string it_key = it->name.GetString();
const std::string it_key = rjson::to_string(it->name);
if (it_key != "SS" && it_key != "BS" && it_key != "NS") {
return {std::move(it_key), nullptr};
}

View File

@@ -979,8 +979,9 @@ client_data server::ongoing_request::make_client_data() const {
// and keep "driver_version" unset.
cd.driver_name = _user_agent;
// Leave "protocol_version" unset, it has no meaning in Alternator.
// Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset for Alternator.
// Note: CQL sets ssl_protocol and ssl_cipher_suite via generic_server::connection base class.
// Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset.
// As reported in issue #9216, we never set these fields in CQL
// either (see cql_server::connection::make_client_data()).
return cd;
}

View File

@@ -93,7 +93,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
if (v->GetStringLength() < 1 || v->GetStringLength() > 255) {
co_return api_error::validation("The length of AttributeName must be between 1 and 255");
}
sstring attribute_name(v->GetString(), v->GetStringLength());
sstring attribute_name = rjson::to_sstring(*v);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::ALTER, _stats);
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {

View File

@@ -729,14 +729,6 @@
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"use_sstable_identifier",
"description":"Use the sstable identifier UUID, if available, rather than the sstable generation.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
},
@@ -3059,7 +3051,7 @@
},
{
"name":"incremental_mode",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
"required":false,
"allowMultiple":false,
"type":"string",

View File

@@ -2020,16 +2020,12 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
auto tag = req->get_query_param("tag");
auto column_families = split(req->get_query_param("cf"), ",");
auto sfopt = req->get_query_param("sf");
auto usiopt = req->get_query_param("use_sstable_identifier");
db::snapshot_options opts = {
.skip_flush = strcasecmp(sfopt.c_str(), "true") == 0,
.use_sstable_identifier = strcasecmp(usiopt.c_str(), "true") == 0
};
auto sf = db::snapshot_ctl::skip_flush(strcasecmp(sfopt.c_str(), "true") == 0);
std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");
try {
if (column_families.empty()) {
co_await snap_ctl.local().take_snapshot(tag, keynames, opts);
co_await snap_ctl.local().take_snapshot(tag, keynames, sf);
} else {
if (keynames.empty()) {
throw httpd::bad_param_exception("The keyspace of column families must be specified");
@@ -2037,7 +2033,7 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
if (keynames.size() > 1) {
throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");
}
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, opts);
co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, sf);
}
co_return json_void();
} catch (...) {
@@ -2072,8 +2068,7 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_
auto info = parse_scrub_options(ctx, std::move(req));
if (!info.snapshot_tag.empty()) {
db::snapshot_options opts = {.skip_flush = false, .use_sstable_identifier = false};
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, opts);
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, db::snapshot_ctl::skip_flush::no);
}
compaction::compaction_stats stats;

View File

@@ -146,8 +146,7 @@ void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::
auto info = parse_scrub_options(ctx, std::move(req));
if (!info.snapshot_tag.empty()) {
db::snapshot_options opts = {.skip_flush = false, .use_sstable_identifier = false};
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, opts);
co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, db::snapshot_ctl::skip_flush::no);
}
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

View File

@@ -9,7 +9,6 @@
#include "auth/allow_all_authenticator.hh"
#include "service/migration_manager.hh"
#include "utils/alien_worker.hh"
#include "utils/class_registrator.hh"
namespace auth {
@@ -23,7 +22,6 @@ static const class_registrator<
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&,
utils::alien_worker&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
cache&> registration("org.apache.cassandra.auth.AllowAllAuthenticator");
}

View File

@@ -14,7 +14,6 @@
#include "auth/authenticator.hh"
#include "auth/cache.hh"
#include "auth/common.hh"
#include "utils/alien_worker.hh"
namespace cql3 {
class query_processor;
@@ -30,7 +29,7 @@ extern const std::string_view allow_all_authenticator_name;
class allow_all_authenticator final : public authenticator {
public:
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&, utils::alien_worker&) {
allow_all_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&) {
}
virtual future<> start() override {

View File

@@ -35,14 +35,13 @@ static const class_registrator<auth::authenticator
, cql3::query_processor&
, ::service::raft_group0_client&
, ::service::migration_manager&
, auth::cache&
, utils::alien_worker&> cert_auth_reg(CERT_AUTH_NAME);
, auth::cache&> cert_auth_reg(CERT_AUTH_NAME);
enum class auth::certificate_authenticator::query_source {
subject, altname
};
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, auth::cache&, utils::alien_worker&)
auth::certificate_authenticator::certificate_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, auth::cache&)
: _queries([&] {
auto& conf = qp.db().get_config();
auto queries = conf.auth_certificate_role_queries();
@@ -77,9 +76,9 @@ auth::certificate_authenticator::certificate_authenticator(cql3::query_processor
throw std::invalid_argument(fmt::format("Invalid source: {}", map.at(cfg_source_attr)));
}
continue;
} catch (std::out_of_range&) {
} catch (const std::out_of_range&) {
// just fallthrough
} catch (boost::regex_error&) {
} catch (const boost::regex_error&) {
std::throw_with_nested(std::invalid_argument(fmt::format("Invalid query expression: {}", map.at(cfg_query_attr))));
}
}

View File

@@ -10,7 +10,6 @@
#pragma once
#include "auth/authenticator.hh"
#include "utils/alien_worker.hh"
#include <boost/regex_fwd.hpp> // IWYU pragma: keep
namespace cql3 {
@@ -34,7 +33,7 @@ class certificate_authenticator : public authenticator {
enum class query_source;
std::vector<std::pair<query_source, boost::regex>> _queries;
public:
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&, utils::alien_worker&);
certificate_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
~certificate_authenticator();
future<> start() override;

View File

@@ -94,7 +94,7 @@ static future<> create_legacy_metadata_table_if_missing_impl(
try {
co_return co_await mm.announce(co_await ::service::prepare_new_column_family_announcement(qp.proxy(), table, ts),
std::move(group0_guard), format("auth: create {} metadata table", table->cf_name()));
} catch (exceptions::already_exists_exception&) {}
} catch (const exceptions::already_exists_exception&) {}
}
}

View File

@@ -256,7 +256,7 @@ future<> default_authorizer::revoke_all(std::string_view role_name, ::service::g
} else {
co_await collect_mutations(_qp, mc, query, {sstring(role_name)});
}
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions of {}: {}", role_name, e);
}
}
@@ -293,13 +293,13 @@ future<> default_authorizer::revoke_all_legacy(const resource& resource) {
[resource](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
}
});
});
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
alogger.warn("CassandraAuthorizer failed to revoke all permissions on {}: {}", resource, e);
return make_ready_future();
}

View File

@@ -49,8 +49,7 @@ static const class_registrator<
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&,
utils::alien_worker&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
cache&> password_auth_reg("org.apache.cassandra.auth.PasswordAuthenticator");
static thread_local auto rng_for_salt = std::default_random_engine(std::random_device{}());
@@ -64,14 +63,13 @@ std::string password_authenticator::default_superuser(const db::config& cfg) {
password_authenticator::~password_authenticator() {
}
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache, utils::alien_worker& hashing_worker)
password_authenticator::password_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache)
: _qp(qp)
, _group0_client(g0)
, _migration_manager(mm)
, _cache(cache)
, _stopped(make_ready_future<>())
, _superuser(default_superuser(qp.db().get_config()))
, _hashing_worker(hashing_worker)
{}
static bool has_salted_hash(const cql3::untyped_result_set_row& row) {
@@ -330,20 +328,18 @@ future<authenticated_user> password_authenticator::authenticate(
}
salted_hash = role->salted_hash;
}
const bool password_match = co_await _hashing_worker.submit<bool>([password = std::move(password), salted_hash] {
return passwords::check(password, *salted_hash);
});
const bool password_match = co_await passwords::check(password, *salted_hash);
if (!password_match) {
throw exceptions::authentication_exception("Username and/or password are incorrect");
}
co_return username;
} catch (std::system_error &) {
} catch (const std::system_error &) {
std::throw_with_nested(exceptions::authentication_exception("Could not verify password"));
} catch (exceptions::request_execution_exception& e) {
} catch (const exceptions::request_execution_exception& e) {
std::throw_with_nested(exceptions::authentication_exception(e.what()));
} catch (exceptions::authentication_exception& e) {
} catch (const exceptions::authentication_exception& e) {
std::throw_with_nested(e);
} catch (exceptions::unavailable_exception& e) {
} catch (const exceptions::unavailable_exception& e) {
std::throw_with_nested(exceptions::authentication_exception(e.get_message()));
} catch (...) {
std::throw_with_nested(exceptions::authentication_exception("authentication failed"));

View File

@@ -18,7 +18,6 @@
#include "auth/passwords.hh"
#include "auth/cache.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/alien_worker.hh"
namespace db {
class config;
@@ -49,13 +48,12 @@ class password_authenticator : public authenticator {
shared_promise<> _superuser_created_promise;
// We used to also support bcrypt, SHA-256, and MD5 (ref. scylladb#24524).
constexpr static auth::passwords::scheme _scheme = passwords::scheme::sha_512;
utils::alien_worker& _hashing_worker;
public:
static db::consistency_level consistency_for_user(std::string_view role_name);
static std::string default_superuser(const db::config&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&, utils::alien_worker&);
password_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
~password_authenticator();

View File

@@ -7,6 +7,8 @@
*/
#include "auth/passwords.hh"
#include "utils/crypt_sha512.hh"
#include <seastar/core/coroutine.hh>
#include <cerrno>
@@ -21,27 +23,48 @@ static thread_local crypt_data tlcrypt = {};
namespace detail {
void verify_hashing_output(const char * res) {
if (!res || (res[0] == '*')) {
throw std::system_error(errno, std::system_category());
}
}
void verify_scheme(scheme scheme) {
const sstring random_part_of_salt = "aaaabbbbccccdddd";
const sstring salt = sstring(prefix_for_scheme(scheme)) + random_part_of_salt;
const char* e = crypt_r("fisk", salt.c_str(), &tlcrypt);
if (e && (e[0] != '*')) {
return;
try {
verify_hashing_output(e);
} catch (const std::system_error& ex) {
throw no_supported_schemes();
}
throw no_supported_schemes();
}
sstring hash_with_salt(const sstring& pass, const sstring& salt) {
auto res = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
if (!res || (res[0] == '*')) {
throw std::system_error(errno, std::system_category());
}
verify_hashing_output(res);
return res;
}
seastar::future<sstring> hash_with_salt_async(const sstring& pass, const sstring& salt) {
sstring res;
// Only SHA-512 hashes for passphrases shorter than 256 bytes can be computed using
// the __crypt_sha512 method. For other computations, we fall back to the
// crypt_r implementation from `<crypt.h>`, which can stall.
if (salt.starts_with(prefix_for_scheme(scheme::sha_512)) && pass.size() <= 255) {
char buf[128];
const char * output_ptr = co_await __crypt_sha512(pass.c_str(), salt.c_str(), buf);
verify_hashing_output(output_ptr);
res = output_ptr;
} else {
const char * output_ptr = crypt_r(pass.c_str(), salt.c_str(), &tlcrypt);
verify_hashing_output(output_ptr);
res = output_ptr;
}
co_return res;
}
std::string_view prefix_for_scheme(scheme c) noexcept {
switch (c) {
case scheme::bcrypt_y: return "$2y$";
@@ -58,8 +81,9 @@ no_supported_schemes::no_supported_schemes()
: std::runtime_error("No allowed hashing schemes are supported on this system") {
}
bool check(const sstring& pass, const sstring& salted_hash) {
return detail::hash_with_salt(pass, salted_hash) == salted_hash;
seastar::future<bool> check(const sstring& pass, const sstring& salted_hash) {
const auto pwd_hash = co_await detail::hash_with_salt_async(pass, salted_hash);
co_return pwd_hash == salted_hash;
}
} // namespace auth::passwords

View File

@@ -11,6 +11,7 @@
#include <random>
#include <stdexcept>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include "seastarx.hh"
@@ -75,10 +76,19 @@ sstring generate_salt(RandomNumberEngine& g, scheme scheme) {
///
/// Hash a password combined with an implementation-specific salt string.
/// Deprecated in favor of `hash_with_salt_async`.
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
sstring hash_with_salt(const sstring& pass, const sstring& salt);
[[deprecated("Use hash_with_salt_async instead")]] sstring hash_with_salt(const sstring& pass, const sstring& salt);
///
/// Async version of `hash_with_salt` that returns a future.
/// If possible, hashing uses `coroutine::maybe_yield` to prevent reactor stalls.
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
seastar::future<sstring> hash_with_salt_async(const sstring& pass, const sstring& salt);
} // namespace detail
@@ -107,6 +117,6 @@ sstring hash(const sstring& pass, RandomNumberEngine& g, scheme scheme) {
///
/// \throws \ref std::system_error when an unexpected implementation-specific error occurs.
///
bool check(const sstring& pass, const sstring& salted_hash);
seastar::future<bool> check(const sstring& pass, const sstring& salted_hash);
} // namespace auth::passwords

View File

@@ -35,10 +35,9 @@ static const class_registrator<
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
cache&,
utils::alien_worker&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
cache&> saslauthd_auth_reg("com.scylladb.auth.SaslauthdAuthenticator");
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, cache&, utils::alien_worker&)
saslauthd_authenticator::saslauthd_authenticator(cql3::query_processor& qp, ::service::raft_group0_client&, ::service::migration_manager&, cache&)
: _socket_path(qp.db().get_config().saslauthd_socket_path())
{}

View File

@@ -12,7 +12,6 @@
#include "auth/authenticator.hh"
#include "auth/cache.hh"
#include "utils/alien_worker.hh"
namespace cql3 {
class query_processor;
@@ -30,7 +29,7 @@ namespace auth {
class saslauthd_authenticator : public authenticator {
sstring _socket_path; ///< Path to the domain socket on which saslauthd is listening.
public:
saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&,utils::alien_worker&);
saslauthd_authenticator(cql3::query_processor&, ::service::raft_group0_client&, ::service::migration_manager&, cache&);
future<> start() override;

View File

@@ -191,8 +191,7 @@ service::service(
::service::migration_manager& mm,
const service_config& sc,
maintenance_socket_enabled used_by_maintenance_socket,
cache& cache,
utils::alien_worker& hashing_worker)
cache& cache)
: service(
std::move(c),
cache,
@@ -200,7 +199,7 @@ service::service(
g0,
mn,
create_object<authorizer>(sc.authorizer_java_name, qp, g0, mm),
create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm, cache, hashing_worker),
create_object<authenticator>(sc.authenticator_java_name, qp, g0, mm, cache),
create_object<role_manager>(sc.role_manager_java_name, qp, g0, mm, cache),
used_by_maintenance_socket) {
}
@@ -226,7 +225,7 @@ future<> service::create_legacy_keyspace_if_missing(::service::migration_manager
try {
co_return co_await mm.announce(::service::prepare_new_keyspace_announcement(db.real_database(), ksm, ts),
std::move(group0_guard), seastar::format("auth_service: create {} keyspace", meta::legacy::AUTH_KS));
} catch (::service::group0_concurrent_modification&) {
} catch (const ::service::group0_concurrent_modification&) {
log.info("Concurrent operation is detected while creating {} keyspace, retrying.", meta::legacy::AUTH_KS);
}
}

View File

@@ -27,7 +27,6 @@
#include "cql3/description.hh"
#include "seastarx.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/alien_worker.hh"
#include "utils/observable.hh"
#include "utils/serialized_action.hh"
#include "service/maintenance_mode.hh"
@@ -131,8 +130,7 @@ public:
::service::migration_manager&,
const service_config&,
maintenance_socket_enabled,
cache&,
utils::alien_worker&);
cache&);
future<> start(::service::migration_manager&, db::system_keyspace&);

View File

@@ -192,7 +192,7 @@ future<> standard_role_manager::legacy_create_default_role_if_missing() {
{_superuser},
cql3::query_processor::cache_internal::no).discard_result();
log.info("Created default superuser role '{}'.", _superuser);
} catch(const exceptions::unavailable_exception& e) {
} catch (const exceptions::unavailable_exception& e) {
log.warn("Skipped default role setup: some nodes were not ready; will retry");
throw e;
}

View File

@@ -38,8 +38,8 @@ class transitional_authenticator : public authenticator {
public:
static const sstring PASSWORD_AUTHENTICATOR_NAME;
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache, utils::alien_worker& hashing_worker)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, cache, hashing_worker)) {
transitional_authenticator(cql3::query_processor& qp, ::service::raft_group0_client& g0, ::service::migration_manager& mm, cache& cache)
: transitional_authenticator(std::make_unique<password_authenticator>(qp, g0, mm, cache)) {
}
transitional_authenticator(std::unique_ptr<authenticator> a)
: _authenticator(std::move(a)) {
@@ -81,7 +81,7 @@ public:
}).handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::authentication_exception&) {
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
@@ -126,7 +126,7 @@ public:
virtual bytes evaluate_response(bytes_view client_response) override {
try {
return _sasl->evaluate_response(client_response);
} catch (exceptions::authentication_exception&) {
} catch (const exceptions::authentication_exception&) {
_complete = true;
return {};
}
@@ -141,7 +141,7 @@ public:
return _sasl->get_authenticated_user().handle_exception([](auto ep) {
try {
std::rethrow_exception(ep);
} catch (exceptions::authentication_exception&) {
} catch (const exceptions::authentication_exception&) {
// return anon user
return make_ready_future<authenticated_user>(anonymous_user());
}
@@ -241,8 +241,7 @@ static const class_registrator<
cql3::query_processor&,
::service::raft_group0_client&,
::service::migration_manager&,
auth::cache&,
utils::alien_worker&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
auth::cache&> transitional_authenticator_reg(auth::PACKAGE_NAME + "TransitionalAuthenticator");
static const class_registrator<
auth::authorizer,

View File

@@ -859,6 +859,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/alien_worker.cc',
'utils/array-search.cc',
'utils/base64.cc',
'utils/crypt_sha512.cc',
'utils/logalloc.cc',
'utils/large_bitset.cc',
'utils/buffer_input_stream.cc',
@@ -1479,7 +1480,6 @@ deps = {
pure_boost_tests = set([
'test/boost/anchorless_list_test',
'test/boost/auth_passwords_test',
'test/boost/auth_resource_test',
'test/boost/big_decimal_test',
'test/boost/caching_options_test',
@@ -2192,6 +2192,8 @@ def kmiplib():
for id in os_ids:
if id in { 'centos', 'fedora', 'rhel' }:
return 'rhel84'
elif id in { 'ubuntu', 'debian' }:
return 'ubuntu' # Temporarily use a placeholder for Ubuntu/Debian
print('Could not resolve libkmip.a for platform {}'.format(os_ids))
sys.exit(1)

View File

@@ -1322,6 +1322,10 @@ const std::vector<expr::expression>& statement_restrictions::index_restrictions(
return _index_restrictions;
}
bool statement_restrictions::is_empty() const {
return !_where.has_value();
}
// Current score table:
// local and restrictions include full partition key: 2
// global: 1

View File

@@ -408,6 +408,8 @@ public:
/// Checks that the primary key restrictions don't contain null values, throws invalid_request_exception otherwise.
void validate_primary_key(const query_options& options) const;
bool is_empty() const;
};
statement_restrictions analyze_statement_restrictions(

View File

@@ -1976,7 +1976,7 @@ mutation_fragments_select_statement::do_execute(query_processor& qp, service::qu
if (it == indexes.end()) {
throw exceptions::invalid_request_exception("ANN ordering by vector requires the column to be indexed using 'vector_index'");
}
if (index_opt || parameters->allow_filtering() || restrictions->need_filtering() || check_needs_allow_filtering_anyway(*restrictions)) {
if (index_opt || parameters->allow_filtering() || !(restrictions->is_empty()) || check_needs_allow_filtering_anyway(*restrictions)) {
throw exceptions::invalid_request_exception("ANN ordering by vector does not support filtering");
}
index_opt = *it;

View File

@@ -42,6 +42,11 @@ table::get_index_manager() const {
return _ops->get_index_manager(*this);
}
db_clock::time_point
table::get_truncation_time() const {
return _ops->get_truncation_time(*this);
}
lw_shared_ptr<keyspace_metadata>
keyspace::metadata() const {
return _ops->get_keyspace_metadata(*this);

View File

@@ -77,6 +77,7 @@ public:
schema_ptr schema() const;
const std::vector<view_ptr>& views() const;
const secondary_index::secondary_index_manager& get_index_manager() const;
db_clock::time_point get_truncation_time() const;
};
class keyspace {

View File

@@ -27,6 +27,7 @@ public:
virtual std::optional<table> try_find_table(database db, table_id id) const = 0;
virtual const secondary_index::secondary_index_manager& get_index_manager(table t) const = 0;
virtual schema_ptr get_table_schema(table t) const = 0;
virtual db_clock::time_point get_truncation_time(table t) const = 0;
virtual lw_shared_ptr<keyspace_metadata> get_keyspace_metadata(keyspace ks) const = 0;
virtual bool is_internal(keyspace ks) const = 0;
virtual const locator::abstract_replication_strategy& get_replication_strategy(keyspace ks) const = 0;

20
db/batchlog.hh Normal file
View File

@@ -0,0 +1,20 @@
/*
* Copyright (C) 2025-present ScyllaDB
*/
/*
* SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
*/
#pragma once
#include "mutation/mutation.hh"
#include "utils/UUID.hh"
namespace db {
mutation get_batchlog_mutation_for(schema_ptr schema, const utils::chunked_vector<mutation>& mutations, int32_t version, db_clock::time_point now, const utils::UUID& id);
mutation get_batchlog_delete_mutation(schema_ptr schema, int32_t version, db_clock::time_point now, const utils::UUID& id);
}

View File

@@ -10,6 +10,7 @@
#include <chrono>
#include <exception>
#include <ranges>
#include <seastar/core/future-util.hh>
#include <seastar/core/do_with.hh>
#include <seastar/core/semaphore.hh>
@@ -18,12 +19,14 @@
#include <seastar/core/sleep.hh>
#include "batchlog_manager.hh"
#include "batchlog.hh"
#include "data_dictionary/data_dictionary.hh"
#include "mutation/canonical_mutation.hh"
#include "service/storage_proxy.hh"
#include "system_keyspace.hh"
#include "utils/rate_limiter.hh"
#include "utils/log.hh"
#include "utils/murmur_hash.hh"
#include "db_clock.hh"
#include "unimplemented.hh"
#include "idl/frozen_schema.dist.hh"
@@ -33,17 +36,94 @@
#include "cql3/untyped_result_set.hh"
#include "service_permit.hh"
#include "cql3/query_processor.hh"
#include "replica/database.hh"
static logging::logger blogger("batchlog_manager");
namespace db {
// Yields 256 batchlog shards. Even on the largest nodes we currently run on,
// this should be enough to give every core a batchlog partition.
static constexpr unsigned batchlog_shard_bits = 8;
int32_t batchlog_shard_of(db_clock::time_point written_at) {
const int64_t count = written_at.time_since_epoch().count();
std::array<uint64_t, 2> result;
utils::murmur_hash::hash3_x64_128(bytes_view(reinterpret_cast<const signed char*>(&count), sizeof(count)), 0, result);
uint64_t hash = result[0] ^ result[1];
return hash & ((1ULL << batchlog_shard_bits) - 1);
}
std::pair<partition_key, clustering_key>
get_batchlog_key(const schema& schema, int32_t version, db::batchlog_stage stage, int32_t batchlog_shard, db_clock::time_point written_at, std::optional<utils::UUID> id) {
auto pkey = partition_key::from_exploded(schema, {serialized(version), serialized(int8_t(stage)), serialized(batchlog_shard)});
std::vector<bytes> ckey_components;
ckey_components.reserve(2);
ckey_components.push_back(serialized(written_at));
if (id) {
ckey_components.push_back(serialized(*id));
}
auto ckey = clustering_key::from_exploded(schema, ckey_components);
return {std::move(pkey), std::move(ckey)};
}
std::pair<partition_key, clustering_key>
get_batchlog_key(const schema& schema, int32_t version, db::batchlog_stage stage, db_clock::time_point written_at, std::optional<utils::UUID> id) {
return get_batchlog_key(schema, version, stage, batchlog_shard_of(written_at), written_at, id);
}
mutation get_batchlog_mutation_for(schema_ptr schema, managed_bytes data, int32_t version, db::batchlog_stage stage, db_clock::time_point now, const utils::UUID& id) {
auto [key, ckey] = get_batchlog_key(*schema, version, stage, now, id);
auto timestamp = api::new_timestamp();
mutation m(schema, key);
// Avoid going through data_value and therefore `bytes`, as it can be large (#24809).
auto cdef_data = schema->get_column_definition(to_bytes("data"));
m.set_cell(ckey, *cdef_data, atomic_cell::make_live(*cdef_data->type, timestamp, std::move(data)));
return m;
}
mutation get_batchlog_mutation_for(schema_ptr schema, const utils::chunked_vector<mutation>& mutations, int32_t version, db::batchlog_stage stage, db_clock::time_point now, const utils::UUID& id) {
auto data = [&mutations] {
utils::chunked_vector<canonical_mutation> fm(mutations.begin(), mutations.end());
bytes_ostream out;
for (auto& m : fm) {
ser::serialize(out, m);
}
return std::move(out).to_managed_bytes();
}();
return get_batchlog_mutation_for(std::move(schema), std::move(data), version, stage, now, id);
}
mutation get_batchlog_mutation_for(schema_ptr schema, const utils::chunked_vector<mutation>& mutations, int32_t version, db_clock::time_point now, const utils::UUID& id) {
return get_batchlog_mutation_for(std::move(schema), mutations, version, batchlog_stage::initial, now, id);
}
mutation get_batchlog_delete_mutation(schema_ptr schema, int32_t version, db::batchlog_stage stage, db_clock::time_point now, const utils::UUID& id) {
auto [key, ckey] = get_batchlog_key(*schema, version, stage, now, id);
mutation m(schema, key);
auto timestamp = api::new_timestamp();
m.partition().apply_delete(*schema, ckey, tombstone(timestamp, gc_clock::now()));
return m;
}
mutation get_batchlog_delete_mutation(schema_ptr schema, int32_t version, db_clock::time_point now, const utils::UUID& id) {
return get_batchlog_delete_mutation(std::move(schema), version, batchlog_stage::initial, now, id);
}
} // namespace db
const std::chrono::seconds db::batchlog_manager::replay_interval;
const uint32_t db::batchlog_manager::page_size;
db::batchlog_manager::batchlog_manager(cql3::query_processor& qp, db::system_keyspace& sys_ks, batchlog_manager_config config)
: _qp(qp)
, _sys_ks(sys_ks)
, _write_request_timeout(std::chrono::duration_cast<db_clock::duration>(config.write_request_timeout))
, _replay_timeout(config.replay_timeout)
, _replay_rate(config.replay_rate)
, _delay(config.delay)
, _replay_cleanup_after_replays(config.replay_cleanup_after_replays)
@@ -152,18 +232,75 @@ future<> db::batchlog_manager::stop() {
}
future<size_t> db::batchlog_manager::count_all_batches() const {
sstring query = format("SELECT count(*) FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG);
sstring query = format("SELECT count(*) FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG_V2);
return _qp.execute_internal(query, cql3::query_processor::cache_internal::yes).then([](::shared_ptr<cql3::untyped_result_set> rs) {
return size_t(rs->one().get_as<int64_t>("count"));
});
}
db_clock::duration db::batchlog_manager::get_batch_log_timeout() const {
// enough time for the actual write + BM removal mutation
return _write_request_timeout * 2;
future<> db::batchlog_manager::maybe_migrate_v1_to_v2() {
if (_migration_done) {
return make_ready_future<>();
}
return with_gate(_gate, [this] () mutable -> future<> {
blogger.info("Migrating batchlog entries from v1 -> v2");
auto schema_v1 = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
auto schema_v2 = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG_V2);
auto batch = [this, schema_v1, schema_v2] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
// check version of serialization format
if (!row.has("version")) {
blogger.warn("Not migrating logged batch because of unknown version");
co_return stop_iteration::no;
}
auto version = row.get_as<int32_t>("version");
if (version != netw::messaging_service::current_version) {
blogger.warn("Not migrating logged batch because of incorrect version");
co_return stop_iteration::no;
}
auto id = row.get_as<utils::UUID>("id");
auto written_at = row.get_as<db_clock::time_point>("written_at");
auto data = row.get_blob_fragmented("data");
auto& sp = _qp.proxy();
utils::get_local_injector().inject("batchlog_manager_fail_migration", [] { throw std::runtime_error("Error injection: failing batchlog migration"); });
auto migrate_mut = get_batchlog_mutation_for(schema_v2, std::move(data), version, batchlog_stage::failed_replay, written_at, id);
co_await sp.mutate_locally(migrate_mut, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
mutation delete_mut(schema_v1, partition_key::from_single_value(*schema_v1, serialized(id)));
delete_mut.partition().apply_delete(*schema_v1, clustering_key_prefix::make_empty(), tombstone(api::new_timestamp(), gc_clock::now()));
co_await sp.mutate_locally(delete_mut, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
co_return stop_iteration::no;
};
try {
co_await _qp.query_internal(
format("SELECT * FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG),
db::consistency_level::ONE,
{},
page_size,
std::move(batch));
} catch (...) {
blogger.warn("Batchlog v1 to v2 migration failed: {}; will retry", std::current_exception());
co_return;
}
co_await container().invoke_on_all([] (auto& bm) {
bm._migration_done = true;
});
blogger.info("Done migrating batchlog entries from v1 -> v2");
});
}
future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cleanup) {
co_await maybe_migrate_v1_to_v2();
typedef db_clock::rep clock_type;
db::all_batches_replayed all_replayed = all_batches_replayed::yes;
@@ -172,21 +309,26 @@ future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches
auto throttle = _replay_rate / _qp.proxy().get_token_metadata_ptr()->count_normal_token_owners();
auto limiter = make_lw_shared<utils::rate_limiter>(throttle);
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
auto delete_batch = [this, schema = std::move(schema)] (utils::UUID id) {
auto key = partition_key::from_singular(*schema, id);
mutation m(schema, key);
auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
m.partition().apply_delete(*schema, clustering_key_prefix::make_empty(), tombstone(now, gc_clock::now()));
return _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG_V2);
struct replay_stats {
std::optional<db_clock::time_point> min_too_fresh;
bool need_cleanup = false;
};
auto batch = [this, limiter, delete_batch = std::move(delete_batch), &all_replayed](const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
std::unordered_map<int32_t, replay_stats> replay_stats_per_shard;
// Use a stable `now` accross all batches, so skip/replay decisions are the
// same accross a while prefix of written_at (accross all ids).
const auto now = db_clock::now();
auto batch = [this, cleanup, limiter, schema, &all_replayed, &replay_stats_per_shard, now] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
const auto stage = static_cast<batchlog_stage>(row.get_as<int8_t>("stage"));
const auto batch_shard = row.get_as<int32_t>("shard");
auto written_at = row.get_as<db_clock::time_point>("written_at");
auto id = row.get_as<utils::UUID>("id");
// enough time for the actual write + batchlog entry mutation delivery (two separate requests).
auto now = db_clock::now();
auto timeout = get_batch_log_timeout();
auto timeout = _replay_timeout;
if (utils::get_local_injector().is_enabled("skip_batch_replay")) {
blogger.debug("Skipping batch replay due to skip_batch_replay injection");
@@ -194,52 +336,48 @@ future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches
co_return stop_iteration::no;
}
// check version of serialization format
if (!row.has("version")) {
blogger.warn("Skipping logged batch because of unknown version");
co_await delete_batch(id);
co_return stop_iteration::no;
}
auto version = row.get_as<int32_t>("version");
if (version != netw::messaging_service::current_version) {
blogger.warn("Skipping logged batch because of incorrect version {}; current version = {}", version, netw::messaging_service::current_version);
co_await delete_batch(id);
co_return stop_iteration::no;
}
auto data = row.get_blob_unfragmented("data");
blogger.debug("Replaying batch {}", id);
blogger.debug("Replaying batch {} from stage {} and batch shard {}", id, int32_t(stage), batch_shard);
utils::chunked_vector<mutation> mutations;
bool send_failed = false;
auto& shard_written_at = replay_stats_per_shard.try_emplace(batch_shard, replay_stats{}).first->second;
try {
auto fms = make_lw_shared<std::deque<canonical_mutation>>();
utils::chunked_vector<std::pair<canonical_mutation, schema_ptr>> fms;
auto in = ser::as_input_stream(data);
while (in.size()) {
fms->emplace_back(ser::deserialize(in, std::type_identity<canonical_mutation>()));
schema_ptr s = _qp.db().find_schema(fms->back().column_family_id());
timeout = std::min(timeout, std::chrono::duration_cast<db_clock::duration>(s->tombstone_gc_options().propagation_delay_in_seconds()));
auto fm = ser::deserialize(in, std::type_identity<canonical_mutation>());
const auto tbl = _qp.db().try_find_table(fm.column_family_id());
if (!tbl) {
continue;
}
if (written_at <= tbl->get_truncation_time()) {
continue;
}
schema_ptr s = tbl->schema();
if (s->tombstone_gc_options().mode() == tombstone_gc_mode::repair) {
timeout = std::min(timeout, std::chrono::duration_cast<db_clock::duration>(s->tombstone_gc_options().propagation_delay_in_seconds()));
}
fms.emplace_back(std::move(fm), std::move(s));
}
if (now < written_at + timeout) {
blogger.debug("Skipping replay of {}, too fresh", id);
shard_written_at.min_too_fresh = std::min(shard_written_at.min_too_fresh.value_or(written_at), written_at);
co_return stop_iteration::no;
}
auto size = data.size();
auto mutations = co_await map_reduce(*fms, [this, written_at] (canonical_mutation& fm) {
const auto& cf = _qp.proxy().local_db().find_column_family(fm.column_family_id());
return make_ready_future<canonical_mutation*>(written_at > cf.get_truncation_time() ? &fm : nullptr);
},
utils::chunked_vector<mutation>(),
[this] (utils::chunked_vector<mutation> mutations, canonical_mutation* fm) {
if (fm) {
schema_ptr s = _qp.db().find_schema(fm->column_family_id());
mutations.emplace_back(fm->to_mutation(s));
}
return mutations;
});
for (const auto& [fm, s] : fms) {
mutations.emplace_back(fm.to_mutation(s));
co_await maybe_yield();
}
if (!mutations.empty()) {
const auto ttl = [written_at]() -> clock_type {
@@ -265,7 +403,11 @@ future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches
co_await limiter->reserve(size);
_stats.write_attempts += mutations.size();
auto timeout = db::timeout_clock::now() + write_timeout;
co_await _qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
if (cleanup) {
co_await _qp.proxy().send_batchlog_replay_to_all_replicas(mutations, timeout);
} else {
co_await _qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
}
}
}
} catch (data_dictionary::no_such_keyspace& ex) {
@@ -279,31 +421,80 @@ future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches
// Do _not_ remove the batch, assuning we got a node write error.
// Since we don't have hints (which origin is satisfied with),
// we have to resort to keeping this batch to next lap.
co_return stop_iteration::no;
if (!cleanup || stage == batchlog_stage::failed_replay) {
co_return stop_iteration::no;
}
send_failed = true;
}
auto& sp = _qp.proxy();
if (send_failed) {
blogger.debug("Moving batch {} to stage failed_replay", id);
auto m = get_batchlog_mutation_for(schema, mutations, netw::messaging_service::current_version, batchlog_stage::failed_replay, written_at, id);
co_await sp.mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
}
// delete batch
co_await delete_batch(id);
auto m = get_batchlog_delete_mutation(schema, netw::messaging_service::current_version, stage, written_at, id);
co_await _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
shard_written_at.need_cleanup = true;
co_return stop_iteration::no;
};
co_await with_gate(_gate, [this, cleanup, batch = std::move(batch)] () mutable -> future<> {
blogger.debug("Started replayAllFailedBatches (cpu {})", this_shard_id());
co_await with_gate(_gate, [this, cleanup, &all_replayed, batch = std::move(batch), now, &replay_stats_per_shard] () mutable -> future<> {
blogger.debug("Started replayAllFailedBatches with cleanup: {}", cleanup);
co_await utils::get_local_injector().inject("add_delay_to_batch_replay", std::chrono::milliseconds(1000));
co_await _qp.query_internal(
format("SELECT id, data, written_at, version FROM {}.{} BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG),
db::consistency_level::ONE,
{},
page_size,
std::move(batch)).then([this, cleanup] {
if (cleanup == post_replay_cleanup::no) {
return make_ready_future<>();
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG_V2);
co_await coroutine::parallel_for_each(std::views::iota(0, 16), [&] (int32_t chunk) -> future<> {
const int32_t batchlog_chunk_base = chunk * 16;
for (int32_t i = 0; i < 16; ++i) {
int32_t batchlog_shard = batchlog_chunk_base + i;
co_await _qp.query_internal(
format("SELECT * FROM {}.{} WHERE version = ? AND stage = ? AND shard = ? BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG_V2),
db::consistency_level::ONE,
{data_value(netw::messaging_service::current_version), data_value(int8_t(batchlog_stage::failed_replay)), data_value(batchlog_shard)},
page_size,
batch);
co_await _qp.query_internal(
format("SELECT * FROM {}.{} WHERE version = ? AND stage = ? AND shard = ? BYPASS CACHE", system_keyspace::NAME, system_keyspace::BATCHLOG_V2),
db::consistency_level::ONE,
{data_value(netw::messaging_service::current_version), data_value(int8_t(batchlog_stage::initial)), data_value(batchlog_shard)},
page_size,
batch);
if (cleanup != post_replay_cleanup::yes) {
continue;
}
auto it = replay_stats_per_shard.find(batchlog_shard);
if (it == replay_stats_per_shard.end() || !it->second.need_cleanup) {
// Nothing was replayed on this batchlog shard, nothing to cleanup.
continue;
}
const auto write_time = it->second.min_too_fresh.value_or(now - _replay_timeout);
const auto end_weight = it->second.min_too_fresh ? bound_weight::before_all_prefixed : bound_weight::after_all_prefixed;
auto [key, ckey] = get_batchlog_key(*schema, netw::messaging_service::current_version, batchlog_stage::initial, batchlog_shard, write_time, {});
auto end_pos = position_in_partition(partition_region::clustered, end_weight, std::move(ckey));
range_tombstone rt(position_in_partition::before_all_clustered_rows(), std::move(end_pos), tombstone(api::new_timestamp(), gc_clock::now()));
blogger.trace("Clean up batchlog shard {} with range tombstone {}", batchlog_shard, rt);
mutation m(schema, key);
m.partition().apply_row_tombstone(*schema, std::move(rt));
co_await _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
}
// Replaying batches could have generated tombstones, flush to disk,
// where they can be compacted away.
return replica::database::flush_table_on_all_shards(_qp.proxy().get_db(), system_keyspace::NAME, system_keyspace::BATCHLOG);
}).then([] {
blogger.debug("Finished replayAllFailedBatches");
});
blogger.debug("Finished replayAllFailedBatches with all_replayed: {}", all_replayed);
});
co_return all_replayed;

View File

@@ -34,12 +34,17 @@ class system_keyspace;
using all_batches_replayed = bool_class<struct all_batches_replayed_tag>;
struct batchlog_manager_config {
std::chrono::duration<double> write_request_timeout;
db_clock::duration replay_timeout;
uint64_t replay_rate = std::numeric_limits<uint64_t>::max();
std::chrono::milliseconds delay = std::chrono::milliseconds(0);
unsigned replay_cleanup_after_replays;
};
enum class batchlog_stage : int8_t {
initial,
failed_replay
};
class batchlog_manager : public peering_sharded_service<batchlog_manager> {
public:
using post_replay_cleanup = bool_class<class post_replay_cleanup_tag>;
@@ -59,7 +64,7 @@ private:
cql3::query_processor& _qp;
db::system_keyspace& _sys_ks;
db_clock::duration _write_request_timeout;
db_clock::duration _replay_timeout;
uint64_t _replay_rate;
std::chrono::milliseconds _delay;
unsigned _replay_cleanup_after_replays = 100;
@@ -71,6 +76,14 @@ private:
gc_clock::time_point _last_replay;
// Was the v1 -> v2 migration already done since last restart?
// The migration is attempted once after each restart. This is redundant but
// keeps thing simple. Once no upgrade path exists from a ScyllaDB version
// which can still produce v1 entries, this migration code can be removed.
bool _migration_done = false;
future<> maybe_migrate_v1_to_v2();
future<all_batches_replayed> replay_all_failed_batches(post_replay_cleanup cleanup);
public:
// Takes a QP, not a distributes. Because this object is supposed
@@ -85,10 +98,13 @@ public:
future<all_batches_replayed> do_batch_log_replay(post_replay_cleanup cleanup);
future<size_t> count_all_batches() const;
db_clock::duration get_batch_log_timeout() const;
gc_clock::time_point get_last_replay() const {
return _last_replay;
}
const stats& stats() const {
return _stats;
}
private:
future<> batchlog_replay_loop();
};

View File

@@ -54,12 +54,14 @@ public:
uint64_t applied_mutations = 0;
uint64_t corrupt_bytes = 0;
uint64_t truncated_at = 0;
uint64_t broken_files = 0;
stats& operator+=(const stats& s) {
invalid_mutations += s.invalid_mutations;
skipped_mutations += s.skipped_mutations;
applied_mutations += s.applied_mutations;
corrupt_bytes += s.corrupt_bytes;
broken_files += s.broken_files;
return *this;
}
stats operator+(const stats& s) const {
@@ -192,6 +194,8 @@ db::commitlog_replayer::impl::recover(const commitlog::descriptor& d, const comm
s->corrupt_bytes += e.bytes();
} catch (commitlog::segment_truncation& e) {
s->truncated_at = e.position();
} catch (commitlog::header_checksum_error&) {
++s->broken_files;
} catch (...) {
throw;
}
@@ -370,6 +374,9 @@ future<> db::commitlog_replayer::recover(std::vector<sstring> files, sstring fna
if (stats.truncated_at != 0) {
rlogger.warn("Truncated file: {} at position {}.", f, stats.truncated_at);
}
if (stats.broken_files != 0) {
rlogger.warn("Corrupted file header: {}. Skipped.", f);
}
rlogger.debug("Log replay of {} complete, {} replayed mutations ({} invalid, {} skipped)"
, f
, stats.applied_mutations

View File

@@ -1152,7 +1152,7 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"Number of threads with which to deliver hints. In multiple data-center deployments, consider increasing this number because cross data-center handoff is generally slower.")
, batchlog_replay_throttle_in_kb(this, "batchlog_replay_throttle_in_kb", value_status::Unused, 1024,
"Total maximum throttle. Throttling is reduced proportionally to the number of nodes in the cluster.")
, batchlog_replay_cleanup_after_replays(this, "batchlog_replay_cleanup_after_replays", liveness::LiveUpdate, value_status::Used, 60,
, batchlog_replay_cleanup_after_replays(this, "batchlog_replay_cleanup_after_replays", liveness::LiveUpdate, value_status::Used, 1,
"Clean up batchlog memtable after every N replays. Replays are issued on a timer, every 60 seconds. So if batchlog_replay_cleanup_after_replays is set to 60, the batchlog memtable is flushed every 60 * 60 seconds.")
/**
* @Group Request scheduler properties

View File

@@ -1262,16 +1262,9 @@ static future<> do_merge_schema(sharded<service::storage_proxy>& proxy, sharded
{
slogger.trace("do_merge_schema: {}", mutations);
schema_applier ap(proxy, ss, sys_ks, reload);
std::exception_ptr ex;
try {
co_await execute_do_merge_schema(proxy, ap, std::move(mutations));
} catch (...) {
ex = std::current_exception();
}
co_await ap.destroy();
if (ex) {
throw ex;
}
co_await execute_do_merge_schema(proxy, ap, std::move(mutations)).finally([&ap]() {
return ap.destroy();
});
}
/**

View File

@@ -65,7 +65,7 @@ future<> snapshot_ctl::run_snapshot_modify_operation(noncopyable_function<future
});
}
future<> snapshot_ctl::take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts) {
future<> snapshot_ctl::take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf) {
if (tag.empty()) {
throw std::runtime_error("You must supply a snapshot name.");
}
@@ -74,21 +74,21 @@ future<> snapshot_ctl::take_snapshot(sstring tag, std::vector<sstring> keyspace_
std::ranges::copy(_db.local().get_keyspaces() | std::views::keys, std::back_inserter(keyspace_names));
};
return run_snapshot_modify_operation([tag = std::move(tag), keyspace_names = std::move(keyspace_names), opts, this] () mutable {
return do_take_snapshot(std::move(tag), std::move(keyspace_names), opts);
return run_snapshot_modify_operation([tag = std::move(tag), keyspace_names = std::move(keyspace_names), sf, this] () mutable {
return do_take_snapshot(std::move(tag), std::move(keyspace_names), sf);
});
}
future<> snapshot_ctl::do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts) {
future<> snapshot_ctl::do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf) {
co_await coroutine::parallel_for_each(keyspace_names, [tag, this] (const auto& ks_name) {
return check_snapshot_not_exist(ks_name, tag);
});
co_await coroutine::parallel_for_each(keyspace_names, [this, tag = std::move(tag), opts] (const auto& ks_name) {
return replica::database::snapshot_keyspace_on_all_shards(_db, ks_name, tag, opts);
co_await coroutine::parallel_for_each(keyspace_names, [this, tag = std::move(tag), sf] (const auto& ks_name) {
return replica::database::snapshot_keyspace_on_all_shards(_db, ks_name, tag, bool(sf));
});
}
future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts) {
future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf) {
if (ks_name.empty()) {
throw std::runtime_error("You must supply a keyspace name");
}
@@ -99,14 +99,14 @@ future<> snapshot_ctl::take_column_family_snapshot(sstring ks_name, std::vector<
throw std::runtime_error("You must supply a snapshot name.");
}
return run_snapshot_modify_operation([this, ks_name = std::move(ks_name), tables = std::move(tables), tag = std::move(tag), opts] () mutable {
return do_take_column_family_snapshot(std::move(ks_name), std::move(tables), std::move(tag), opts);
return run_snapshot_modify_operation([this, ks_name = std::move(ks_name), tables = std::move(tables), tag = std::move(tag), sf] () mutable {
return do_take_column_family_snapshot(std::move(ks_name), std::move(tables), std::move(tag), sf);
});
}
future<> snapshot_ctl::do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts) {
future<> snapshot_ctl::do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf) {
co_await check_snapshot_not_exist(ks_name, tag, tables);
co_await replica::database::snapshot_tables_on_all_shards(_db, ks_name, std::move(tables), std::move(tag), opts);
co_await replica::database::snapshot_tables_on_all_shards(_db, ks_name, std::move(tables), std::move(tag), bool(sf));
}
future<> snapshot_ctl::clear_snapshot(sstring tag, std::vector<sstring> keyspace_names, sstring cf_name) {

View File

@@ -38,13 +38,10 @@ class backup_task_impl;
} // snapshot namespace
struct snapshot_options {
bool skip_flush = false;
bool use_sstable_identifier = false;
};
class snapshot_ctl : public peering_sharded_service<snapshot_ctl> {
public:
using skip_flush = bool_class<class skip_flush_tag>;
struct table_snapshot_details {
int64_t total;
int64_t live;
@@ -73,8 +70,8 @@ public:
*
* @param tag the tag given to the snapshot; may not be null or empty
*/
future<> take_snapshot(sstring tag, snapshot_options opts = {}) {
return take_snapshot(tag, {}, opts);
future<> take_snapshot(sstring tag, skip_flush sf = skip_flush::no) {
return take_snapshot(tag, {}, sf);
}
/**
@@ -83,7 +80,7 @@ public:
* @param tag the tag given to the snapshot; may not be null or empty
* @param keyspace_names the names of the keyspaces to snapshot; empty means "all"
*/
future<> take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts = {});
future<> take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf = skip_flush::no);
/**
* Takes the snapshot of multiple tables. A snapshot name must be specified.
@@ -92,7 +89,7 @@ public:
* @param tables a vector of tables names to snapshot
* @param tag the tag given to the snapshot; may not be null or empty
*/
future<> take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
future<> take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf = skip_flush::no);
/**
* Remove the snapshot with the given name from the given keyspaces.
@@ -130,8 +127,8 @@ private:
friend class snapshot::backup_task_impl;
future<> do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, snapshot_options opts = {} );
future<> do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, snapshot_options opts = {});
future<> do_take_snapshot(sstring tag, std::vector<sstring> keyspace_names, skip_flush sf = skip_flush::no);
future<> do_take_column_family_snapshot(sstring ks_name, std::vector<sstring> tables, sstring tag, skip_flush sf = skip_flush::no);
};
}

View File

@@ -137,8 +137,6 @@ namespace {
system_keyspace::ROLE_PERMISSIONS,
system_keyspace::DICTS,
system_keyspace::VIEW_BUILDING_TASKS,
// repair tasks
system_keyspace::REPAIR_TASKS,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.is_group0_table = true;
@@ -215,6 +213,30 @@ schema_ptr system_keyspace::batchlog() {
return batchlog;
}
schema_ptr system_keyspace::batchlog_v2() {
static thread_local auto batchlog_v2 = [] {
schema_builder builder(generate_legacy_id(NAME, BATCHLOG_V2), NAME, BATCHLOG_V2,
// partition key
{{"version", int32_type}, {"stage", byte_type}, {"shard", int32_type}},
// clustering key
{{"written_at", timestamp_type}, {"id", uuid_type}},
// regular columns
{{"data", bytes_type}},
// static columns
{},
// regular column name type
utf8_type,
// comment
"batches awaiting replay"
);
builder.set_gc_grace_seconds(0);
builder.set_caching_options(caching_options::get_disabled_caching_options());
builder.with_hash_version();
return builder.build(schema_builder::compact_storage::no);
}();
return batchlog_v2;
}
/*static*/ schema_ptr system_keyspace::paxos() {
static thread_local auto paxos = [] {
// FIXME: switch to the new schema_builder interface (with_column(...), etc)
@@ -464,24 +486,6 @@ schema_ptr system_keyspace::repair_history() {
return schema;
}
schema_ptr system_keyspace::repair_tasks() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(NAME, REPAIR_TASKS);
return schema_builder(NAME, REPAIR_TASKS, std::optional(id))
.with_column("task_uuid", uuid_type, column_kind::partition_key)
.with_column("operation", utf8_type, column_kind::clustering_key)
// First and last token for of the tablet
.with_column("first_token", long_type, column_kind::clustering_key)
.with_column("last_token", long_type, column_kind::clustering_key)
.with_column("timestamp", timestamp_type)
.with_column("table_uuid", uuid_type, column_kind::static_column)
.set_comment("Record tablet repair tasks")
.with_hash_version()
.build();
}();
return schema;
}
schema_ptr system_keyspace::built_indexes() {
static thread_local auto built_indexes = [] {
schema_builder builder(generate_legacy_id(NAME, BUILT_INDEXES), NAME, BUILT_INDEXES,
@@ -2324,14 +2328,13 @@ std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
std::copy(schema_tables.begin(), schema_tables.end(), std::back_inserter(r));
auto auth_tables = system_keyspace::auth_tables();
std::copy(auth_tables.begin(), auth_tables.end(), std::back_inserter(r));
r.insert(r.end(), { built_indexes(), hints(), batchlog(), paxos(), local(),
r.insert(r.end(), { built_indexes(), hints(), batchlog(), batchlog_v2(), paxos(), local(),
peers(), peer_events(), range_xfers(),
compactions_in_progress(), compaction_history(),
sstable_activity(), size_estimates(), large_partitions(), large_rows(), large_cells(),
corrupt_data(),
scylla_local(), db::schema_tables::scylla_table_schema_history(),
repair_history(),
repair_tasks(),
v3::views_builds_in_progress(), v3::built_views(),
v3::scylla_views_builds_in_progress(),
v3::truncated(),
@@ -2356,7 +2359,9 @@ std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
}
static bool maybe_write_in_user_memory(schema_ptr s) {
return (s.get() == system_keyspace::batchlog().get()) || (s.get() == system_keyspace::paxos().get())
return (s.get() == system_keyspace::batchlog().get())
|| (s.get() == system_keyspace::batchlog_v2().get())
|| (s.get() == system_keyspace::paxos().get())
|| s == system_keyspace::v3::scylla_views_builds_in_progress();
}
@@ -2573,32 +2578,6 @@ future<> system_keyspace::get_repair_history(::table_id table_id, repair_history
});
}
future<utils::chunked_vector<canonical_mutation>> system_keyspace::get_update_repair_task_mutations(const repair_task_entry& entry, api::timestamp_type ts) {
// Default to timeout the repair task entries in 10 days, this should be enough time for the management tools to query
constexpr int ttl = 10 * 24 * 3600;
sstring req = format("INSERT INTO system.{} (task_uuid, operation, first_token, last_token, timestamp, table_uuid) VALUES (?, ?, ?, ?, ?, ?) USING TTL {}", REPAIR_TASKS, ttl);
auto muts = co_await _qp.get_mutations_internal(req, internal_system_query_state(), ts,
{entry.task_uuid.uuid(), repair_task_operation_to_string(entry.operation),
entry.first_token, entry.last_token, entry.timestamp, entry.table_uuid.uuid()});
utils::chunked_vector<canonical_mutation> cmuts = {muts.begin(), muts.end()};
co_return cmuts;
}
future<> system_keyspace::get_repair_task(tasks::task_id task_uuid, repair_task_consumer f) {
sstring req = format("SELECT * from system.{} WHERE task_uuid = {}", REPAIR_TASKS, task_uuid);
co_await _qp.query_internal(req, [&f] (const cql3::untyped_result_set::row& row) mutable -> future<stop_iteration> {
repair_task_entry ent;
ent.task_uuid = tasks::task_id(row.get_as<utils::UUID>("task_uuid"));
ent.operation = repair_task_operation_from_string(row.get_as<sstring>("operation"));
ent.first_token = row.get_as<int64_t>("first_token");
ent.last_token = row.get_as<int64_t>("last_token");
ent.timestamp = row.get_as<db_clock::time_point>("timestamp");
ent.table_uuid = ::table_id(row.get_as<utils::UUID>("table_uuid"));
co_await f(std::move(ent));
co_return stop_iteration::no;
});
}
future<gms::generation_type> system_keyspace::increment_and_get_generation() {
auto req = format("SELECT gossip_generation FROM system.{} WHERE key='{}'", LOCAL, LOCAL);
auto rs = co_await _qp.execute_internal(req, cql3::query_processor::cache_internal::yes);
@@ -3770,35 +3749,4 @@ future<> system_keyspace::apply_mutation(mutation m) {
return _qp.proxy().mutate_locally(m, {}, db::commitlog::force_sync(m.schema()->static_props().wait_for_sync_to_commitlog), db::no_timeout);
}
// The names are persisted in system tables so should not be changed.
static const std::unordered_map<system_keyspace::repair_task_operation, sstring> repair_task_operation_to_name = {
{system_keyspace::repair_task_operation::requested, "requested"},
{system_keyspace::repair_task_operation::finished, "finished"},
};
static const std::unordered_map<sstring, system_keyspace::repair_task_operation> repair_task_operation_from_name = std::invoke([] {
std::unordered_map<sstring, system_keyspace::repair_task_operation> result;
for (auto&& [v, s] : repair_task_operation_to_name) {
result.emplace(s, v);
}
return result;
});
sstring system_keyspace::repair_task_operation_to_string(system_keyspace::repair_task_operation op) {
auto i = repair_task_operation_to_name.find(op);
if (i == repair_task_operation_to_name.end()) {
on_internal_error(slogger, format("Invalid repair task operation: {}", static_cast<int>(op)));
}
return i->second;
}
system_keyspace::repair_task_operation system_keyspace::repair_task_operation_from_string(const sstring& name) {
return repair_task_operation_from_name.at(name);
}
} // namespace db
auto fmt::formatter<db::system_keyspace::repair_task_operation>::format(const db::system_keyspace::repair_task_operation& op, fmt::format_context& ctx) const
-> decltype(ctx.out()) {
return fmt::format_to(ctx.out(), "{}", db::system_keyspace::repair_task_operation_to_string(op));
}

View File

@@ -57,8 +57,6 @@ namespace paxos {
struct topology_request_state;
class group0_guard;
class raft_group0_client;
}
namespace netw {
@@ -165,6 +163,7 @@ public:
static constexpr auto NAME = "system";
static constexpr auto HINTS = "hints";
static constexpr auto BATCHLOG = "batchlog";
static constexpr auto BATCHLOG_V2 = "batchlog_v2";
static constexpr auto PAXOS = "paxos";
static constexpr auto BUILT_INDEXES = "IndexInfo";
static constexpr auto LOCAL = "local";
@@ -186,7 +185,6 @@ public:
static constexpr auto RAFT_SNAPSHOTS = "raft_snapshots";
static constexpr auto RAFT_SNAPSHOT_CONFIG = "raft_snapshot_config";
static constexpr auto REPAIR_HISTORY = "repair_history";
static constexpr auto REPAIR_TASKS = "repair_tasks";
static constexpr auto GROUP0_HISTORY = "group0_history";
static constexpr auto DISCOVERY = "discovery";
static constexpr auto BROADCAST_KV_STORE = "broadcast_kv_store";
@@ -258,12 +256,12 @@ public:
static schema_ptr hints();
static schema_ptr batchlog();
static schema_ptr batchlog_v2();
static schema_ptr paxos();
static schema_ptr built_indexes(); // TODO (from Cassandra): make private
static schema_ptr raft();
static schema_ptr raft_snapshots();
static schema_ptr repair_history();
static schema_ptr repair_tasks();
static schema_ptr group0_history();
static schema_ptr discovery();
static schema_ptr broadcast_kv_store();
@@ -402,22 +400,6 @@ public:
int64_t range_end;
};
enum class repair_task_operation {
requested,
finished,
};
static sstring repair_task_operation_to_string(repair_task_operation op);
static repair_task_operation repair_task_operation_from_string(const sstring& name);
struct repair_task_entry {
tasks::task_id task_uuid;
repair_task_operation operation;
int64_t first_token;
int64_t last_token;
db_clock::time_point timestamp;
table_id table_uuid;
};
struct topology_requests_entry {
utils::UUID id;
utils::UUID initiating_host;
@@ -439,10 +421,6 @@ public:
using repair_history_consumer = noncopyable_function<future<>(const repair_history_entry&)>;
future<> get_repair_history(table_id, repair_history_consumer f);
future<utils::chunked_vector<canonical_mutation>> get_update_repair_task_mutations(const repair_task_entry& entry, api::timestamp_type ts);
using repair_task_consumer = noncopyable_function<future<>(const repair_task_entry&)>;
future<> get_repair_task(tasks::task_id task_uuid, repair_task_consumer f);
future<> save_truncation_record(const replica::column_family&, db_clock::time_point truncated_at, db::replay_position);
future<replay_positions> get_truncated_positions(table_id);
future<> drop_truncation_rp_records();
@@ -750,8 +728,3 @@ public:
}; // class system_keyspace
} // namespace db
template <>
struct fmt::formatter<db::system_keyspace::repair_task_operation> : fmt::formatter<string_view> {
auto format(const db::system_keyspace::repair_task_operation&, fmt::format_context& ctx) const -> decltype(ctx.out());
};

View File

@@ -71,7 +71,7 @@ Use "Bash on Ubuntu on Windows" for the same tools and capabilities as on Linux
### Building the Docs
1. Run `make preview` to build the documentation.
1. Run `make preview` in the `docs/` directory to build the documentation.
1. Preview the built documentation locally at http://127.0.0.1:5500/.
### Cleanup

View File

@@ -41,6 +41,8 @@ class MetricsProcessor:
# Get metrics from the file
try:
metrics_file = metrics.get_metrics_from_file(relative_path, "scylla_", metrics_info, strict=strict)
except SystemExit:
pass
finally:
os.chdir(old_cwd)
if metrics_file:

View File

@@ -29,9 +29,6 @@ A CDC generation consists of:
This is the mapping used to decide on which stream IDs to use when making writes, as explained in the :doc:`./cdc-streams` document. It is a global property of the cluster: it doesn't depend on the table you're making writes to.
.. caution::
The tables mentioned in the following sections: ``system_distributed.cdc_generation_timestamps`` and ``system_distributed.cdc_streams_descriptions_v2`` have been introduced in ScyllaDB 4.4. It is highly recommended to upgrade to 4.4 for efficient CDC usage. The last section explains how to run the below examples in ScyllaDB 4.3.
When CDC generations change
---------------------------

View File

@@ -28,7 +28,8 @@ Incremental Repair is only supported for tables that use the tablets architectur
Incremental Repair Modes
------------------------
While incremental repair is the default and recommended mode, you can control its behavior for a given repair operation using the ``incremental_mode`` parameter. This is useful for situations where you might need to force a full data validation.
Incremental is currently disabled by default. You can control its behavior for a given repair operation using the ``incremental_mode`` parameter.
This is useful for enabling incremental repair, or in situations where you might need to force a full data validation.
The available modes are:

View File

@@ -53,13 +53,13 @@ ScyllaDB nodetool cluster repair command supports the following options:
nodetool cluster repair --tablet-tokens 1,10474535988
- ``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental.
- ``--incremental-mode`` specifies the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled'.
For example:
::
nodetool cluster repair --incremental-mode regular
nodetool cluster repair --incremental-mode disabled
- ``keyspace`` executes a repair on a specific keyspace. The default is all keyspaces.

View File

@@ -17,7 +17,7 @@ SYNOPSIS
[(-u <username> | --username <username>)] snapshot
[(-cf <table> | --column-family <table> | --table <table>)]
[(-kc <kclist> | --kc.list <kclist>)]
[(-sf | --skip-flush)] [--use-sstable-identifier] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]
[(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]
OPTIONS
.......
@@ -37,8 +37,6 @@ Parameter Descriptio
-------------------------------------------------------------------- -------------------------------------------------------------------------------------
-sf / --skip-flush Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
-------------------------------------------------------------------- -------------------------------------------------------------------------------------
--use-sstable-identifier Use the sstable identifier UUID, if available, rather than the sstable generation.
-------------------------------------------------------------------- -------------------------------------------------------------------------------------
-t <tag> / --tag <tag> The name of the snapshot
==================================================================== =====================================================================================

View File

@@ -64,13 +64,12 @@ ADMIN Logs service level operations: create, alter, drop, attach, detach, l
auditing.
========= =========================================================================================
Note that audit for every DML or QUERY might impact performance and consume a lot of storage.
Note that enabling audit may negatively impact performance and audit-to-table may consume extra storage. That's especially true when auditing DML and QUERY categories, which generate a high volume of audit messages.
Configuring Audit Storage
---------------------------
Auditing messages can be sent to :ref:`Syslog <auditing-syslog-storage>` or stored in a Scylla :ref:`table <auditing-table-storage>`.
Currently, auditing messages can only be saved to one location at a time. You cannot log into both a table and the Syslog.
Auditing messages can be sent to :ref:`Syslog <auditing-syslog-storage>` or stored in a Scylla :ref:`table <auditing-table-storage>` or both.
.. _auditing-syslog-storage:
@@ -193,6 +192,23 @@ For example:
2018-03-18 00:00:00+0000 | 10.143.2.108 | 3429b1a5-2a94-11e8-8f4e-000000000001 | DDL | ONE | False | nba | DROP TABLE nba.team_roster ; | 127.0.0.1 | team_roster | Scylla |
(1 row)
.. _auditing-table-and-syslog-storage:
Storing Audit Messages in a Table and Syslog Simultaneously
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
**Procedure**
#. Follow both procedures from above, and set the ``audit`` parameter in the ``scylla.yaml`` file to both ``syslog`` and ``table``. You need to restart scylla only once.
To have both syslog and table you need to specify both backends separated by a comma:
.. code-block:: shell
audit: "syslog,table"
Handling Audit Failures
---------------------------

View File

@@ -143,7 +143,6 @@ public:
gms::feature tablet_incremental_repair { *this, "TABLET_INCREMENTAL_REPAIR"sv };
gms::feature tablet_repair_scheduler { *this, "TABLET_REPAIR_SCHEDULER"sv };
gms::feature tablet_repair_tasks_table { *this, "TABLET_REPAIR_TASKS_TABLE"sv };
gms::feature tablet_merge { *this, "TABLET_MERGE"sv };
gms::feature tablet_rack_aware_view_pairing { *this, "TABLET_RACK_AWARE_VIEW_PAIRING"sv };

View File

@@ -15,3 +15,22 @@ with the Apache License (version 2) and ScyllaDB-Source-Available-1.0.
They contain the following tag:
SPDX-License-Identifier: (LicenseRef-ScyllaDB-Source-Available-1.0 and Apache-2.0)
### `musl libc` files
`licenses/musl-license.txt` is obtained from:
https://git.musl-libc.org/cgit/musl/tree/COPYRIGHT
`utils/crypt_sha512.cc` is obtained from:
https://git.musl-libc.org/cgit/musl/tree/src/crypt/crypt_sha512.c
Both files are obtained from git.musl-libc.org.
Import commit:
commit 1b76ff0767d01df72f692806ee5adee13c67ef88
Author: Alex Rønne Petersen <alex@alexrp.com>
Date: Sun Oct 12 05:35:19 2025 +0200
s390x: shuffle register usage in __tls_get_offset to avoid r0 as address
musl as a whole is licensed under the standard MIT license included in
`licenses/musl-license.txt`.

193
licenses/musl-license.txt Normal file
View File

@@ -0,0 +1,193 @@
musl as a whole is licensed under the following standard MIT license:
----------------------------------------------------------------------
Copyright © 2005-2020 Rich Felker, et al.
Permission is hereby granted, free of charge, to any person obtaining
a copy of this software and associated documentation files (the
"Software"), to deal in the Software without restriction, including
without limitation the rights to use, copy, modify, merge, publish,
distribute, sublicense, and/or sell copies of the Software, and to
permit persons to whom the Software is furnished to do so, subject to
the following conditions:
The above copyright notice and this permission notice shall be
included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND,
EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF
MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY
CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT,
TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE
SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
----------------------------------------------------------------------
Authors/contributors include:
A. Wilcox
Ada Worcester
Alex Dowad
Alex Suykov
Alexander Monakov
Andre McCurdy
Andrew Kelley
Anthony G. Basile
Aric Belsito
Arvid Picciani
Bartosz Brachaczek
Benjamin Peterson
Bobby Bingham
Boris Brezillon
Brent Cook
Chris Spiegel
Clément Vasseur
Daniel Micay
Daniel Sabogal
Daurnimator
David Carlier
David Edelsohn
Denys Vlasenko
Dmitry Ivanov
Dmitry V. Levin
Drew DeVault
Emil Renner Berthing
Fangrui Song
Felix Fietkau
Felix Janda
Gianluca Anzolin
Hauke Mehrtens
He X
Hiltjo Posthuma
Isaac Dunham
Jaydeep Patil
Jens Gustedt
Jeremy Huntwork
Jo-Philipp Wich
Joakim Sindholt
John Spencer
Julien Ramseier
Justin Cormack
Kaarle Ritvanen
Khem Raj
Kylie McClain
Leah Neukirchen
Luca Barbato
Luka Perkov
Lynn Ochs
M Farkas-Dyck (Strake)
Mahesh Bodapati
Markus Wichmann
Masanori Ogino
Michael Clark
Michael Forney
Mikhail Kremnyov
Natanael Copa
Nicholas J. Kain
orc
Pascal Cuoq
Patrick Oppenlander
Petr Hosek
Petr Skocik
Pierre Carrier
Reini Urban
Rich Felker
Richard Pennington
Ryan Fairfax
Samuel Holland
Segev Finer
Shiz
sin
Solar Designer
Stefan Kristiansson
Stefan O'Rear
Szabolcs Nagy
Timo Teräs
Trutz Behn
Will Dietz
William Haddon
William Pitcock
Portions of this software are derived from third-party works licensed
under terms compatible with the above MIT license:
The TRE regular expression implementation (src/regex/reg* and
src/regex/tre*) is Copyright © 2001-2008 Ville Laurikari and licensed
under a 2-clause BSD license (license text in the source files). The
included version has been heavily modified by Rich Felker in 2012, in
the interests of size, simplicity, and namespace cleanliness.
Much of the math library code (src/math/* and src/complex/*) is
Copyright © 1993,2004 Sun Microsystems or
Copyright © 2003-2011 David Schultz or
Copyright © 2003-2009 Steven G. Kargl or
Copyright © 2003-2009 Bruce D. Evans or
Copyright © 2008 Stephen L. Moshier or
Copyright © 2017-2018 Arm Limited
and labelled as such in comments in the individual source files. All
have been licensed under extremely permissive terms.
The ARM memcpy code (src/string/arm/memcpy.S) is Copyright © 2008
The Android Open Source Project and is licensed under a two-clause BSD
license. It was taken from Bionic libc, used on Android.
The AArch64 memcpy and memset code (src/string/aarch64/*) are
Copyright © 1999-2019, Arm Limited.
The implementation of DES for crypt (src/crypt/crypt_des.c) is
Copyright © 1994 David Burren. It is licensed under a BSD license.
The implementation of blowfish crypt (src/crypt/crypt_blowfish.c) was
originally written by Solar Designer and placed into the public
domain. The code also comes with a fallback permissive license for use
in jurisdictions that may not recognize the public domain.
The smoothsort implementation (src/stdlib/qsort.c) is Copyright © 2011
Lynn Ochs and is licensed under an MIT-style license.
The x86_64 port was written by Nicholas J. Kain and is licensed under
the standard MIT terms.
The mips and microblaze ports were originally written by Richard
Pennington for use in the ellcc project. The original code was adapted
by Rich Felker for build system and code conventions during upstream
integration. It is licensed under the standard MIT terms.
The mips64 port was contributed by Imagination Technologies and is
licensed under the standard MIT terms.
The powerpc port was also originally written by Richard Pennington,
and later supplemented and integrated by John Spencer. It is licensed
under the standard MIT terms.
All other files which have no copyright comments are original works
produced specifically for use as part of this library, written either
by Rich Felker, the main author of the library, or by one or more
contibutors listed above. Details on authorship of individual files
can be found in the git version control history of the project. The
omission of copyright and license comments in each file is in the
interest of source tree size.
In addition, permission is hereby granted for all public header files
(include/* and arch/*/bits/*) and crt files intended to be linked into
applications (crt/*, ldso/dlstart.c, and arch/*/crt_arch.h) to omit
the copyright notice and permission notice otherwise required by the
license, and to use these files without any requirement of
attribution. These files include substantial contributions from:
Bobby Bingham
John Spencer
Nicholas J. Kain
Rich Felker
Richard Pennington
Stefan Kristiansson
Szabolcs Nagy
all of whom have explicitly granted such permission.
This file previously contained text expressing a belief that most of
the files covered by the above exception were sufficiently trivial not
to be subject to copyright, resulting in confusion over whether it
negated the permissions granted in the license. In the spirit of
permissive licensing, and of not having licensing issues being an
obstacle to adoption, that text has been removed.

View File

@@ -200,7 +200,10 @@ enum class tablet_repair_incremental_mode : uint8_t {
disabled,
};
constexpr tablet_repair_incremental_mode default_tablet_repair_incremental_mode{tablet_repair_incremental_mode::incremental};
// FIXME: Incremental repair is disabled by default due to
// https://github.com/scylladb/scylladb/issues/26041 and
// https://github.com/scylladb/scylladb/issues/27414
constexpr tablet_repair_incremental_mode default_tablet_repair_incremental_mode{tablet_repair_incremental_mode::disabled};
sstring tablet_repair_incremental_mode_to_string(tablet_repair_incremental_mode);
tablet_repair_incremental_mode tablet_repair_incremental_mode_from_string(const sstring&);

View File

@@ -235,9 +235,6 @@ public:
const topology& get_topology() const;
void debug_show() const;
/** Return the unique host ID for an end-point. */
host_id get_host_id(inet_address endpoint) const;
/** @return a copy of the endpoint-to-id map for read-only operations */
std::unordered_set<host_id> get_host_ids() const;

11
main.cc
View File

@@ -748,8 +748,6 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
// inherit Seastar's CPU affinity masks. We want this thread to be free
// to migrate between CPUs; we think that's what makes the most sense.
auto rpc_dict_training_worker = utils::alien_worker(startlog, 19, "rpc-dict");
// niceness=10 is ~10% of normal process time
auto hashing_worker = utils::alien_worker(startlog, 10, "pwd-hash");
return app.run(ac, av, [&] () -> future<int> {
@@ -779,8 +777,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
return seastar::async([&app, cfg, ext, &disk_space_monitor_shard0, &cm, &sstm, &db, &qp, &bm, &proxy, &mapreduce_service, &mm, &mm_notifier, &ctx, &opts, &dirs,
&prometheus_server, &cf_cache_hitrate_calculator, &load_meter, &feature_service, &gossiper, &snitch,
&token_metadata, &erm_factory, &snapshot_ctl, &messaging, &sst_dir_semaphore, &raft_gr, &service_memory_limiter,
&repair, &sst_loader, &auth_cache, &ss, &lifecycle_notifier, &stream_manager, &task_manager, &rpc_dict_training_worker,
&hashing_worker, &vector_store_client] {
&repair, &sst_loader, &auth_cache, &ss, &lifecycle_notifier, &stream_manager, &task_manager, &rpc_dict_training_worker, &vector_store_client] {
try {
if (opts.contains("relabel-config-file") && !opts["relabel-config-file"].as<sstring>().empty()) {
// calling update_relabel_config_from_file can cause an exception that would stop startup
@@ -2060,7 +2057,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
maintenance_auth_config.authenticator_java_name = sstring{auth::allow_all_authenticator_name};
maintenance_auth_config.role_manager_java_name = sstring{auth::maintenance_socket_role_manager_name};
maintenance_auth_service.start(perm_cache_config, std::ref(qp), std::ref(group0_client), std::ref(mm_notifier), std::ref(mm), maintenance_auth_config, maintenance_socket_enabled::yes, std::ref(auth_cache), std::ref(hashing_worker)).get();
maintenance_auth_service.start(perm_cache_config, std::ref(qp), std::ref(group0_client), std::ref(mm_notifier), std::ref(mm), maintenance_auth_config, maintenance_socket_enabled::yes, std::ref(auth_cache)).get();
cql_maintenance_server_ctl.emplace(maintenance_auth_service, mm_notifier, gossiper, qp, service_memory_limiter, sl_controller, lifecycle_notifier, *cfg, maintenance_cql_sg_stats_key, maintenance_socket_enabled::yes, dbcfg.statement_scheduling_group);
@@ -2336,7 +2333,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
auth_config.authenticator_java_name = qualified_authenticator_name;
auth_config.role_manager_java_name = qualified_role_manager_name;
auth_service.start(std::move(perm_cache_config), std::ref(qp), std::ref(group0_client), std::ref(mm_notifier), std::ref(mm), auth_config, maintenance_socket_enabled::no, std::ref(auth_cache), std::ref(hashing_worker)).get();
auth_service.start(std::move(perm_cache_config), std::ref(qp), std::ref(group0_client), std::ref(mm_notifier), std::ref(mm), auth_config, maintenance_socket_enabled::no, std::ref(auth_cache)).get();
std::any stop_auth_service;
// Has to be called after node joined the cluster (join_cluster())
@@ -2380,7 +2377,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
checkpoint(stop_signal, "starting batchlog manager");
db::batchlog_manager_config bm_cfg;
bm_cfg.write_request_timeout = cfg->write_request_timeout_in_ms() * 1ms;
bm_cfg.replay_timeout = cfg->write_request_timeout_in_ms() * 1ms * 2;
bm_cfg.replay_rate = cfg->batchlog_replay_throttle_in_kb() * 1000;
bm_cfg.delay = std::chrono::milliseconds(cfg->ring_delay_ms());
bm_cfg.replay_cleanup_after_replays = cfg->batchlog_replay_cleanup_after_replays();

View File

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:80a47fe93866989aaf7e949168fcd308e95841e78c976a61f9eac20bfdd34d96
size 6448960
oid sha256:3cbe2dd05945f8fb76ebce2ea70864063d2b282c4d5080af1f290ead43321ab3
size 6444732

View File

@@ -551,9 +551,13 @@ void repair_writer_impl::create_writer(lw_shared_ptr<repair_writer> w) {
}
replica::table& t = _db.local().find_column_family(_schema->id());
rlogger.debug("repair_writer: keyspace={}, table={}, estimated_partitions={}", w->schema()->ks_name(), w->schema()->cf_name(), w->get_estimated_partitions());
// #17384 don't use off-strategy for repair (etc) if using tablets. sstables generated will
// be single token range and can just be added to normal sstable set as is, eventually
// handled by normal compaction.
auto off_str = t.uses_tablets() ? sstables::offstrategy(false) : is_offstrategy_supported(_reason);
auto sharder = get_sharder_helper(t, *(w->schema()), _topo_guard);
_writer_done = mutation_writer::distribute_reader_and_consume_on_shards(_schema, sharder.sharder, std::move(_queue_reader),
streaming::make_streaming_consumer(sstables::repair_origin, _db, _view_builder, _view_building_worker, w->get_estimated_partitions(), _reason, is_offstrategy_supported(_reason),
streaming::make_streaming_consumer(sstables::repair_origin, _db, _view_builder, _view_building_worker, w->get_estimated_partitions(), _reason, off_str,
_topo_guard, _repaired_at, w->get_sstable_list_to_mark_as_repaired()),
t.stream_in_progress()).then([w] (uint64_t partitions) {
rlogger.debug("repair_writer: keyspace={}, table={}, managed to write partitions={} to sstable",
@@ -3844,83 +3848,3 @@ future<uint32_t> repair_service::get_next_repair_meta_id() {
locator::host_id repair_service::my_host_id() const noexcept {
return _gossiper.local().my_host_id();
}
future<size_t> count_finished_tablets(utils::chunked_vector<tablet_token_range> ranges1, utils::chunked_vector<tablet_token_range> ranges2) {
if (ranges1.empty() || ranges2.empty()) {
co_return 0;
}
auto sort = [] (utils::chunked_vector<tablet_token_range>& ranges) {
std::sort(ranges.begin(), ranges.end(), [] (const auto& a, const auto& b) {
if (a.first_token != b.first_token) {
return a.first_token < b.first_token;
}
return a.last_token < b.last_token;
});
};
// First, merge overlapping and adjacent ranges in ranges2.
sort(ranges2);
utils::chunked_vector<tablet_token_range> merged;
merged.push_back(ranges2[0]);
for (size_t i = 1; i < ranges2.size(); ++i) {
co_await coroutine::maybe_yield();
// To avoid overflow with max() + 1, we check adjacency with `a - 1 <= b` instead of `a <= b + 1`
if (ranges2[i].first_token - 1 <= merged.back().last_token) {
merged.back().last_token = std::max(merged.back().last_token, ranges2[i].last_token);
} else {
merged.push_back(ranges2[i]);
}
}
// Count covered ranges using a linear scan
size_t covered_count = 0;
auto it = merged.begin();
auto end = merged.end();
sort(ranges1);
for (const auto& r1 : ranges1) {
co_await coroutine::maybe_yield();
// Advance the merged iterator only if the current merged range ends
// before the current r1 starts.
while (it != end && it->last_token < r1.first_token) {
co_await coroutine::maybe_yield();
++it;
}
// If we have exhausted the merged ranges, no further r1 can be covered
if (it == end) {
break;
}
// Check if the current merged range covers r1.
if (it->first_token <= r1.first_token && r1.last_token <= it->last_token) {
covered_count++;
}
}
co_return covered_count;
}
future<std::optional<repair_task_progress>> repair_service::get_tablet_repair_task_progress(tasks::task_id task_uuid) {
utils::chunked_vector<tablet_token_range> requested_tablets;
utils::chunked_vector<tablet_token_range> finished_tablets;
table_id tid;
if (!_db.local().features().tablet_repair_tasks_table) {
co_return std::nullopt;
}
co_await _sys_ks.local().get_repair_task(task_uuid, [&tid, &requested_tablets, &finished_tablets] (const db::system_keyspace::repair_task_entry& entry) -> future<> {
rlogger.debug("repair_task_progress: Get entry operation={} first_token={} last_token={}", entry.operation, entry.first_token, entry.last_token);
if (entry.operation == db::system_keyspace::repair_task_operation::requested) {
requested_tablets.push_back({entry.first_token, entry.last_token});
} else if (entry.operation == db::system_keyspace::repair_task_operation::finished) {
finished_tablets.push_back({entry.first_token, entry.last_token});
}
tid = entry.table_uuid;
co_return;
});
auto requested = requested_tablets.size();
auto finished_nomerge = finished_tablets.size();
auto finished = co_await count_finished_tablets(std::move(requested_tablets), std::move(finished_tablets));
auto progress = repair_task_progress{requested, finished, tid};
rlogger.debug("repair_task_progress: task_uuid={} table_uuid={} requested_tablets={} finished_tablets={} progress={} finished_nomerge={}",
task_uuid, tid, requested, finished, progress.progress(), finished_nomerge);
co_return progress;
}

View File

@@ -99,15 +99,6 @@ public:
using host2ip_t = std::function<future<gms::inet_address> (locator::host_id)>;
struct repair_task_progress {
size_t requested;
size_t finished;
table_id table_uuid;
float progress() const {
return requested == 0 ? 1.0 : float(finished) / requested;
}
};
class repair_service : public seastar::peering_sharded_service<repair_service> {
sharded<service::topology_state_machine>& _tsm;
sharded<gms::gossiper>& _gossiper;
@@ -231,9 +222,6 @@ private:
public:
future<gc_clock::time_point> repair_tablet(gms::gossip_address_map& addr_map, locator::tablet_metadata_guard& guard, locator::global_tablet_id gid, tasks::task_info global_tablet_repair_task_info, service::frozen_topology_guard topo_guard, std::optional<locator::tablet_replica_set> rebuild_replicas, locator::tablet_transition_stage stage);
future<std::optional<repair_task_progress>> get_tablet_repair_task_progress(tasks::task_id task_uuid);
private:
future<repair_update_system_table_response> repair_update_system_table_handler(
@@ -338,12 +326,3 @@ future<std::list<repair_row>> to_repair_rows_list(repair_rows_on_wire rows,
schema_ptr s, uint64_t seed, repair_master is_master,
reader_permit permit, repair_hasher hasher);
void flush_rows(schema_ptr s, std::list<repair_row>& rows, lw_shared_ptr<repair_writer>& writer, std::optional<small_table_optimization_params> small_table_optimization = std::nullopt, repair_meta* rm = nullptr);
// A struct to hold the first and last token of a tablet.
struct tablet_token_range {
int64_t first_token;
int64_t last_token;
};
// Function to count the number of ranges in ranges1 covered by the merged ranges of ranges2.
future<size_t> count_finished_tablets(utils::chunked_vector<tablet_token_range> ranges1, utils::chunked_vector<tablet_token_range> ranges2);

View File

@@ -297,17 +297,17 @@ public:
const dht::token_range& token_range() const noexcept;
size_t memtable_count() const noexcept;
size_t memtable_count() const;
const compaction_group_ptr& main_compaction_group() const noexcept;
const std::vector<compaction_group_ptr>& split_ready_compaction_groups() const;
compaction_group_ptr& select_compaction_group(locator::tablet_range_side) noexcept;
uint64_t live_disk_space_used() const noexcept;
uint64_t live_disk_space_used() const;
void for_each_compaction_group(std::function<void(const compaction_group_ptr&)> action) const noexcept;
utils::small_vector<compaction_group_ptr, 3> compaction_groups() noexcept;
utils::small_vector<const_compaction_group_ptr, 3> compaction_groups() const noexcept;
void for_each_compaction_group(std::function<void(const compaction_group_ptr&)> action) const;
utils::small_vector<compaction_group_ptr, 3> compaction_groups();
utils::small_vector<const_compaction_group_ptr, 3> compaction_groups() const;
utils::small_vector<compaction_group_ptr, 3> split_unready_groups() const;
bool split_unready_groups_are_empty() const;
@@ -430,7 +430,7 @@ public:
virtual storage_group& storage_group_for_token(dht::token) const = 0;
virtual utils::chunked_vector<storage_group_ptr> storage_groups_for_token_range(dht::token_range tr) const = 0;
virtual locator::combined_load_stats table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)> tablet_filter) const noexcept = 0;
virtual locator::combined_load_stats table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)> tablet_filter) const = 0;
virtual bool all_storage_groups_split() = 0;
virtual future<> split_all_storage_groups(tasks::task_info tablet_split_task_info) = 0;
virtual future<> maybe_split_compaction_group_of(size_t idx) = 0;

View File

@@ -96,6 +96,9 @@ public:
virtual const secondary_index::secondary_index_manager& get_index_manager(data_dictionary::table t) const override {
return const_cast<replica::table&>(unwrap(t)).get_index_manager();
}
virtual db_clock::time_point get_truncation_time(data_dictionary::table t) const override {
return const_cast<replica::table&>(unwrap(t)).get_truncation_time();
}
virtual lw_shared_ptr<keyspace_metadata> get_keyspace_metadata(data_dictionary::keyspace ks) const override {
return unwrap(ks).metadata();
}

View File

@@ -2810,26 +2810,26 @@ future<> database::drop_cache_for_keyspace_on_all_shards(sharded<database>& shar
});
}
future<> database::snapshot_table_on_all_shards(sharded<database>& sharded_db, table_id uuid, sstring tag, db::snapshot_options opts) {
if (!opts.skip_flush) {
future<> database::snapshot_table_on_all_shards(sharded<database>& sharded_db, table_id uuid, sstring tag, bool skip_flush) {
if (!skip_flush) {
co_await flush_table_on_all_shards(sharded_db, uuid);
}
auto table_shards = co_await get_table_on_all_shards(sharded_db, uuid);
co_await table::snapshot_on_all_shards(sharded_db, table_shards, tag, opts);
co_await table::snapshot_on_all_shards(sharded_db, table_shards, tag);
}
future<> database::snapshot_tables_on_all_shards(sharded<database>& sharded_db, std::string_view ks_name, std::vector<sstring> table_names, sstring tag, db::snapshot_options opts) {
return parallel_for_each(table_names, [&sharded_db, ks_name, tag = std::move(tag), opts] (auto& table_name) {
future<> database::snapshot_tables_on_all_shards(sharded<database>& sharded_db, std::string_view ks_name, std::vector<sstring> table_names, sstring tag, bool skip_flush) {
return parallel_for_each(table_names, [&sharded_db, ks_name, tag = std::move(tag), skip_flush] (auto& table_name) {
auto uuid = sharded_db.local().find_uuid(ks_name, table_name);
return snapshot_table_on_all_shards(sharded_db, uuid, tag, opts);
return snapshot_table_on_all_shards(sharded_db, uuid, tag, skip_flush);
});
}
future<> database::snapshot_keyspace_on_all_shards(sharded<database>& sharded_db, std::string_view ks_name, sstring tag, db::snapshot_options opts) {
future<> database::snapshot_keyspace_on_all_shards(sharded<database>& sharded_db, std::string_view ks_name, sstring tag, bool skip_flush) {
auto& ks = sharded_db.local().find_keyspace(ks_name);
co_await coroutine::parallel_for_each(ks.metadata()->cf_meta_data(), [&, tag = std::move(tag), opts] (const auto& pair) -> future<> {
co_await coroutine::parallel_for_each(ks.metadata()->cf_meta_data(), [&, tag = std::move(tag), skip_flush] (const auto& pair) -> future<> {
auto uuid = pair.second->id();
co_await snapshot_table_on_all_shards(sharded_db, uuid, tag, opts);
co_await snapshot_table_on_all_shards(sharded_db, uuid, tag, skip_flush);
});
}
@@ -2951,12 +2951,7 @@ future<> database::truncate_table_on_all_shards(sharded<database>& sharded_db, s
auto truncated_at = truncated_at_opt.value_or(db_clock::now());
auto name = snapshot_name_opt.value_or(
format("{:d}-{}", truncated_at.time_since_epoch().count(), cf.schema()->cf_name()));
// Use the sstable identifier in snapshot names to allow de-duplication of sstables
// at backup time even if they were migrated across shards or nodes and were renamed a given a new generation.
// We hard-code that here since we have no way to pass this option to auto-snapshot and
// it is always safe to use the sstable identifier for the sstable generation.
auto opts = db::snapshot_options{.use_sstable_identifier = true};
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name, opts);
co_await table::snapshot_on_all_shards(sharded_db, table_shards, name);
}
co_await sharded_db.invoke_on_all([&] (database& db) {

View File

@@ -1040,12 +1040,12 @@ public:
private:
using snapshot_file_set = foreign_ptr<std::unique_ptr<std::unordered_set<sstring>>>;
future<snapshot_file_set> take_snapshot(sstring jsondir, db::snapshot_options opts);
future<snapshot_file_set> take_snapshot(sstring jsondir);
// Writes the table schema and the manifest of all files in the snapshot directory.
future<> finalize_snapshot(const global_table_ptr& table_shards, sstring jsondir, std::vector<snapshot_file_set> file_sets);
static future<> seal_snapshot(sstring jsondir, std::vector<snapshot_file_set> file_sets);
public:
static future<> snapshot_on_all_shards(sharded<database>& sharded_db, const global_table_ptr& table_shards, sstring name, db::snapshot_options opts);
static future<> snapshot_on_all_shards(sharded<database>& sharded_db, const global_table_ptr& table_shards, sstring name);
future<std::unordered_map<sstring, snapshot_details>> get_snapshot_details();
static future<snapshot_details> get_snapshot_details(std::filesystem::path snapshot_dir, std::filesystem::path datadir);
@@ -1133,7 +1133,7 @@ public:
// The tablet filter is used to not double account migrating tablets, so it's important that
// only one of pending or leaving replica is accounted based on current migration stage.
locator::combined_load_stats table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)> tablet_filter) const noexcept;
locator::combined_load_stats table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)> tablet_filter) const;
const db::view::stats& get_view_stats() const {
return _view_stats;
@@ -2009,9 +2009,9 @@ public:
static future<> drop_cache_for_table_on_all_shards(sharded<database>& sharded_db, table_id id);
static future<> drop_cache_for_keyspace_on_all_shards(sharded<database>& sharded_db, std::string_view ks_name);
static future<> snapshot_table_on_all_shards(sharded<database>& sharded_db, table_id id, sstring tag, db::snapshot_options opts);
static future<> snapshot_tables_on_all_shards(sharded<database>& sharded_db, std::string_view ks_name, std::vector<sstring> table_names, sstring tag, db::snapshot_options opts);
static future<> snapshot_keyspace_on_all_shards(sharded<database>& sharded_db, std::string_view ks_name, sstring tag, db::snapshot_options opts);
static future<> snapshot_table_on_all_shards(sharded<database>& sharded_db, table_id id, sstring tag, bool skip_flush);
static future<> snapshot_tables_on_all_shards(sharded<database>& sharded_db, std::string_view ks_name, std::vector<sstring> table_names, sstring tag, bool skip_flush);
static future<> snapshot_keyspace_on_all_shards(sharded<database>& sharded_db, std::string_view ks_name, sstring tag, bool skip_flush);
public:
bool update_column_family(schema_ptr s);

View File

@@ -215,7 +215,7 @@ private:
output_ck_raw_values.emplace_back(bytes{});
}
}
if (underlying_ck_raw_values.empty()) {
if (pos.region() != partition_region::clustered) {
output_ck_raw_values.push_back(bytes{});
} else {
output_ck_raw_values.push_back(data_value(static_cast<int8_t>(pos.get_bound_weight())).serialize_nonnull());

View File

@@ -16,6 +16,7 @@
#include <seastar/coroutine/as_future.hh>
#include <seastar/util/closeable.hh>
#include <seastar/util/defer.hh>
#include <seastar/json/json_elements.hh>
#include "dht/decorated_key.hh"
#include "replica/database.hh"
@@ -708,7 +709,7 @@ public:
return *_single_sg;
}
locator::combined_load_stats table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)>) const noexcept override {
locator::combined_load_stats table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)>) const override {
return locator::combined_load_stats{
.table_ls = locator::table_load_stats{
.size_in_bytes = _single_sg->live_disk_space_used(),
@@ -874,7 +875,7 @@ public:
return storage_group_for_id(storage_group_of(token).first);
}
locator::combined_load_stats table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)> tablet_filter) const noexcept override;
locator::combined_load_stats table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)> tablet_filter) const override;
bool all_storage_groups_split() override;
future<> split_all_storage_groups(tasks::task_info tablet_split_task_info) override;
future<> maybe_split_compaction_group_of(size_t idx) override;
@@ -922,7 +923,7 @@ compaction_group_ptr& storage_group::select_compaction_group(locator::tablet_ran
return _main_cg;
}
void storage_group::for_each_compaction_group(std::function<void(const compaction_group_ptr&)> action) const noexcept {
void storage_group::for_each_compaction_group(std::function<void(const compaction_group_ptr&)> action) const {
action(_main_cg);
for (auto& cg : _merging_groups) {
action(cg);
@@ -932,7 +933,7 @@ void storage_group::for_each_compaction_group(std::function<void(const compactio
}
}
utils::small_vector<compaction_group_ptr, 3> storage_group::compaction_groups() noexcept {
utils::small_vector<compaction_group_ptr, 3> storage_group::compaction_groups() {
utils::small_vector<compaction_group_ptr, 3> cgs;
for_each_compaction_group([&cgs] (const compaction_group_ptr& cg) {
cgs.push_back(cg);
@@ -940,7 +941,7 @@ utils::small_vector<compaction_group_ptr, 3> storage_group::compaction_groups()
return cgs;
}
utils::small_vector<const_compaction_group_ptr, 3> storage_group::compaction_groups() const noexcept {
utils::small_vector<const_compaction_group_ptr, 3> storage_group::compaction_groups() const {
utils::small_vector<const_compaction_group_ptr, 3> cgs;
for_each_compaction_group([&cgs] (const compaction_group_ptr& cg) {
cgs.push_back(cg);
@@ -1890,7 +1891,7 @@ sstables::file_size_stats compaction_group::live_disk_space_used_full_stats() co
return _main_sstables->get_file_size_stats() + _maintenance_sstables->get_file_size_stats();
}
uint64_t storage_group::live_disk_space_used() const noexcept {
uint64_t storage_group::live_disk_space_used() const {
auto cgs = const_cast<storage_group&>(*this).compaction_groups();
return std::ranges::fold_left(cgs | std::views::transform(std::mem_fn(&compaction_group::live_disk_space_used)), uint64_t(0), std::plus{});
}
@@ -2813,7 +2814,7 @@ void table::on_flush_timer() {
});
}
locator::combined_load_stats tablet_storage_group_manager::table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)> tablet_filter) const noexcept {
locator::combined_load_stats tablet_storage_group_manager::table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)> tablet_filter) const {
locator::table_load_stats table_stats;
table_stats.split_ready_seq_number = _split_ready_seq_number;
@@ -2836,7 +2837,7 @@ locator::combined_load_stats tablet_storage_group_manager::table_load_stats(std:
};
}
locator::combined_load_stats table::table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)> tablet_filter) const noexcept {
locator::combined_load_stats table::table_load_stats(std::function<bool(const locator::tablet_map&, locator::global_tablet_id)> tablet_filter) const {
return _sg_manager->table_load_stats(std::move(tablet_filter));
}
@@ -3198,23 +3199,35 @@ db::replay_position table::highest_flushed_replay_position() const {
return _highest_flushed_rp;
}
struct manifest_json : public json::json_base {
json::json_chunked_list<sstring> files;
manifest_json() {
register_params();
}
manifest_json(manifest_json&& e) {
register_params();
files = std::move(e.files);
}
manifest_json& operator=(manifest_json&& e) {
files = std::move(e.files);
return *this;
}
private:
void register_params() {
add(&files, "files");
}
};
future<>
table::seal_snapshot(sstring jsondir, std::vector<snapshot_file_set> file_sets) {
std::ostringstream ss;
int n = 0;
ss << "{" << std::endl << "\t\"files\" : [ ";
manifest_json manifest;
for (const auto& fsp : file_sets) {
for (const auto& rf : *fsp) {
if (n++ > 0) {
ss << ", ";
for (auto& rf : *fsp) {
manifest.files.push(std::move(rf));
}
ss << "\"" << rf << "\"";
co_await coroutine::maybe_yield();
}
}
ss << " ]" << std::endl << "}" << std::endl;
auto json = ss.str();
auto streamer = json::stream_object(std::move(manifest));
auto jsonfile = jsondir + "/manifest.json";
tlogger.debug("Storing manifest {}", jsonfile);
@@ -3224,12 +3237,10 @@ table::seal_snapshot(sstring jsondir, std::vector<snapshot_file_set> file_sets)
auto out = co_await make_file_output_stream(std::move(f));
std::exception_ptr ex;
try {
co_await out.write(json.c_str(), json.size());
co_await out.flush();
co_await streamer(std::move(out));
} catch (...) {
ex = std::current_exception();
}
co_await out.close();
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
@@ -3268,7 +3279,7 @@ future<> table::write_schema_as_cql(const global_table_ptr& table_shards, sstrin
}
// Runs the orchestration code on an arbitrary shard to balance the load.
future<> table::snapshot_on_all_shards(sharded<database>& sharded_db, const global_table_ptr& table_shards, sstring name, db::snapshot_options opts) {
future<> table::snapshot_on_all_shards(sharded<database>& sharded_db, const global_table_ptr& table_shards, sstring name) {
auto* so = std::get_if<storage_options::local>(&table_shards->get_storage_options().value);
if (so == nullptr) {
throw std::runtime_error("Snapshotting non-local tables is not implemented");
@@ -3291,7 +3302,7 @@ future<> table::snapshot_on_all_shards(sharded<database>& sharded_db, const glob
co_await io_check([&jsondir] { return recursive_touch_directory(jsondir); });
co_await coroutine::parallel_for_each(smp::all_cpus(), [&] (unsigned shard) -> future<> {
file_sets.emplace_back(co_await smp::submit_to(shard, [&] {
return table_shards->take_snapshot(jsondir, opts);
return table_shards->take_snapshot(jsondir);
}));
});
co_await io_check(sync_directory, jsondir);
@@ -3300,22 +3311,19 @@ future<> table::snapshot_on_all_shards(sharded<database>& sharded_db, const glob
});
}
future<table::snapshot_file_set> table::take_snapshot(sstring jsondir, db::snapshot_options opts) {
tlogger.trace("take_snapshot {}: use_sstable_identifier={}", jsondir, opts.use_sstable_identifier);
future<table::snapshot_file_set> table::take_snapshot(sstring jsondir) {
tlogger.trace("take_snapshot {}", jsondir);
auto sstable_deletion_guard = co_await get_sstable_list_permit();
auto tables = *_sstables->all() | std::ranges::to<std::vector<sstables::shared_sstable>>();
auto table_names = std::make_unique<std::unordered_set<sstring>>();
auto& ks_name = schema()->ks_name();
auto& cf_name = schema()->cf_name();
co_await _sstables_manager.dir_semaphore().parallel_for_each(tables, [&, opts] (sstables::shared_sstable sstable) -> future<> {
auto gen = co_await io_check([sstable, &dir = jsondir, opts] {
return sstable->snapshot(dir, opts.use_sstable_identifier);
co_await _sstables_manager.dir_semaphore().parallel_for_each(tables, [&jsondir, &table_names] (sstables::shared_sstable sstable) {
table_names->insert(sstable->component_basename(sstables::component_type::Data));
return io_check([sstable, &dir = jsondir] {
return sstable->snapshot(dir);
});
auto fname = sstable->component_basename(ks_name, cf_name, sstable->get_version(), gen, sstable->get_format(), sstables::component_type::Data);
table_names->insert(fname);
});
co_return make_foreign(std::move(table_names));
}
@@ -3456,7 +3464,7 @@ size_t compaction_group::memtable_count() const noexcept {
return _memtables->size();
}
size_t storage_group::memtable_count() const noexcept {
size_t storage_group::memtable_count() const {
return std::ranges::fold_left(compaction_groups() | std::views::transform(std::mem_fn(&compaction_group::memtable_count)), size_t(0), std::plus{});
}

View File

@@ -22,7 +22,7 @@ format_match = re.compile(r'\s*(?:seastar::)?format\(\s*"([^"]+)"\s*,\s*(.*)\s*'
def handle_error(message, strict=True, verbose_mode=False):
if strict:
print(f"[ERROR] {message}")
exit(-1)
exit(1)
elif verbose_mode:
print(f"[WARNING] {message}")
@@ -180,12 +180,11 @@ def get_metrics_from_file(file_name, prefix, metrics_information, verb=None, str
groups = {}
if clean_name in metrics_information:
if (isinstance(metrics_information[clean_name], str) and metrics_information[clean_name] == "skip") or "skip" in metrics_information[clean_name]:
exit(0)
return {}
param_mapping = metrics_information[clean_name]["params"] if clean_name in metrics_information and "params" in metrics_information[clean_name] else {}
groups = metrics_information[clean_name]["groups"] if clean_name in metrics_information and "groups" in metrics_information[clean_name] else {}
metrics = {}
multi_line = False
names = undefined
typ = undefined
line_number = 0;

View File

@@ -1,6 +1,6 @@
"cdc/log.cc":
params:
cdc_group_name: cdc
cdc_group_name: "cdc"
part_name;suffix: [["static_row", "total"],["clustering_row", "total"], ["map", "total"], ["set", "total"], ["list", "total"], ["udt", "total"], ["range_tombstone", "total"],["partition_delete", "total"],["row_delete", "total"], ["static_row", "failed"],["clustering_row", "failed"], ["map", "failed"], ["set", "failed"], ["list", "failed"], ["udt", "failed"], ["range_tombstone", "failed"],["partition_delete", "failed"],["row_delete", "failed"]]
kind: ["total", "failed"]
"db/commitlog/commitlog.cc":
@@ -9,7 +9,7 @@
"cfg.max_active_flushes": "cfg.max_active_flushes"
"cql3/query_processor.cc":
groups:
"80": query_processor
"80": "query_processor"
"replica/dirty_memory_manager.cc":
params:
namestr: ["regular", "system"]
@@ -19,10 +19,11 @@
"replica/database.cc":
params:
"_dirty_memory_manager.throttle_threshold()": "throttle threshold"
"seastar/apps/metrics_tester/metrics_tester.cc": skip
"seastar/tests/unit/metrics_test.cc": skip
"seastar/tests/unit/metrics_tester.cc": skip
"seastar/tests/unit/prometheus_http_test.cc": skip
"seastar/apps/metrics_tester/metrics_tester.cc": "skip"
"seastar/tests/unit/metrics_test.cc": "skip"
"seastar/tests/unit/metrics_tester.cc": "skip"
"seastar/tests/unit/prometheus_http_test.cc": "skip"
"seastar/tests/unit/prometheus_text_test.cc": "skip"
"service/storage_proxy.cc":
params:
COORDINATOR_STATS_CATEGORY: "storage_proxy_coordinator"
@@ -32,25 +33,25 @@
_short_description_prefix: ["total_write_attempts", "write_errors", "background_replica_writes_failed", "read_repair_write_attempts"]
_long_description_prefix: ["total number of write requests", "number of write requests that failed", "background_replica_writes_failed", "number of write operations in a read repair context"]
_category: "storage_proxy_coordinator"
"thrift/server.cc": skip
"thrift/server.cc": "skip"
"tracing/tracing.cc":
params:
"max_pending_trace_records + write_event_records_threshold": "max_pending_trace_records + write_event_records_threshold"
"transport/server.cc":
groups:
"200": transport
"200": "transport"
params:
"_config.max_request_size": "max_request_size"
"seastar/src/net/dpdk.cc": skip
"seastar/src/net/dpdk.cc": "skip"
"db/hints/manager.cc":
params:
"group_name": ["hints_for_views_manager", "hints_manager"]
"seastar/src/core/execution_stage.cc":
groups:
"100": execution_stages
"100": "execution_stages"
"seastar/src/core/fair_queue.cc":
groups:
"300": io_queue
"300": "io_queue"
"seastar/src/net/net.cc":
params:
_stats_plugin_name: ["stats_plugin_name"]

View File

@@ -818,6 +818,9 @@ class std_list:
self._node = node_header['_M_next']
self._end = node_header['_M_next']['_M_prev']
def __iter__(self):
return self
def __next__(self):
if self._node == self._end:
raise StopIteration()

View File

@@ -36,6 +36,7 @@
#include <seastar/core/future-util.hh>
#include "db/read_repair_decision.hh"
#include "db/config.hh"
#include "db/batchlog.hh"
#include "db/batchlog_manager.hh"
#include "db/hints/manager.hh"
#include "db/system_keyspace.hh"
@@ -4281,12 +4282,13 @@ storage_proxy::mutate_atomically_result(utils::chunked_vector<mutation> mutation
coordinator_mutate_options _options;
const utils::UUID _batch_uuid;
const db_clock::time_point _batch_write_time;
const host_id_vector_replica_set _batchlog_endpoints;
public:
context(storage_proxy & p, utils::chunked_vector<mutation>&& mutations, lw_shared_ptr<cdc::operation_result_tracker>&& cdc_tracker, db::consistency_level cl, clock_type::time_point timeout, tracing::trace_state_ptr tr_state, service_permit permit, coordinator_mutate_options options)
: _p(p)
, _schema(_p.local_db().find_schema(db::system_keyspace::NAME, db::system_keyspace::BATCHLOG))
, _schema(_p.local_db().find_schema(db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2))
, _ermp(_p.local_db().find_column_family(_schema->id()).get_effective_replication_map())
, _mutations(std::move(mutations))
, _cdc_tracker(std::move(cdc_tracker))
@@ -4297,6 +4299,7 @@ storage_proxy::mutate_atomically_result(utils::chunked_vector<mutation> mutation
, _permit(std::move(permit))
, _options(std::move(options))
, _batch_uuid(utils::UUID_gen::get_time_UUID())
, _batch_write_time(db_clock::now())
, _batchlog_endpoints(
[this]() -> host_id_vector_replica_set {
auto local_addr = _p.my_host_id(*_ermp);
@@ -4334,17 +4337,14 @@ storage_proxy::mutate_atomically_result(utils::chunked_vector<mutation> mutation
}));
}
future<result<>> sync_write_to_batchlog() {
auto m = _p.do_get_batchlog_mutation_for(_schema, _mutations, _batch_uuid, netw::messaging_service::current_version, db_clock::now());
auto m = db::get_batchlog_mutation_for(_schema, _mutations, netw::messaging_service::current_version, _batch_write_time, _batch_uuid);
tracing::trace(_trace_state, "Sending a batchlog write mutation");
return send_batchlog_mutation(std::move(m));
};
future<> async_remove_from_batchlog() {
// delete batch
utils::get_local_injector().inject("storage_proxy_fail_remove_from_batchlog", [] { throw std::runtime_error("Error injection: failing remove from batchlog"); });
auto key = partition_key::from_exploded(*_schema, {uuid_type->decompose(_batch_uuid)});
auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
mutation m(_schema, key);
m.partition().apply_delete(*_schema, clustering_key_prefix::make_empty(), tombstone(now, gc_clock::now()));
auto m = db::get_batchlog_delete_mutation(_schema, netw::messaging_service::current_version, _batch_write_time, _batch_uuid);
tracing::trace(_trace_state, "Sending a batchlog remove mutation");
return send_batchlog_mutation(std::move(m), db::consistency_level::ANY).then_wrapped([] (future<result<>> f) {
@@ -4363,6 +4363,7 @@ storage_proxy::mutate_atomically_result(utils::chunked_vector<mutation> mutation
return _p.mutate_prepare(_mutations, _cl, db::write_type::BATCH, _trace_state, _permit, db::allow_per_partition_rate_limit::no, _options).then(utils::result_wrap([this] (unique_response_handler_vector ids) {
return sync_write_to_batchlog().then(utils::result_wrap([this, ids = std::move(ids)] () mutable {
tracing::trace(_trace_state, "Sending batch mutations");
utils::get_local_injector().inject("storage_proxy_fail_send_batch", [] { throw std::runtime_error("Error injection: failing to send batch"); });
_p.register_cdc_operation_result_tracker(ids, _cdc_tracker);
return _p.mutate_begin(std::move(ids), _cl, _trace_state, _timeout);
})).then(utils::result_wrap([this] {
@@ -4398,33 +4399,6 @@ storage_proxy::mutate_atomically_result(utils::chunked_vector<mutation> mutation
}).then_wrapped(std::move(cleanup));
}
mutation storage_proxy::get_batchlog_mutation_for(const utils::chunked_vector<mutation>& mutations, const utils::UUID& id, int32_t version, db_clock::time_point now) {
auto schema = local_db().find_schema(db::system_keyspace::NAME, db::system_keyspace::BATCHLOG);
return do_get_batchlog_mutation_for(std::move(schema), mutations, id, version, now);
}
mutation storage_proxy::do_get_batchlog_mutation_for(schema_ptr schema, const utils::chunked_vector<mutation>& mutations, const utils::UUID& id, int32_t version, db_clock::time_point now) {
auto key = partition_key::from_singular(*schema, id);
auto timestamp = api::new_timestamp();
auto data = [&mutations] {
utils::chunked_vector<canonical_mutation> fm(mutations.begin(), mutations.end());
bytes_ostream out;
for (auto& m : fm) {
ser::serialize(out, m);
}
return std::move(out).to_managed_bytes();
}();
mutation m(schema, key);
m.set_cell(clustering_key_prefix::make_empty(), to_bytes("version"), version, timestamp);
m.set_cell(clustering_key_prefix::make_empty(), to_bytes("written_at"), now, timestamp);
// Avoid going through data_value and therefore `bytes`, as it can be large (#24809).
auto cdef_data = schema->get_column_definition(to_bytes("data"));
m.set_cell(clustering_key_prefix::make_empty(), *cdef_data, atomic_cell::make_live(*cdef_data->type, timestamp, std::move(data)));
return m;
}
template<typename Range>
bool storage_proxy::cannot_hint(const Range& targets, db::write_type type) const {
// if hints are disabled we "can always hint" since there's going to be no hint generated in this case
@@ -4528,14 +4502,14 @@ future<> storage_proxy::send_hint_to_all_replicas(frozen_mutation_and_schema fm_
}
future<> storage_proxy::send_batchlog_replay_to_all_replicas(utils::chunked_vector<mutation> mutations, clock_type::time_point timeout) {
if (utils::get_local_injector().is_enabled("batch_replay_throw")) {
throw std::runtime_error("Skipping batch replay due to batch_replay_throw injection");
}
utils::get_local_injector().inject("storage_proxy_fail_replay_batch", [] { throw std::runtime_error("Error injection: failing to send batch"); });
utils::chunked_vector<batchlog_replay_mutation> ms = mutations | std::views::transform([] (auto&& m) {
return batchlog_replay_mutation(std::move(m));
}) | std::ranges::to<utils::chunked_vector<batchlog_replay_mutation>>();
utils::get_local_injector().inject("storage_proxy_fail_replay_batch", [] { throw std::runtime_error("Error injection: failing to send batch"); });
return mutate_internal(std::move(ms), db::consistency_level::EACH_QUORUM, nullptr, empty_service_permit(), timeout, db::write_type::BATCH)
.then(utils::result_into_future<result<>>);
}

View File

@@ -683,7 +683,6 @@ private:
fencing_token caller_token, locator::host_id caller_id,
Func&& write_func);
mutation do_get_batchlog_mutation_for(schema_ptr schema, const utils::chunked_vector<mutation>& mutations, const utils::UUID& id, int32_t version, db_clock::time_point now);
future<> drain_on_shutdown();
public:
void update_fence_version(locator::token_metadata::version_t fence_version);
@@ -834,8 +833,6 @@ public:
db::consistency_level cl_for_paxos, db::consistency_level cl_for_learn,
clock_type::time_point write_timeout, clock_type::time_point cas_timeout, bool write = true, cdc::per_request_options cdc_opts = {});
mutation get_batchlog_mutation_for(const utils::chunked_vector<mutation>& mutations, const utils::UUID& id, int32_t version, db_clock::time_point now);
future<> stop();
future<> start_hints_manager();
void allow_replaying_hints() noexcept;

View File

@@ -6822,7 +6822,6 @@ future<std::unordered_map<sstring, sstring>> storage_service::add_repair_tablet_
});
}
auto ts = db_clock::now();
for (const auto& token : tokens) {
auto tid = tmap.get_tablet_id(token);
auto& tinfo = tmap.get_tablet_info(tid);
@@ -6836,20 +6835,6 @@ future<std::unordered_map<sstring, sstring>> storage_service::add_repair_tablet_
tablet_mutation_builder_for_base_table(guard.write_timestamp(), table)
.set_repair_task_info(last_token, repair_task_info, _feature_service)
.build());
db::system_keyspace::repair_task_entry entry{
.task_uuid = tasks::task_id(repair_task_info.tablet_task_id.uuid()),
.operation = db::system_keyspace::repair_task_operation::requested,
.first_token = dht::token::to_int64(tmap.get_first_token(tid)),
.last_token = dht::token::to_int64(tmap.get_last_token(tid)),
.timestamp = ts,
.table_uuid = table,
};
if (_feature_service.tablet_repair_tasks_table) {
auto cmuts = co_await _sys_ks.local().get_update_repair_task_mutations(entry, guard.write_timestamp());
for (auto& m : cmuts) {
updates.push_back(std::move(m));
}
}
}
sstring reason = format("Repair tablet by API request tokens={} tablet_task_id={}", tokens, repair_task_info.tablet_task_id);
@@ -7546,7 +7531,7 @@ future<join_node_response_result> storage_service::join_node_response_handler(jo
&& _join_node_response_done.failed()) {
// The topology coordinator accepted the node that was rejected before or failed while handling
// the response. Inform the coordinator about it so it moves the node to the left state.
throw _join_node_response_done.get_shared_future().get_exception();
co_await coroutine::return_exception_ptr(_join_node_response_done.get_shared_future().get_exception());
}
co_return join_node_response_result{};

View File

@@ -136,17 +136,6 @@ db::tablet_options combine_tablet_options(R&& opts) {
return combined_opts;
}
static std::unordered_set<locator::tablet_id> split_string_to_tablet_id(std::string_view s, char delimiter) {
auto tokens_view = s | std::views::split(delimiter)
| std::views::transform([](auto&& range) {
return std::string_view(&*range.begin(), std::ranges::distance(range));
})
| std::views::transform([](std::string_view sv) {
return locator::tablet_id(std::stoul(std::string(sv)));
});
return std::unordered_set<locator::tablet_id>{tokens_view.begin(), tokens_view.end()};
}
// Used to compare different migration choices in regard to impact on load imbalance.
// There is a total order on migration_badness such that better migrations are ordered before worse ones.
struct migration_badness {
@@ -904,8 +893,6 @@ public:
co_await coroutine::maybe_yield();
auto& config = tmap.repair_scheduler_config();
auto now = db_clock::now();
auto skip = utils::get_local_injector().inject_parameter<std::string_view>("tablet_repair_skip_sched");
auto skip_tablets = skip ? split_string_to_tablet_id(*skip, ',') : std::unordered_set<locator::tablet_id>();
co_await tmap.for_each_tablet([&] (locator::tablet_id id, const locator::tablet_info& info) -> future<> {
auto gid = locator::global_tablet_id{table, id};
// Skip tablet that is in transitions.
@@ -926,11 +913,6 @@ public:
co_return;
}
if (skip_tablets.contains(id)) {
lblogger.debug("Skipped tablet repair for tablet={} by error injector", gid);
co_return;
}
// Avoid rescheduling a failed tablet repair in a loop
// TODO: Allow user to config
const auto min_reschedule_time = std::chrono::seconds(5);

View File

@@ -10,7 +10,6 @@
#include "replica/database.hh"
#include "service/migration_manager.hh"
#include "service/storage_service.hh"
#include "repair/row_level.hh"
#include "service/task_manager_module.hh"
#include "tasks/task_handler.hh"
#include "tasks/virtual_task_hint.hh"
@@ -110,16 +109,6 @@ future<std::optional<tasks::virtual_task_hint>> tablet_virtual_task::contains(ta
tid = tmap.next_tablet(*tid);
}
}
// Check if the task id is present in the repair task table
auto progress = co_await _ss._repair.local().get_tablet_repair_task_progress(task_id);
if (progress && progress->requested > 0) {
co_return tasks::virtual_task_hint{
.table_id = progress->table_uuid,
.task_type = locator::tablet_task_type::user_repair,
.tablet_id = std::nullopt,
};
}
co_return std::nullopt;
}
@@ -254,20 +243,7 @@ future<std::optional<status_helper>> tablet_virtual_task::get_status_helper(task
size_t sched_nr = 0;
auto tmptr = _ss.get_token_metadata_ptr();
auto& tmap = tmptr->tablets().get_tablet_map(table);
bool repair_task_finished = false;
bool repair_task_pending = false;
if (is_repair_task(task_type)) {
auto progress = co_await _ss._repair.local().get_tablet_repair_task_progress(id);
if (progress) {
res.status.progress.completed = progress->finished;
res.status.progress.total = progress->requested;
res.status.progress_units = "tablets";
if (progress->requested > 0 && progress->requested == progress->finished) {
repair_task_finished = true;
} if (progress->requested > 0 && progress->requested > progress->finished) {
repair_task_pending = true;
}
}
co_await tmap.for_each_tablet([&] (locator::tablet_id tid, const locator::tablet_info& info) {
auto& task_info = info.repair_task_info;
if (task_info.tablet_task_id.uuid() == id.uuid()) {
@@ -299,17 +275,7 @@ future<std::optional<status_helper>> tablet_virtual_task::get_status_helper(task
res.status.state = sched_nr == 0 ? tasks::task_manager::task_state::created : tasks::task_manager::task_state::running;
co_return res;
}
if (repair_task_pending) {
// When repair_task_pending is true, the res.tablets will be empty iff the request is aborted by user.
res.status.state = res.tablets.empty() ? tasks::task_manager::task_state::failed : tasks::task_manager::task_state::running;
co_return res;
}
if (repair_task_finished) {
res.status.state = tasks::task_manager::task_state::done;
co_return res;
}
// FIXME: Show finished tasks.
co_return std::nullopt;
}

View File

@@ -1205,8 +1205,6 @@ class topology_coordinator : public endpoint_lifecycle_subscriber {
std::unordered_map<locator::tablet_transition_stage, background_action_holder> barriers;
// Record the repair_time returned by the repair_tablet rpc call
db_clock::time_point repair_time;
// Record the repair task update muations
utils::chunked_vector<canonical_mutation> repair_task_updates;
service::session_id session_id;
};
@@ -1739,14 +1737,6 @@ class topology_coordinator : public endpoint_lifecycle_subscriber {
}
dst = dst_opt.value().host;
}
// Update repair task
db::system_keyspace::repair_task_entry entry{
.task_uuid = tasks::task_id(tinfo.repair_task_info.tablet_task_id.uuid()),
.operation = db::system_keyspace::repair_task_operation::finished,
.first_token = dht::token::to_int64(tmap.get_first_token(gid.tablet)),
.last_token = dht::token::to_int64(tmap.get_last_token(gid.tablet)),
.table_uuid = gid.table,
};
rtlogger.info("Initiating tablet repair host={} tablet={}", dst, gid);
auto session_id = utils::get_local_injector().enter("handle_tablet_migration_repair_random_session") ?
service::session_id::create_random_id() : trinfo->session_id;
@@ -1755,10 +1745,6 @@ class topology_coordinator : public endpoint_lifecycle_subscriber {
auto duration = std::chrono::duration<float>(db_clock::now() - sched_time);
auto& tablet_state = _tablets[tablet];
tablet_state.repair_time = db_clock::from_time_t(gc_clock::to_time_t(res.repair_time));
if (_feature_service.tablet_repair_tasks_table) {
entry.timestamp = db_clock::now();
tablet_state.repair_task_updates = co_await _sys_ks.get_update_repair_task_mutations(entry, api::new_timestamp());
}
rtlogger.info("Finished tablet repair host={} tablet={} duration={} repair_time={}",
dst, tablet, duration, res.repair_time);
})) {
@@ -1777,9 +1763,6 @@ class topology_coordinator : public endpoint_lifecycle_subscriber {
.set_stage(last_token, locator::tablet_transition_stage::end_repair)
.del_repair_task_info(last_token, _feature_service)
.del_session(last_token);
for (auto& m : tablet_state.repair_task_updates) {
updates.push_back(std::move(m));
}
// Skip update repair time in case hosts filter or dcs filter is set.
if (valid && is_filter_off) {
auto sched_time = tinfo.repair_task_info.sched_time;

View File

@@ -2117,14 +2117,11 @@ sstable::write_scylla_metadata(shard_id shard, struct run_identifier identifier,
}
sstable_id sid;
// Force a random sstable_id for testing purposes
bool random_sstable_identifier = utils::get_local_injector().is_enabled("random_sstable_identifier");
if (!random_sstable_identifier && generation().is_uuid_based()) {
if (generation().is_uuid_based()) {
sid = sstable_id(generation().as_uuid());
} else {
sid = sstable_id(utils::UUID_gen::get_time_UUID());
auto msg = random_sstable_identifier ? "forced random sstable_id" : "has numerical generation";
sstlog.info("SSTable {} {}. SSTable identifier in scylla_metadata set to {}", get_filename(), msg, sid);
sstlog.info("SSTable {} has numerical generation. SSTable identifier in scylla_metadata set to {}", get_filename(), sid);
}
_components->scylla_metadata->data.set<scylla_metadata_type::SSTableIdentifier>(scylla_metadata::sstable_identifier{sid});
@@ -2538,11 +2535,8 @@ std::vector<std::pair<component_type, sstring>> sstable::all_components() const
return all;
}
future<generation_type> sstable::snapshot(const sstring& dir, bool use_sstable_identifier) const {
// Use the sstable identifier UUID if available to enable global de-duplication of sstables in backup.
generation_type gen = (use_sstable_identifier && _sstable_identifier) ? generation_type(_sstable_identifier->uuid()) : _generation;
co_await _storage->snapshot(*this, dir, storage::absolute_path::yes, gen);
co_return gen;
future<> sstable::snapshot(const sstring& dir) const {
return _storage->snapshot(*this, dir, storage::absolute_path::yes);
}
future<> sstable::change_state(sstable_state to, delayed_commit_changes* delay_commit) {

View File

@@ -397,10 +397,6 @@ public:
return _version;
}
format_types get_format() const {
return _format;
}
// Returns the total bytes of all components.
uint64_t bytes_on_disk() const;
file_size_stats get_file_size_stats() const;
@@ -442,10 +438,7 @@ public:
std::vector<std::pair<component_type, sstring>> all_components() const;
// When use_sstable_identifier is true and the sstable identifier is available,
// use it to name the sstable in the snapshot, rather than the sstable generation.
// Returns the generation used for snapshot.
future<generation_type> snapshot(const sstring& dir, bool use_sstable_identifier = false) const;
future<> snapshot(const sstring& dir) const;
// Delete the sstable by unlinking all sstable files
// Ignores all errors.

View File

@@ -9,7 +9,7 @@ import pytest
import requests
from botocore.exceptions import ClientError
from test.alternator.test_manual_requests import get_signed_request
from .util import get_signed_request
# Test that trying to perform an operation signed with a wrong key

View File

@@ -13,30 +13,12 @@ import urllib3
from botocore.exceptions import BotoCoreError, ClientError
from packaging.version import Version
from test.alternator.util import random_bytes, random_string
from test.alternator.util import random_bytes, random_string, get_signed_request, manual_request, ManualRequestError
def gen_json(n):
return '{"":'*n + '{}' + '}'*n
def get_signed_request(dynamodb, target, payload):
# Usually "payload" will be a Python string and we'll write it as UTF-8.
# but in some tests we may want to write bytes directly - potentially
# bytes which include invalid UTF-8.
payload_bytes = payload if isinstance(payload, bytes) else payload.encode(encoding='UTF-8')
# NOTE: Signing routines use boto3 implementation details and may be prone
# to unexpected changes
class Request:
url=dynamodb.meta.client._endpoint.host
headers={'X-Amz-Target': 'DynamoDB_20120810.' + target, 'Content-Type': 'application/x-amz-json-1.0'}
body=payload_bytes
method='POST'
context={}
params={}
req = Request()
signer = dynamodb.meta.client._request_signer
signer.get_auth(signer.signing_name, signer.region_name).add_auth(request=req)
return req
# Test that deeply nested objects (e.g. with depth of 200k) are parsed correctly,
# i.e. do not cause stack overflows for the server. It's totally fine for the
@@ -483,6 +465,23 @@ def assert_validation_exception(response_text, request_info, accept_serializatio
(accept_serialization_exception and "SerializationException" in r['__type']), \
f"Unexpected error type {r['__type']} for {request_info}"
# Test that JSON parse errors have a reasonable type (in DynamoDB, it is
# SerializationException but in Alternator it is ValidationException), and
# if it contains a message, it doesn't contain garbage positions (reproduces
# issue #27372).
def test_json_parse_error(dynamodb, test_table):
with pytest.raises(ManualRequestError) as err:
manual_request(dynamodb, 'PutItem', '{broken json')
e = err.value
# In DynamoDB, we get a SerializationException with no message.
# In Alternator, we get a ValidationException with a message.
# For now, we'll accept both, but want to check that if we do have
# a message it gives the correct position of the error ('at 1') and
# not some garbage number (issue #27372).
assert e.type == 'SerializationException' or e.type == 'ValidationException'
if e.message:
assert e.message.endswith('at 1')
# Tests some invalid payloads (empty values, wrong types) to BatchWriteItem. Reproduces #23233
def test_batch_write_item_invalid_payload(dynamodb, test_table):
cases = [
@@ -587,3 +586,58 @@ def test_keep_alive(dynamodb, test_table, use_keep_alive):
finally:
urllib3.connection.HTTPConnection.connect = original_http_connect
urllib3.connection.HTTPSConnection.connect = original_https_connect
# Test that attempting to write a malformed value with PutItem, UpdateItem or
# BatchWriteItem fails. A "malformed value" is valid JSON but which doesn't
# conform to DynamoDB's value structure - maps with types. For example,
# {"S": "dog"} is a proper value (a string), but {"dog": "cat"} is malformed.
# We don't want to store the malformed value on disk and only discover the
# problem when reading the value back. We want the write to fail immediately,
# and this is what this test checks. Reproduces issue #8070.
@pytest.mark.xfail(reason="issue #8070")
@pytest.mark.parametrize("op", ['PutItem', 'UpdateItem', 'BatchWriteItem'])
def test_write_malformed_value(dynamodb, test_table_s, op):
p = random_string()
payloads = {
'PutItem': '''{
"TableName": "''' + test_table_s.name + '''",
"Item": {"p": {"S": "''' + p + '''"}, "x": %s}}''',
'UpdateItem': '''{
"TableName": "''' + test_table_s.name + '''",
"Key": {"p": {"S": "''' + p + '''"}},
"UpdateExpression": "SET x = :x",
"ExpressionAttributeValues": {":x": %s }}''',
'BatchWriteItem': '''{
"RequestItems": {
"''' + test_table_s.name + '''": [
{"PutRequest":
{"Item": {"p": {"S": "''' + p + '''"}, "x": %s}}}
]}}'''
}
payload = payloads[op]
# As a sanity check, check the value {"S": "hello"} works:
manual_request(dynamodb, op, payload % '{"S": "hello"}')
assert test_table_s.get_item(Key={'p': p}, ConsistentRead=True)['Item']['x'] == 'hello'
# The value {"dog": "cat"} is malformed. Because Alternator wants to
# optimize how it stores certain types, it checks the supposed type
# of this value and sees "dog" is not a known type and fails the write.
with pytest.raises(ManualRequestError, match='ValidationException'):
manual_request(dynamodb, op, payload % '{"dog": "cat"}')
# The value {"N": 3} is also malformed - "N" is a good type ("N") but
# the right serialization for the number 3 is {"N": "3"} (with the
# string "3") not {"N": 3}. DynamoDB generates a SerializationException
# error here, Alternator a ValidationException.
# I consider the difference to be not important - the important thing
# is that the write fails and doesn't save a malformed value on disk.
with pytest.raises(ManualRequestError, match='ValidationException|SerializationException'):
manual_request(dynamodb, op, payload % '{"N": 3}')
# If the value is a map (type "M"), Alternator doesn't attempt to further
# optimize its storage, and as issue #8070 noted, just stored the value
# as-is, as JSON, so it missed the need to validate the content of that
# JSON - and failed to find the malformed value.
with pytest.raises(ManualRequestError, match='ValidationException|SerializationException'):
manual_request(dynamodb, op, payload % '{"M": {"dog": "cat"}}')
# If PutItem didn't fail, and wrote the malformed map, GetItem
# will return this broken map and boto3's attempt to parse the
# returned map will fail, causing the following call to fail.
test_table_s.get_item(Key={'p': p}, ConsistentRead=True)

View File

@@ -33,9 +33,8 @@ import pytest
import requests
from botocore.exceptions import ClientError
from test.alternator.test_manual_requests import get_signed_request
from test.alternator.test_cql_rbac import new_dynamodb, new_role
from test.alternator.util import random_string, new_test_table, is_aws, scylla_config_read, scylla_config_temporary
from test.alternator.util import random_string, new_test_table, is_aws, scylla_config_read, scylla_config_temporary, get_signed_request
# Fixture for checking if we are able to test Scylla metrics. Scylla metrics
# are not available on AWS (of course), but may also not be available for

View File

@@ -14,7 +14,7 @@ import pytest
from boto3.dynamodb.types import TypeDeserializer
from botocore.exceptions import ClientError
from test.alternator.util import is_aws, scylla_config_temporary, unique_table_name, create_test_table, new_test_table, random_string, full_scan, freeze, list_tables, get_region
from test.alternator.util import is_aws, scylla_config_temporary, unique_table_name, create_test_table, new_test_table, random_string, full_scan, freeze, list_tables, get_region, manual_request
# All tests in this file are expected to fail with tablets due to #23838.
# To ensure that Alternator Streams is still being tested, instead of
@@ -1023,6 +1023,47 @@ def test_streams_updateitem_identical(test_table_ss_keys_only, test_table_ss_new
do_test(test_table_ss_old_image, dynamodb, dynamodbstreams, do_updates, 'OLD_IMAGE')
do_test(test_table_ss_new_and_old_images, dynamodb, dynamodbstreams, do_updates, 'NEW_AND_OLD_IMAGES')
# The above test_streams_updateitem_identical tested that if we UpdateItem an
# item changing an attribute to a value identical to the one it already has,
# no event is generated on the stream. In this test we change an attribute
# value from one map to another which are really the same value, but have
# different serialization (the map's elements are ordered differently).
# Since the value is nevertheless the same, here too we expect not to see
# a change event in the stream. Reproduces issue #27375.
@pytest.mark.xfail(reason="issue #27375")
def test_streams_updateitem_equal_but_not_identical(test_table_ss_keys_only, test_table_ss_new_image, test_table_ss_old_image, test_table_ss_new_and_old_images, dynamodb, dynamodbstreams):
def do_updates(table, p, c):
events = []
# We use manual_request() to let us be sure that we pass the JSON
# values to the server in exactly the order we want to, without the
# Python SDK possibly rearranging them.
# Set x to be a map: {'dog': 1, 'cat': 2, 'mouse': 3}.
payload = '''{
"TableName": "''' + table.name + '''",
"Key": {"p": {"S": "''' + p + '''"}, "c": {"S": "''' + c + '''"}},
"UpdateExpression": "SET x = :x",
"ExpressionAttributeValues": {":x": %s}
}'''
manual_request(dynamodb, 'UpdateItem',
payload % '{"M": {"dog": {"N": "1"}, "cat": {"N": "2"}, "mouse": {"N": "3"}}}')
events.append(['INSERT', {'p': p, 'c': c}, None, {'p': p, 'c': c, 'x': {'dog': 1, 'cat': 2, 'mouse': 3}}])
# Overwriting x an identical map item shouldn't produce any events,
# so we won't add anything to the "events" list:
manual_request(dynamodb, 'UpdateItem',
payload % '{"M": {"dog": {"N": "1"}, "cat": {"N": "2"}, "mouse": {"N": "3"}}}')
# Now try to overwrite x with a map that has the same elements in
# a different order. These two values, despite not being identical
# in JSON form, should be considered equal and again no event should
# be generated.
manual_request(dynamodb, 'UpdateItem',
payload % '{"M": {"cat": {"N": "2"}, "dog": {"N": "1"}, "mouse": {"N": "3"}}}')
return events
with scylla_config_temporary(dynamodb, 'alternator_streams_increased_compatibility', 'true', nop=is_aws(dynamodb)):
do_test(test_table_ss_keys_only, dynamodb, dynamodbstreams, do_updates, 'KEYS_ONLY')
do_test(test_table_ss_new_image, dynamodb, dynamodbstreams, do_updates, 'NEW_IMAGE')
do_test(test_table_ss_old_image, dynamodb, dynamodbstreams, do_updates, 'OLD_IMAGE')
do_test(test_table_ss_new_and_old_images, dynamodb, dynamodbstreams, do_updates, 'NEW_AND_OLD_IMAGES')
# Tests that deleting a missing attribute with UpdateItem doesn't generate a
# REMOVE event. Other cases are tested in test_streams_batch_delete_missing
# and in tests based on do_updates_1. Reproduces #6918.

View File

@@ -10,6 +10,7 @@ import collections
import time
import re
import requests
import json
import pytest
from contextlib import contextmanager
from botocore.hooks import HierarchicalEmitter
@@ -364,3 +365,54 @@ def scylla_config_temporary(dynamodb, name, value, nop = False):
yield
finally:
scylla_config_write(dynamodb, name, original_value)
# manual_request() can be used to send a DynamoDB API request without any
# boto3 involvement in preparing the request - the operation name and
# operation payload (a JSON string) are created by the caller. Use this
# function sparingly - most tests should use boto3's resource API or when
# needed, client_no_transform(). Although manual_request() does give the test
# more control, not using it allows the test to check the natural requests
# sent by a real-life SDK.
def manual_request(dynamodb, op, payload):
req = get_signed_request(dynamodb, op, payload)
res = requests.post(req.url, headers=req.headers, data=req.body, verify=False)
if res.status_code == 200:
return json.loads(res.text)
else:
err = json.loads(res.text)
error_code = res.status_code
error_type = err['__type'].split('#')[1]
# Normally, DynamoDB uses lowercase 'message', but in some cases
# it uses 'Message', and for some types of error it may be missing
# entirely (we'll return an empty string for that).
message = err.get('message', err.get('Message', ''))
raise ManualRequestError(error_code, error_type, message)
class ManualRequestError(Exception):
def __init__(self, error_code, error_type, message):
super().__init__(message) # message is the main exception text
self.code = error_code
self.type = error_type
self.message = message
def __str__(self):
return f'{self.code} {self.type} {self.message}'
__repr__ = __str__
def get_signed_request(dynamodb, op, payload):
# Usually "payload" will be a Python string and we'll write it as UTF-8.
# but in some tests we may want to write bytes directly - potentially
# bytes which include invalid UTF-8.
payload_bytes = payload if isinstance(payload, bytes) else payload.encode(encoding='UTF-8')
# NOTE: Signing routines use boto3 implementation details and may be prone
# to unexpected changes
class Request:
url=dynamodb.meta.client._endpoint.host
headers={'X-Amz-Target': 'DynamoDB_20120810.' + op, 'Content-Type': 'application/x-amz-json-1.0'}
body=payload_bytes
method='POST'
context={}
params={}
req = Request()
signer = dynamodb.meta.client._request_signer
signer.get_auth(signer.signing_name, signer.region_name).add_auth(request=req)
return req

View File

@@ -12,7 +12,7 @@ add_scylla_test(alternator_unit_test
add_scylla_test(anchorless_list_test
KIND BOOST)
add_scylla_test(auth_passwords_test
KIND BOOST
KIND SEASTAR
LIBRARIES auth)
add_scylla_test(auth_resource_test
KIND BOOST)

View File

@@ -6,7 +6,7 @@
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
*/
#define BOOST_TEST_MODULE core
#include <seastar/testing/test_case.hh>
#include <array>
#include <random>
@@ -16,15 +16,21 @@
#include <boost/test/unit_test.hpp>
#include <seastar/core/sstring.hh>
#include <seastar/core/coroutine.hh>
#include "seastarx.hh"
extern "C" {
#include <crypt.h>
#include <unistd.h>
}
static auto rng_for_salt = std::default_random_engine(std::random_device{}());
//
// The same password hashed multiple times will result in different strings because the salt will be different.
//
BOOST_AUTO_TEST_CASE(passwords_are_salted) {
SEASTAR_TEST_CASE(passwords_are_salted) {
const char* const cleartext = "my_excellent_password";
std::unordered_set<sstring> observed_passwords{};
@@ -33,12 +39,13 @@ BOOST_AUTO_TEST_CASE(passwords_are_salted) {
BOOST_REQUIRE(!observed_passwords.contains(e));
observed_passwords.insert(e);
}
co_return;
}
//
// A hashed password will authenticate against the same password in cleartext.
//
BOOST_AUTO_TEST_CASE(correct_passwords_authenticate) {
SEASTAR_TEST_CASE(correct_passwords_authenticate) {
// Common passwords.
std::array<const char*, 3> passwords{
"12345",
@@ -47,14 +54,85 @@ BOOST_AUTO_TEST_CASE(correct_passwords_authenticate) {
};
for (const char* p : passwords) {
BOOST_REQUIRE(auth::passwords::check(p, auth::passwords::hash(p, rng_for_salt, auth::passwords::scheme::sha_512)));
BOOST_REQUIRE(co_await auth::passwords::check(p, auth::passwords::hash(p, rng_for_salt, auth::passwords::scheme::sha_512)));
}
}
std::string long_password(uint32_t len) {
std::string out;
auto pattern = "0123456789";
for (uint32_t i = 0; i < len; ++i) {
out.push_back(pattern[i % strlen(pattern)]);
}
return out;
}
SEASTAR_TEST_CASE(same_hashes_as_crypt_h) {
std::string long_pwd_254 = long_password(254);
std::string long_pwd_255 = long_password(255);
std::string long_pwd_511 = long_password(511);
std::array<const char*, 8> passwords{
"12345",
"1_am_the_greatest!",
"password1",
// Some special characters
"!@#$%^&*()_+-=[]{}|\n;:'\",.<>/?",
// UTF-8 characters
"こんにちは、世界!",
// Passwords close to __crypt_sha512 length limit
long_pwd_254.c_str(),
long_pwd_255.c_str(),
// Password of maximal accepted length
long_pwd_511.c_str(),
};
auto salt = "$6$aaaabbbbccccdddd";
for (const char* p : passwords) {
auto res = co_await auth::passwords::detail::hash_with_salt_async(p, salt);
BOOST_REQUIRE(res == auth::passwords::detail::hash_with_salt(p, salt));
}
}
SEASTAR_TEST_CASE(too_long_password) {
auto p1 = long_password(71);
auto p2 = long_password(72);
auto p3 = long_password(73);
auto too_long_password = long_password(512);
auto salt_bcrypt = "$2a$05$mAyzaIeJu41dWUkxEbn8hO";
auto h1_bcrypt = co_await auth::passwords::detail::hash_with_salt_async(p1, salt_bcrypt);
auto h2_bcrypt = co_await auth::passwords::detail::hash_with_salt_async(p2, salt_bcrypt);
auto h3_bcrypt = co_await auth::passwords::detail::hash_with_salt_async(p3, salt_bcrypt);
BOOST_REQUIRE(h1_bcrypt != h2_bcrypt);
// The check below documents the behavior of the current bcrypt
// implementation that compares only the first 72 bytes of the password.
// Although we don't typically use bcrypt for password hashing, it is
// possible to insert such a hash using`CREATE ROLE ... WITH HASHED PASSWORD ...`.
// Refs: scylladb/scylladb#26842
BOOST_REQUIRE(h2_bcrypt == h3_bcrypt);
// The current implementation of bcrypt password hasing fails with passwords of length 512 and above
BOOST_CHECK_THROW(co_await auth::passwords::detail::hash_with_salt_async(too_long_password, salt_bcrypt), std::system_error);
auto salt_sha512 = "$6$aaaabbbbccccdddd";
auto h1_sha512 = co_await auth::passwords::detail::hash_with_salt_async(p1, salt_sha512);
auto h2_sha512 = co_await auth::passwords::detail::hash_with_salt_async(p2, salt_sha512);
auto h3_sha512 = co_await auth::passwords::detail::hash_with_salt_async(p3, salt_sha512);
BOOST_REQUIRE(h1_sha512 != h2_sha512);
BOOST_REQUIRE(h2_sha512 != h3_sha512);
// The current implementation of SHA-512 password hasing fails with passwords of length 512 and above
BOOST_CHECK_THROW(co_await auth::passwords::detail::hash_with_salt_async(too_long_password, salt_sha512), std::system_error);
}
//
// A hashed password that does not match the password in cleartext does not authenticate.
//
BOOST_AUTO_TEST_CASE(incorrect_passwords_do_not_authenticate) {
SEASTAR_TEST_CASE(incorrect_passwords_do_not_authenticate) {
const sstring hashed_password = auth::passwords::hash("actual_password", rng_for_salt,auth::passwords::scheme::sha_512);
BOOST_REQUIRE(!auth::passwords::check("password_guess", hashed_password));
BOOST_REQUIRE(!co_await auth::passwords::check("password_guess", hashed_password));
}

View File

@@ -12,16 +12,25 @@
#undef SEASTAR_TESTING_MAIN
#include <seastar/testing/test_case.hh>
#include "test/lib/cql_assertions.hh"
#include "test/lib/cql_test_env.hh"
#include "test/lib/error_injection.hh"
#include "test/lib/log.hh"
#include <seastar/core/future-util.hh>
#include <seastar/core/shared_ptr.hh>
#include "cql3/statements/batch_statement.hh"
#include "cql3/query_processor.hh"
#include "cql3/untyped_result_set.hh"
#include "db/batchlog.hh"
#include "db/batchlog_manager.hh"
#include "db/commitlog/commitlog.hh"
#include "db/config.hh"
#include "idl/frozen_schema.dist.hh"
#include "idl/frozen_schema.dist.impl.hh"
#include "message/messaging_service.hh"
#include "service/storage_proxy.hh"
#include "utils/rjson.hh"
BOOST_AUTO_TEST_SUITE(batchlog_manager_test)
@@ -37,6 +46,7 @@ SEASTAR_TEST_CASE(test_execute_batch) {
return e.execute_cql("create table cf (p1 varchar, c1 int, r1 int, PRIMARY KEY (p1, c1));").discard_result().then([&qp, &e, &bp] () mutable {
auto& db = e.local_db();
auto s = db.find_schema("ks", "cf");
const auto batchlog_schema = db.find_schema(db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2);
const column_definition& r1_col = *s->get_column_definition("r1");
auto key = partition_key::from_exploded(*s, {to_bytes("key1")});
@@ -48,7 +58,7 @@ SEASTAR_TEST_CASE(test_execute_batch) {
using namespace std::chrono_literals;
auto version = netw::messaging_service::current_version;
auto bm = qp.proxy().get_batchlog_mutation_for({ m }, s->id().uuid(), version, db_clock::now() - db_clock::duration(3h));
auto bm = db::get_batchlog_mutation_for(batchlog_schema, { m }, version, db_clock::now() - db_clock::duration(3h), s->id().uuid());
return qp.proxy().mutate_locally(bm, tracing::trace_state_ptr(), db::commitlog::force_sync::no).then([&bp] () mutable {
return bp.count_all_batches().then([](auto n) {
@@ -67,4 +77,719 @@ SEASTAR_TEST_CASE(test_execute_batch) {
});
}
namespace {
struct fragment {
position_in_partition pos;
bool is_range_tombstone;
bool is_live;
fragment(position_in_partition p, bool is_rt, bool is_live)
: pos(std::move(p)), is_range_tombstone(is_rt), is_live(is_live)
{}
class less_comparator {
schema_ptr _s;
public:
explicit less_comparator(schema_ptr s) : _s(std::move(s)) {}
bool operator()(const fragment& a, const fragment& b) const {
position_in_partition::less_compare cmp{*_s};
return cmp(a.pos, b.pos);
}
};
};
struct batchlog_as_fragments {
using fragment_set = std::set<fragment, fragment::less_comparator>;
std::unordered_map<int32_t, fragment_set> initial_fragments_per_shard;
std::unordered_map<int32_t, fragment_set> failed_replay_fragments_per_shard;
};
position_in_partition parse_batchlog_position(const schema& s, const cql3::untyped_result_set::row& row) {
std::vector<managed_bytes> ck_components;
ck_components.reserve(2);
for (const auto ck_component : {"written_at", "id"}) {
if (!row.has(ck_component) || row.get_blob_fragmented(ck_component).empty()) {
// ck is prefix
break;
}
ck_components.push_back(row.get_blob_fragmented(ck_component));
}
return position_in_partition(
static_cast<partition_region>(row.get_as<int8_t>("partition_region")),
static_cast<bound_weight>(row.get_as<int8_t>("position_weight")),
clustering_key_prefix::from_exploded(s, ck_components));
}
batchlog_as_fragments extract_batchlog_fragments(const cql3::untyped_result_set& fragment_results, schema_ptr batchlog_v2_schema) {
auto is_live = [] (std::optional<sstring> value) {
// no value can be stored as null or empty json object ("{}")
return value && value->size() > 2;
};
batchlog_as_fragments result;
for (const auto& row : fragment_results) {
const auto stage = static_cast<db::batchlog_stage>(row.get_as<int8_t>("stage"));
const auto shard = row.get_as<int32_t>("shard");
auto& fragments_per_shard = stage == db::batchlog_stage::initial
? result.initial_fragments_per_shard
: result.failed_replay_fragments_per_shard;
const auto is_rtc = row.get_as<sstring>("mutation_fragment_kind") == "range tombstone change";
auto pos = parse_batchlog_position(*batchlog_v2_schema, row);
BOOST_REQUIRE_EQUAL(is_rtc, (pos.get_bound_weight() != bound_weight::equal));
testlog.info("[stage {}, bathlog shard {}] fragment: pos={}, kind={}, live={}", int8_t(stage), shard, pos, row.get_as<sstring>("mutation_fragment_kind"), is_live(row.get_opt<sstring>("value")));
auto& fragments = fragments_per_shard.try_emplace(shard, batchlog_as_fragments::fragment_set(fragment::less_comparator(batchlog_v2_schema))).first->second;
fragments.emplace(pos, is_rtc, !is_rtc && is_live(row.get_opt<sstring>("value")));
}
return result;
}
void check_range_tombstone_start(const fragment& f) {
BOOST_REQUIRE(f.is_range_tombstone);
BOOST_REQUIRE(f.pos.region() == partition_region::clustered);
BOOST_REQUIRE(f.pos.has_key());
BOOST_REQUIRE(f.pos.key().explode().empty());
BOOST_REQUIRE(f.pos.get_bound_weight() == bound_weight::before_all_prefixed);
}
void check_range_tombstone_end(const fragment& f, std::optional<bound_weight> bound_weight_opt = {}) {
BOOST_REQUIRE(f.is_range_tombstone);
BOOST_REQUIRE(f.pos.region() == partition_region::clustered);
BOOST_REQUIRE(f.pos.has_key());
BOOST_REQUIRE_EQUAL(f.pos.key().explode().size(), 1);
if (bound_weight_opt) {
BOOST_REQUIRE(f.pos.get_bound_weight() == *bound_weight_opt);
}
}
} // anonymous namespace
future<> run_batchlog_cleanup_with_failed_batches_test(bool replay_fails, db::batchlog_manager::post_replay_cleanup cleanup) {
#ifndef SCYLLA_ENABLE_ERROR_INJECTION
return make_ready_future<>();
#endif
cql_test_config cfg;
cfg.db_config->batchlog_replay_cleanup_after_replays.set_value("9999999", utils::config_file::config_source::Internal);
cfg.batchlog_replay_timeout = 0s;
cfg.batchlog_delay = 9999h;
return do_with_cql_env_thread([=] (cql_test_env& env) -> void {
auto& bm = env.batchlog_manager().local();
env.execute_cql("CREATE TABLE tbl (pk bigint PRIMARY KEY, v text)").get();
const uint64_t batch_count = 8;
uint64_t failed_batches = 0;
for (uint64_t i = 0; i != batch_count; ++i) {
std::vector<sstring> queries;
std::vector<std::string_view> query_views;
for (uint64_t j = 0; j != i+2; ++j) {
queries.emplace_back(format("INSERT INTO tbl (pk, v) VALUES ({}, 'value');", j));
query_views.emplace_back(queries.back());
}
const bool fail = i % 2;
bool injected_exception_thrown = false;
std::optional<scoped_error_injection> error_injection;
if (fail) {
++failed_batches;
error_injection.emplace("storage_proxy_fail_send_batch");
}
try {
env.execute_batch(
query_views,
cql3::statements::batch_statement::type::LOGGED,
std::make_unique<cql3::query_options>(db::consistency_level::ONE, std::vector<cql3::raw_value>())).get();
} catch (std::runtime_error& ex) {
if (fail) {
BOOST_REQUIRE_EQUAL(std::string(ex.what()), "Error injection: failing to send batch");
injected_exception_thrown = true;
} else {
throw;
}
}
BOOST_REQUIRE_EQUAL(injected_exception_thrown, fail);
}
const auto fragments_query = format("SELECT * FROM MUTATION_FRAGMENTS({}.{}) WHERE partition_region = 2 ALLOW FILTERING", db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2);
assert_that(env.execute_cql(format("SELECT id FROM {}.{}", db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2)).get())
.is_rows()
.with_size(failed_batches);
assert_that(env.execute_cql(fragments_query).get())
.is_rows()
.with_size(batch_count)
.assert_for_columns_of_each_row([&] (columns_assertions& columns) {
columns.with_typed_column<sstring>("mutation_source", "memtable:0");
});
std::optional<scoped_error_injection> error_injection;
if (replay_fails) {
error_injection.emplace("storage_proxy_fail_replay_batch");
}
bm.do_batch_log_replay(cleanup).get();
assert_that(env.execute_cql(format("SELECT id FROM {}.{}", db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2)).get())
.is_rows()
.with_size(replay_fails ? failed_batches : 0);
const auto fragment_results = cql3::untyped_result_set(env.execute_cql(fragments_query).get());
const auto batchlog_v2_schema = env.local_db().find_schema(db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2);
const auto [initial_fragments_per_shard, failed_replay_fragments_per_shard] = extract_batchlog_fragments(
cql3::untyped_result_set(env.execute_cql(fragments_query).get()), batchlog_v2_schema);
if (cleanup) {
size_t initial_rows{};
for (const auto& [batchlog_shard, batchlog_shard_fragments] : initial_fragments_per_shard) {
// some batchlog shards can be empty, just ignore
if (batchlog_shard_fragments.empty()) {
continue;
}
testlog.info("Checking fragment in initial stage and batchlog shard {}", batchlog_shard);
size_t rts{}, rows{};
for (const auto& fragment : batchlog_shard_fragments) {
rts += fragment.is_range_tombstone;
rows += !fragment.is_range_tombstone;
BOOST_REQUIRE(!fragment.is_live);
}
// cleanup affects only batchlog shards which contributed to replay
if (rts) {
BOOST_REQUIRE_EQUAL(rts, 2);
check_range_tombstone_start(*batchlog_shard_fragments.begin());
check_range_tombstone_end(*batchlog_shard_fragments.rbegin(), bound_weight::after_all_prefixed);
}
initial_rows += rows;
}
// some of the initial fragments could have been garbage collected after cleanup (shadowed by range tombstone)
// this happens in the background so we can have up to total_batches rows
BOOST_REQUIRE_LE(initial_rows, batch_count);
if (replay_fails) {
size_t failed_replay_rows{};
for (const auto& [batchlog_shard, batchlog_shard_fragments] : failed_replay_fragments_per_shard) {
for (const auto& fragment : batchlog_shard_fragments) {
BOOST_REQUIRE(!fragment.is_range_tombstone);
BOOST_REQUIRE(fragment.is_live);
++failed_replay_rows;
}
}
BOOST_REQUIRE_EQUAL(failed_replay_rows, failed_batches);
} else {
BOOST_REQUIRE(failed_replay_fragments_per_shard.empty());
}
} else {
size_t live{}, dead{};
BOOST_REQUIRE(failed_replay_fragments_per_shard.empty());
for (const auto& [batchlog_shard, batchlog_shard_fragments] : initial_fragments_per_shard) {
for (const auto& fragment : batchlog_shard_fragments) {
BOOST_REQUIRE(!fragment.is_range_tombstone);
if (fragment.is_live) {
++live;
} else {
++dead;
}
}
}
const auto total = live + dead;
BOOST_REQUIRE_EQUAL(total, batch_count);
if (replay_fails) {
BOOST_REQUIRE_EQUAL(live, failed_batches);
} else {
BOOST_REQUIRE_EQUAL(dead, total);
}
}
}, cfg);
}
SEASTAR_TEST_CASE(test_batchlog_replay_fails_no_cleanup) {
return run_batchlog_cleanup_with_failed_batches_test(true, db::batchlog_manager::post_replay_cleanup::no);
}
SEASTAR_TEST_CASE(test_batchlog_replay_fails_with_cleanup) {
return run_batchlog_cleanup_with_failed_batches_test(true, db::batchlog_manager::post_replay_cleanup::yes);
}
SEASTAR_TEST_CASE(test_batchlog_replay_no_cleanup) {
return run_batchlog_cleanup_with_failed_batches_test(false, db::batchlog_manager::post_replay_cleanup::no);
}
SEASTAR_TEST_CASE(test_batchlog_replay_with_cleanup) {
return run_batchlog_cleanup_with_failed_batches_test(false, db::batchlog_manager::post_replay_cleanup::yes);
}
SEASTAR_TEST_CASE(test_batchlog_replay_stage) {
#ifndef SCYLLA_ENABLE_ERROR_INJECTION
return make_ready_future<>();
#endif
cql_test_config cfg;
cfg.db_config->batchlog_replay_cleanup_after_replays.set_value("9999999", utils::config_file::config_source::Internal);
cfg.batchlog_replay_timeout = 0s;
cfg.batchlog_delay = 9999h;
return do_with_cql_env_thread([] (cql_test_env& env) -> void {
auto& bm = env.batchlog_manager().local();
env.execute_cql("CREATE TABLE tbl (pk bigint PRIMARY KEY, v text)").get();
const uint64_t batch_count = 8;
const auto shard_count = 256;
{
scoped_error_injection error_injection("storage_proxy_fail_send_batch");
for (uint64_t i = 0; i != batch_count; ++i) {
std::vector<sstring> queries;
std::vector<std::string_view> query_views;
for (uint64_t j = 0; j != i+2; ++j) {
queries.emplace_back(format("INSERT INTO tbl (pk, v) VALUES ({}, 'value');", j));
query_views.emplace_back(queries.back());
}
try {
env.execute_batch(
query_views,
cql3::statements::batch_statement::type::LOGGED,
std::make_unique<cql3::query_options>(db::consistency_level::ONE, std::vector<cql3::raw_value>())).get();
} catch (std::runtime_error& ex) {
BOOST_REQUIRE_EQUAL(std::string(ex.what()), "Error injection: failing to send batch");
}
}
}
// Ensure select ... where write_time < write_time_limit (=now) picks up all batches.
sleep(2ms).get();
const auto batchlog_query = format("SELECT * FROM {}.{}", db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2);
const auto fragments_query = format("SELECT * FROM MUTATION_FRAGMENTS({}.{}) WHERE partition_region = 2 ALLOW FILTERING", db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2);
std::set<utils::UUID> ids;
std::set<db_clock::time_point> written_ats;
assert_that(env.execute_cql(batchlog_query).get())
.is_rows()
.with_size(batch_count)
.assert_for_columns_of_each_row([&] (columns_assertions& columns) {
columns.with_typed_column<int32_t>("version", netw::messaging_service::current_version)
.with_typed_column<int8_t>("stage", int8_t(db::batchlog_stage::initial))
.with_typed_column<db_clock::time_point>("written_at", [&] (db_clock::time_point written_at) {
written_ats.insert(written_at);
return true;
})
.with_typed_column<utils::UUID>("id", [&] (utils::UUID id) {
ids.insert(id);
return true;
});
});
BOOST_REQUIRE_EQUAL(ids.size(), batch_count);
BOOST_REQUIRE_LE(written_ats.size(), batch_count);
auto do_replays = [&] (db::batchlog_manager::post_replay_cleanup cleanup) {
for (unsigned i = 0; i < 3; ++i) {
testlog.info("Replay attempt [cleanup={}] #{} - batches should be in failed_replay stage", cleanup, i);
bm.do_batch_log_replay(cleanup).get();
assert_that(env.execute_cql(batchlog_query).get())
.is_rows()
.with_size(batch_count)
.assert_for_columns_of_each_row([&] (columns_assertions& columns) {
columns.with_typed_column<int32_t>("version", netw::messaging_service::current_version)
.with_typed_column<int8_t>("stage", [&] (int8_t stage) {
// (0) cleanup::no == db::batchlog_stage::initial
// (1) cleanup::yes == db::batchlog_stage::failed_replay
return stage == int8_t(bool(cleanup));
})
.with_typed_column<db_clock::time_point>("written_at", [&] (db_clock::time_point written_at) {
return written_ats.contains(written_at);
})
.with_typed_column<utils::UUID>("id", [&] (utils::UUID id) {
return ids.contains(id);
});
});
auto fragment_results = cql3::untyped_result_set(env.execute_cql(fragments_query).get());
if (cleanup) {
// Each shard has a range tombstone (shard_count * 2 range tombstone changes)
// There should be batch_count clustering rows, with stage=1
// There may be [0, batch_count] clustering rows with stage=0
BOOST_REQUIRE_GT(fragment_results.size(), batch_count);
BOOST_REQUIRE_LE(fragment_results.size(), shard_count * 2 + batch_count * 2);
} else {
BOOST_REQUIRE_EQUAL(fragment_results.size(), batch_count); // only clustering rows
}
for (const auto& row : fragment_results) {
if (row.get_as<sstring>("mutation_fragment_kind") == "range tombstone change") {
continue;
}
const auto id = row.get_as<utils::UUID>("id");
const auto stage = row.get_as<int8_t>("stage");
testlog.trace("Processing row for batch id={}, stage={}: ", id, stage);
BOOST_REQUIRE_EQUAL(row.get_as<int32_t>("version"), netw::messaging_service::current_version);
BOOST_REQUIRE(ids.contains(id));
const auto metadata = row.get_as<sstring>("metadata");
auto metadata_json = rjson::parse(metadata);
BOOST_REQUIRE(metadata_json.IsObject());
if (!cleanup || stage == int8_t(db::batchlog_stage::failed_replay)) {
BOOST_REQUIRE_NE(row.get_as<sstring>("value"), "{}");
BOOST_REQUIRE(!metadata_json.HasMember("tombstone"));
const auto value_json = rjson::parse(row.get_as<sstring>("value"));
BOOST_REQUIRE(value_json.IsObject());
BOOST_REQUIRE(value_json.HasMember("data"));
} else if (stage == int8_t(db::batchlog_stage::initial)) {
BOOST_REQUIRE_EQUAL(row.get_as<sstring>("value"), "{}"); // row should be dead -- data column shadowed by tombstone
if (!cleanup) {
BOOST_REQUIRE(metadata_json.HasMember("tombstone"));
}
} else {
BOOST_FAIL(format("Unexpected stage: {}", stage));
}
}
}
};
{
scoped_error_injection error_injection("storage_proxy_fail_replay_batch");
do_replays(db::batchlog_manager::post_replay_cleanup::no);
do_replays(db::batchlog_manager::post_replay_cleanup::yes);
}
testlog.info("Successful replay - should remove all batches");
bm.do_batch_log_replay(db::batchlog_manager::post_replay_cleanup::no).get();
assert_that(env.execute_cql(batchlog_query).get())
.is_rows()
.is_empty();
const auto fragment_results = cql3::untyped_result_set(env.execute_cql(fragments_query).get());
// Each shard can have a range tombstone (shard_count * 2 range tombstone changes)
// There should be batch_count clustering rows, with stage=1
// There may be [0, batch_count] clustering rows with stage=0
BOOST_REQUIRE_GT(fragment_results.size(), batch_count);
BOOST_REQUIRE_LE(fragment_results.size(), shard_count * 2 + batch_count * 2);
for (const auto& row : fragment_results) {
if (row.get_as<sstring>("mutation_fragment_kind") == "range tombstone change") {
BOOST_REQUIRE_EQUAL(row.get_as<int8_t>("stage"), int8_t(db::batchlog_stage::initial));
continue;
}
BOOST_REQUIRE_EQUAL(row.get_as<int32_t>("version"), netw::messaging_service::current_version);
BOOST_REQUIRE(written_ats.contains(row.get_as<db_clock::time_point>("written_at")));
BOOST_REQUIRE(ids.contains(row.get_as<utils::UUID>("id")));
BOOST_REQUIRE_EQUAL(row.get_as<sstring>("value"), "{}");
}
}, cfg);
}
SEASTAR_TEST_CASE(test_batchlog_migrate_v1_v2) {
#ifndef SCYLLA_ENABLE_ERROR_INJECTION
return make_ready_future<>();
#endif
const auto batch_replay_timeout = 1s;
cql_test_config cfg;
cfg.db_config->batchlog_replay_cleanup_after_replays.set_value("9999999", utils::config_file::config_source::Internal);
cfg.batchlog_replay_timeout = batch_replay_timeout;
cfg.batchlog_delay = 9999h;
return do_with_cql_env_thread([batch_replay_timeout] (cql_test_env& env) -> void {
auto& bm = env.batchlog_manager().local();
env.execute_cql("CREATE TABLE tbl (pk bigint PRIMARY KEY, v text)").get();
auto& sp = env.get_storage_proxy().local();
auto& db = env.local_db();
auto& tbl = db.find_column_family("ks", "tbl");
auto tbl_schema = tbl.schema();
auto cdef_tbl_v = tbl_schema->get_column_definition(to_bytes("v"));
auto batchlog_v1_schema = db.find_schema(db::system_keyspace::NAME, db::system_keyspace::BATCHLOG);
auto cdef_batchlog_v1_data = batchlog_v1_schema->get_column_definition(to_bytes("data"));
const uint64_t batch_count = 8;
struct batchlog {
utils::UUID id;
int32_t version;
db_clock::time_point written_at;
managed_bytes data;
};
std::map<utils::UUID, const batchlog> batchlogs;
for (int64_t i = 0; i != batch_count; ++i) {
bytes_ostream batchlog_data;
for (int64_t j = 0; j != i+2; ++j) {
auto key = partition_key::from_single_value(*tbl_schema, serialized(j));
mutation m(tbl_schema, key);
m.set_clustered_cell(clustering_key::make_empty(), *cdef_tbl_v, make_atomic_cell(utf8_type, serialized("value")));
ser::serialize(batchlog_data, canonical_mutation(m));
}
const auto id = utils::UUID_gen::get_time_UUID();
auto [it, _] = batchlogs.emplace(id, batchlog{
.id = id,
.version = netw::messaging_service::current_version,
.written_at = db_clock::now() - batch_replay_timeout * 10,
.data = std::move(batchlog_data).to_managed_bytes()});
auto& batch = it->second;
const auto timestamp = api::new_timestamp();
mutation m(batchlog_v1_schema, partition_key::from_single_value(*batchlog_v1_schema, serialized(batch.id)));
m.set_cell(clustering_key_prefix::make_empty(), to_bytes("version"), batch.version, timestamp);
m.set_cell(clustering_key_prefix::make_empty(), to_bytes("written_at"), batch.written_at, timestamp);
m.set_cell(clustering_key_prefix::make_empty(), *cdef_batchlog_v1_data, atomic_cell::make_live(*cdef_batchlog_v1_data->type, timestamp, std::move(batch.data)));
sp.mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no).get();
}
const auto batchlog_v1_query = format("SELECT * FROM {}.{}", db::system_keyspace::NAME, db::system_keyspace::BATCHLOG);
const auto batchlog_v2_query = format("SELECT * FROM {}.{}", db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2);
// Initial state, all entries are in the v1 table.
assert_that(env.execute_cql(batchlog_v1_query).get())
.is_rows()
.with_size(batch_count);
assert_that(env.execute_cql(batchlog_v2_query).get())
.is_rows()
.is_empty();
{
scoped_error_injection error_injection("batchlog_manager_fail_migration");
testlog.info("First replay - migration should fail, all entries stay in v1 table");
bm.do_batch_log_replay(db::batchlog_manager::post_replay_cleanup::no).get();
}
assert_that(env.execute_cql(batchlog_v1_query).get())
.is_rows()
.with_size(batch_count);
assert_that(env.execute_cql(batchlog_v2_query).get())
.is_rows()
.is_empty();
{
scoped_error_injection error_injection("storage_proxy_fail_replay_batch");
testlog.info("Second replay - migration should run again and succeed, but replay of migrated entries should fail, so they should remain in the v2 table.");
bm.do_batch_log_replay(db::batchlog_manager::post_replay_cleanup::no).get();
}
assert_that(env.execute_cql(batchlog_v1_query).get())
.is_rows()
.is_empty();
auto results = cql3::untyped_result_set(env.execute_cql(batchlog_v2_query).get());
BOOST_REQUIRE_EQUAL(results.size(), batch_count);
std::set<utils::UUID> migrated_batchlog_ids;
for (const auto& row : results) {
BOOST_REQUIRE_EQUAL(row.get_as<int8_t>("stage"), int8_t(db::batchlog_stage::failed_replay));
const auto id = row.get_as<utils::UUID>("id");
auto it = batchlogs.find(id);
BOOST_REQUIRE(it != batchlogs.end());
const auto& batch = it->second;
BOOST_REQUIRE_EQUAL(row.get_as<int32_t>("version"), batch.version);
BOOST_REQUIRE(row.get_as<db_clock::time_point>("written_at") == batch.written_at);
BOOST_REQUIRE_EQUAL(row.get_blob_fragmented("data"), batch.data);
migrated_batchlog_ids.emplace(id);
}
BOOST_REQUIRE_EQUAL(batchlogs.size(), migrated_batchlog_ids.size());
testlog.info("Third replay - migration is already done, replay of migrated entries should succeed, v2 table should be empty afterwards.");
bm.do_batch_log_replay(db::batchlog_manager::post_replay_cleanup::no).get();
assert_that(env.execute_cql(batchlog_v2_query).get())
.is_rows()
.is_empty();
}, cfg);
}
SEASTAR_TEST_CASE(test_batchlog_replay_write_time) {
#ifndef SCYLLA_ENABLE_ERROR_INJECTION
return make_ready_future<>();
#endif
cql_test_config cfg;
cfg.db_config->batchlog_replay_cleanup_after_replays.set_value("9999999", utils::config_file::config_source::Internal);
cfg.batchlog_replay_timeout = 1h;
cfg.batchlog_delay = 9999h;
return do_with_cql_env_thread([ks = get_name()] (cql_test_env& env) -> void {
auto& bm = env.batchlog_manager().local();
env.execute_cql(format("CREATE KEYSPACE {} WITH replication = {{'class': 'SimpleStrategy', 'replication_factor': 2}} AND tablets = {{'enabled': 'false'}}", ks)).get();
env.execute_cql(format("CREATE TABLE {}.tbl1 (pk bigint PRIMARY KEY, v text) WITH tombstone_gc = {{'mode': 'repair', 'propagation_delay_in_seconds': 1}}", ks)).get();
env.execute_cql(format("CREATE TABLE {}.tbl2 (pk bigint PRIMARY KEY, v text) WITH tombstone_gc = {{'mode': 'timeout'}}", ks)).get();
const uint64_t batch_count = 8;
const uint64_t mutations_per_batch = 2;
{
scoped_error_injection error_injection("storage_proxy_fail_send_batch");
for (uint64_t i = 0; i != batch_count; ++i) {
std::vector<sstring> queries;
std::vector<std::string_view> query_views;
if (i % 2) {
for (const auto& tbl_name : {"tbl1", "tbl2"}) {
queries.emplace_back(format("INSERT INTO {}.{} (pk, v) VALUES (0, 'value');", ks, tbl_name, i));
query_views.emplace_back(queries.back());
}
} else {
for (uint64_t j = 0; j != mutations_per_batch; ++j) {
queries.emplace_back(format("INSERT INTO {}.tbl2 (pk, v) VALUES ({}, 'value');", ks, i * 2 + j));
query_views.emplace_back(queries.back());
}
}
try {
env.execute_batch(
query_views,
cql3::statements::batch_statement::type::LOGGED,
std::make_unique<cql3::query_options>(db::consistency_level::ONE, std::vector<cql3::raw_value>())).get();
} catch (std::runtime_error& ex) {
BOOST_REQUIRE_EQUAL(std::string(ex.what()), "Error injection: failing to send batch");
}
}
}
const auto batchlog_query = format("SELECT * FROM {}.{}", db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2);
assert_that(env.execute_cql(batchlog_query).get())
.is_rows()
.with_size(batch_count);
auto get_write_attempts = [&] () -> uint64_t {
return env.batchlog_manager().map_reduce0([] (const db::batchlog_manager& bm) {
return bm.stats().write_attempts;
}, uint64_t(0), std::plus<uint64_t>{}).get();
};
BOOST_REQUIRE_EQUAL(get_write_attempts(), 0);
// We need this sleep here to make sure all batches are older than propagation delay of 1s.
sleep(2s).get();
{
scoped_error_injection error_injection("storage_proxy_fail_replay_batch");
bm.do_batch_log_replay(db::batchlog_manager::post_replay_cleanup::no).get();
// Half the batches are skipped due to being too fresh
BOOST_REQUIRE_EQUAL(get_write_attempts(), (batch_count / 2) * mutations_per_batch);
auto results = cql3::untyped_result_set(env.execute_cql(batchlog_query).get());
BOOST_REQUIRE_EQUAL(results.size(), batch_count);
for (const auto& row : results) {
BOOST_REQUIRE_EQUAL(row.get_as<int8_t>("stage"), int8_t(db::batchlog_stage::initial));
}
}
{
scoped_error_injection error_injection("storage_proxy_fail_replay_batch");
bm.do_batch_log_replay(db::batchlog_manager::post_replay_cleanup::yes).get();
// Half the batches are skipped (again) due to being too fresh
BOOST_REQUIRE_EQUAL(get_write_attempts(), batch_count * mutations_per_batch);
auto results = cql3::untyped_result_set(env.execute_cql(batchlog_query).get());
BOOST_REQUIRE_EQUAL(results.size(), batch_count);
size_t initial = 0;
size_t failed_replay = 0;
for (const auto& row : results) {
const auto stage = static_cast<db::batchlog_stage>(row.get_as<int8_t>("stage"));
switch (stage) {
case db::batchlog_stage::initial:
++initial;
break;
case db::batchlog_stage::failed_replay:
++failed_replay;
break;
default:
BOOST_FAIL(format("Unexpected stage: {}", int8_t(stage)));
}
}
BOOST_REQUIRE_EQUAL(initial, batch_count / 2);
BOOST_REQUIRE_EQUAL(failed_replay, batch_count / 2);
}
const auto fragments_query = format("SELECT * FROM MUTATION_FRAGMENTS({}.{}) WHERE partition_region = 2 ALLOW FILTERING", db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2);
const auto batchlog_v2_schema = env.local_db().find_schema(db::system_keyspace::NAME, db::system_keyspace::BATCHLOG_V2);
const auto [initial_fragments_per_shard, failed_replay_fragments_per_shard] = extract_batchlog_fragments(
cql3::untyped_result_set(env.execute_cql(fragments_query).get()), batchlog_v2_schema);
for (const auto& [batchlog_shard, batchlog_shard_fragments] : initial_fragments_per_shard) {
if (batchlog_shard_fragments.empty()) {
continue;
}
testlog.info("Checking fragment in initial stage and batchlog shard {}", batchlog_shard);
position_in_partition::less_compare less_cmp{*batchlog_v2_schema};
size_t rts = 0;
auto min_live_pos = position_in_partition::after_all_clustered_rows();
for (const auto& f : batchlog_shard_fragments) {
if (f.is_range_tombstone) {
++rts;
continue;
}
if (!f.is_live) {
continue;
}
min_live_pos = std::min(min_live_pos, f.pos, less_cmp);
}
BOOST_REQUIRE(rts == 0 || rts == 2);
if (!rts) {
continue;
}
check_range_tombstone_start(*batchlog_shard_fragments.begin());
bool found_range_tombstone_end = false;
for (auto it = std::next(batchlog_shard_fragments.begin()); it != batchlog_shard_fragments.end(); ++it) {
if (it->is_range_tombstone) {
BOOST_REQUIRE(!std::exchange(found_range_tombstone_end, true));
check_range_tombstone_end(*it);
BOOST_REQUIRE(less_cmp(it->pos, min_live_pos));
}
}
}
}, cfg);
}
BOOST_AUTO_TEST_SUITE_END()

View File

@@ -31,7 +31,6 @@
#include "replica/database.hh"
#include "utils/assert.hh"
#include "utils/lister.hh"
#include "utils/rjson.hh"
#include "partition_slice_builder.hh"
#include "mutation/frozen_mutation.hh"
#include "test/lib/mutation_source_test.hh"
@@ -39,7 +38,6 @@
#include "service/migration_manager.hh"
#include "sstables/sstables.hh"
#include "sstables/generation_type.hh"
#include "sstables/sstable_version.hh"
#include "db/config.hh"
#include "db/commitlog/commitlog_replayer.hh"
#include "db/commitlog/commitlog.hh"
@@ -53,7 +51,6 @@
#include "db/system_keyspace.hh"
#include "db/view/view_builder.hh"
#include "replica/mutation_dump.hh"
#include "utils/error_injection.hh"
using namespace std::chrono_literals;
using namespace sstables;
@@ -615,13 +612,13 @@ future<> do_with_some_data(std::vector<sstring> cf_names, std::function<future<>
});
}
future<> take_snapshot(cql_test_env& e, sstring ks_name = "ks", sstring cf_name = "cf", sstring snapshot_name = "test", db::snapshot_options opts = {}) {
future<> take_snapshot(cql_test_env& e, sstring ks_name = "ks", sstring cf_name = "cf", sstring snapshot_name = "test", bool skip_flush = false) {
try {
auto uuid = e.db().local().find_uuid(ks_name, cf_name);
co_await replica::database::snapshot_table_on_all_shards(e.db(), uuid, snapshot_name, opts);
co_await replica::database::snapshot_table_on_all_shards(e.db(), uuid, snapshot_name, skip_flush);
} catch (...) {
testlog.error("Could not take snapshot for {}.{} snapshot_name={} skip_flush={} use_sstable_identifier={}: {}",
ks_name, cf_name, snapshot_name, opts.skip_flush, opts.use_sstable_identifier, std::current_exception());
testlog.error("Could not take snapshot for {}.{} snapshot_name={} skip_flush={}: {}",
ks_name, cf_name, snapshot_name, skip_flush, std::current_exception());
throw;
}
}
@@ -635,37 +632,6 @@ future<std::set<sstring>> collect_files(fs::path path) {
co_return ret;
}
static bool is_component(const sstring& fname, const sstring& suffix) {
return fname.ends_with(suffix);
}
static std::set<sstring> collect_sstables(const std::set<sstring>& all_files, const sstring& suffix) {
// Verify manifest against the files in the snapshots dir
auto pred = [&suffix] (const sstring& fname) {
return is_component(fname, suffix);
};
return std::ranges::filter_view(all_files, pred) | std::ranges::to<std::set<sstring>>();
}
// Validate that the manifest.json lists exactly the SSTables present in the snapshot directory
static future<> validate_manifest(const fs::path& snapshot_dir, const std::set<sstring>& in_snapshot_dir) {
sstring suffix = "-Data.db";
auto sstables_in_snapshot = collect_sstables(in_snapshot_dir, suffix);
std::set<sstring> sstables_in_manifest;
auto manifest_str = co_await util::read_entire_file_contiguous(snapshot_dir / "manifest.json");
auto manifest_json = rjson::parse(manifest_str);
auto& manifest_files = manifest_json["files"];
BOOST_REQUIRE(manifest_files.IsArray());
for (auto& f : manifest_files.GetArray()) {
if (is_component(f.GetString(), suffix)) {
sstables_in_manifest.insert(f.GetString());
}
}
testlog.debug("SSTables in manifest.json: {}", fmt::join(sstables_in_manifest, ", "));
BOOST_REQUIRE_EQUAL(sstables_in_snapshot, sstables_in_manifest);
}
static future<> snapshot_works(const std::string& table_name) {
return do_with_some_data({"cf"}, [table_name] (cql_test_env& e) {
take_snapshot(e, "ks", table_name).get();
@@ -685,8 +651,6 @@ static future<> snapshot_works(const std::string& table_name) {
// all files were copied and manifest was generated
BOOST_REQUIRE_EQUAL(in_table_dir, in_snapshot_dir);
validate_manifest(snapshot_dir, in_snapshot_dir).get();
return make_ready_future<>();
}, true);
}
@@ -705,8 +669,7 @@ SEASTAR_TEST_CASE(index_snapshot_works) {
SEASTAR_TEST_CASE(snapshot_skip_flush_works) {
return do_with_some_data({"cf"}, [] (cql_test_env& e) {
db::snapshot_options opts = {.skip_flush = true};
take_snapshot(e, "ks", "cf", "test", opts).get();
take_snapshot(e, "ks", "cf", "test", true /* skip_flush */).get();
auto& cf = e.local_db().find_column_family("ks", "cf");
@@ -719,41 +682,6 @@ SEASTAR_TEST_CASE(snapshot_skip_flush_works) {
});
}
SEASTAR_TEST_CASE(snapshot_use_sstable_identifier_works) {
#ifndef SCYLLA_ENABLE_ERROR_INJECTION
fmt::print("Skipping test as it depends on error injection. Please run in mode where it's enabled (debug,dev).\n");
return make_ready_future<>();
#endif
sstring table_name = "cf";
// Force random sstable identifiers, otherwise the initial sstable_id is equal
// to the sstable generation and the test can't distinguish between them.
utils::get_local_injector().enable("random_sstable_identifier", false);
return do_with_some_data({table_name}, [table_name] (cql_test_env& e) -> future<> {
sstring tag = "test";
db::snapshot_options opts = {.use_sstable_identifier = true};
co_await take_snapshot(e, "ks", table_name, tag, opts);
auto& cf = e.local_db().find_column_family("ks", table_name);
auto table_directory = table_dir(cf);
auto snapshot_dir = table_directory / sstables::snapshots_dir / tag;
auto in_table_dir = co_await collect_files(table_directory);
// snapshot triggered a flush and wrote the data down.
BOOST_REQUIRE_GE(in_table_dir.size(), 9);
testlog.info("Files in table dir: {}", fmt::join(in_table_dir, ", "));
auto in_snapshot_dir = co_await collect_files(snapshot_dir);
testlog.info("Files in snapshot dir: {}", fmt::join(in_snapshot_dir, ", "));
in_table_dir.insert("manifest.json");
in_table_dir.insert("schema.cql");
// all files were copied and manifest was generated
BOOST_REQUIRE_EQUAL(in_table_dir.size(), in_snapshot_dir.size());
BOOST_REQUIRE_NE(in_table_dir, in_snapshot_dir);
co_await validate_manifest(snapshot_dir, in_snapshot_dir);
}, true);
}
SEASTAR_TEST_CASE(snapshot_list_okay) {
return do_with_some_data({"cf"}, [] (cql_test_env& e) {
auto& cf = e.local_db().find_column_family("ks", "cf");
@@ -1528,7 +1456,7 @@ SEASTAR_TEST_CASE(snapshot_with_quarantine_works) {
}
BOOST_REQUIRE(found);
co_await take_snapshot(e, "ks", "cf", "test", db::snapshot_options{.skip_flush = true});
co_await take_snapshot(e, "ks", "cf", "test", true /* skip_flush */);
testlog.debug("Expected: {}", expected);

View File

@@ -220,7 +220,7 @@ SEASTAR_TEST_CASE(test_query_counters) {
// Executes a batch of (modifying) statements and waits for it to complete.
auto process_batch = [&e](const std::vector<std::string_view>& queries, clevel cl) mutable {
e.execute_batch(queries, make_options(cl)).get();
e.execute_batch(queries, cql3::statements::batch_statement::type::UNLOGGED, make_options(cl)).get();
};
auto expected = get_query_metrics();

View File

@@ -346,60 +346,4 @@ SEASTAR_TEST_CASE(repair_rows_size_considers_external_memory) {
});
}
SEASTAR_TEST_CASE(test_tablet_token_range_count) {
{
// Simple case: one large range covers a smaller one
utils::chunked_vector<tablet_token_range> r1 = {{10, 20}};
utils::chunked_vector<tablet_token_range> r2 = {{0, 100}};
BOOST_REQUIRE(co_await count_finished_tablets(r1, r2) == 1);
}
{
// r2 ranges overlap and should merge to cover r1
// r2: [0, 50] + [40, 100] -> merges to [0, 100]
// r1: [10, 90] should be covered
utils::chunked_vector<tablet_token_range> r1 = {{10, 90}};
utils::chunked_vector<tablet_token_range> r2 = {{0, 50}, {40, 100}};
BOOST_REQUIRE(co_await count_finished_tablets(r1, r2) == 1);
}
{
// r2 ranges are adjacent (contiguous) and should merge
// r2: [0, 10] + [11, 20] -> merges to [0, 20]
// r1: [5, 15] should be covered
utils::chunked_vector<tablet_token_range> r1 = {{5, 15}};
utils::chunked_vector<tablet_token_range> r2 = {{0, 10}, {11, 20}};
BOOST_REQUIRE(co_await count_finished_tablets(r1, r2) == 1);
}
{
// r1 overlaps r2 but is not FULLY contained
// r2: [0, 10]
// r1: [5, 15] (Ends too late), [ -5, 5 ] (Starts too early)
utils::chunked_vector<tablet_token_range> r1 = {{5, 15}, {-5, 5}};
utils::chunked_vector<tablet_token_range> r2 = {{0, 10}};
BOOST_REQUIRE(co_await count_finished_tablets(r1, r2) == 0);
}
{
// A single merged range in r2 covers multiple distinct ranges in r1
utils::chunked_vector<tablet_token_range> r1 = {{10, 20}, {30, 40}, {50, 60}};
utils::chunked_vector<tablet_token_range> r2 = {{0, 100}};
BOOST_REQUIRE(co_await count_finished_tablets(r1, r2) == 3);
}
{
// Inputs are provided in random order, ensuring the internal sort works
utils::chunked_vector<tablet_token_range> r1 = {{50, 60}, {10, 20}};
utils::chunked_vector<tablet_token_range> r2 = {{50, 100}, {0, 40}};
// r2 merges effectively to [0, 40] and [50, 100]
// Both r1 items are covered
BOOST_REQUIRE(co_await count_finished_tablets(r1, r2) == 2);
}
{
utils::chunked_vector<tablet_token_range> r1 = {{10, 20}};
utils::chunked_vector<tablet_token_range> r2_empty = {};
utils::chunked_vector<tablet_token_range> r1_empty = {};
utils::chunked_vector<tablet_token_range> r2 = {{0, 100}};
BOOST_REQUIRE(co_await count_finished_tablets(r1, r2_empty) == 0);
BOOST_REQUIRE(co_await count_finished_tablets(r1_empty, r2) == 0);
}
}
BOOST_AUTO_TEST_SUITE_END()

View File

@@ -3388,8 +3388,11 @@ SEASTAR_TEST_CASE(test_non_full_and_empty_row_keys) {
.with_typed_column<bytes>("partition_key", [&] (const bytes& v) {
return partition_key::from_bytes(v).equal(*schema, dk.key());
})
.with_typed_column<bytes>("clustering_key", [&] (const bytes& v) {
return clustering_key::from_bytes(v).equal(*schema, ck);
.with_typed_column<bytes>("clustering_key", [&] (const bytes* v) {
if (!v) {
return ck.is_empty();
}
return clustering_key::from_bytes(*v).equal(*schema, ck);
})
.with_typed_column<sstring>("mutation_fragment_kind", "clustering row")
.with_typed_column<bytes>("frozen_mutation_fragment", [&] (const bytes& v) {

View File

@@ -49,5 +49,7 @@ custom_args:
- '-c1'
s3_test:
- '-c2 -m2G --logger-log-level s3=trace --logger-log-level http=trace --logger-log-level default_http_retry_strategy=trace'
batchlog_manager_test:
- '-c2 -m2G --logger-log-level batchlog_manager=trace:debug_error_injection=trace:testlog=trace'
run_in_debug:
- logalloc_standard_allocator_segment_pool_backend_test

View File

@@ -25,7 +25,7 @@ from typing import Any, Optional, override
import pytest
import requests
from cassandra import AlreadyExists, AuthenticationFailed, ConsistencyLevel, InvalidRequest, Unauthorized, Unavailable, WriteFailure
from cassandra.cluster import NoHostAvailable, Session
from cassandra.cluster import NoHostAvailable, Session, EXEC_PROFILE_DEFAULT
from cassandra.query import SimpleStatement, named_tuple_factory
from ccmlib.scylla_node import ScyllaNode, NodeError
@@ -1135,6 +1135,14 @@ class TestCQLAudit(AuditTester):
session.execute("DROP TABLE test1")
def _get_attempt_count(self, session: Session, *, execution_profile=EXEC_PROFILE_DEFAULT, consistency_level: ConsistencyLevel = ConsistencyLevel.ONE) -> int:
# dtest env is using FlakyRetryPolicy which has `max_retries` attribute
cl_profile = session.execution_profile_clone_update(execution_profile, consistency_level=consistency_level)
policy = cl_profile.retry_policy
retries = getattr(policy, "max_retries", None)
assert retries is not None
return 1 + retries
def _test_insert_failure_doesnt_report_success_assign_nodes(self, session: Session = None):
all_nodes: set[ScyllaNode] = set(self.cluster.nodelist())
assert len(all_nodes) == 7
@@ -1154,6 +1162,7 @@ class TestCQLAudit(AuditTester):
for i in range(256):
stmt = SimpleStatement(f"INSERT INTO ks.test1 (k, v1) VALUES ({i}, 1337)", consistency_level=ConsistencyLevel.THREE)
session.execute(stmt)
attempt_count = self._get_attempt_count(session, consistency_level=ConsistencyLevel.THREE)
token = rows_to_list(session.execute(f"SELECT token(k) FROM ks.test1 WHERE k = {i}"))[0][0]
@@ -1168,9 +1177,9 @@ class TestCQLAudit(AuditTester):
audit_partition_nodes = [address_to_node[address] for address in audit_nodes]
insert_node = address_to_node[insert_node.pop()]
kill_node = address_to_node[partitions.pop()]
return audit_partition_nodes, insert_node, kill_node, stmt.query_string
return audit_partition_nodes, insert_node, kill_node, stmt.query_string, attempt_count
return [], [], None, None
return [], [], None, None, None
@pytest.mark.exclude_errors("audit - Unexpected exception when writing log with: node_ip")
def test_insert_failure_doesnt_report_success(self):
@@ -1192,7 +1201,7 @@ class TestCQLAudit(AuditTester):
with self.assert_exactly_n_audit_entries_were_added(session, 1):
conn.execute(stmt)
audit_paritition_nodes, insert_node, node_to_stop, query_to_fail = self._test_insert_failure_doesnt_report_success_assign_nodes(session=session)
audit_paritition_nodes, insert_node, node_to_stop, query_to_fail, query_fail_count = self._test_insert_failure_doesnt_report_success_assign_nodes(session=session)
# TODO: remove the loop when scylladb#24473 is fixed
# We call get_host_id only to cache host_id
@@ -1231,8 +1240,8 @@ class TestCQLAudit(AuditTester):
# If any audit mode is not done yet, continue polling.
all_modes_done = True
for mode, rows in rows_dict.items():
rows_with_error = list(filter(lambda r: r.error, rows))
if len(rows_with_error) == 6:
rows_with_error = [row for row in rows if row.error and row.operation == query_to_fail]
if len(rows_with_error) == query_fail_count:
logger.info(f"audit mode {mode} log updated after {i} iterations ({i / 10}s)")
assert rows_with_error[0].error is True
assert rows_with_error[0].consistency == "THREE"

View File

@@ -26,7 +26,6 @@ from tools.assertions import (
assert_row_count_in_select_less,
)
from tools.data import insert_c1c2, rows_to_list
from tools.files import corrupt_file
from tools.metrics import get_node_metrics
@@ -678,6 +677,12 @@ class TestCommitLog(Tester):
node1.stop(gently=False)
# corrupt the commitlogs
def corrupt_file(file_path: str):
print(f"Writing junk into file header: {file_path}")
with open(file_path, "r+b") as f:
f.seek(18)
f.write(os.urandom(474))
for commitlog in self._get_commitlog_files():
corrupt_file(commitlog)
@@ -687,7 +692,7 @@ class TestCommitLog(Tester):
# check the data and the logs
in_table = rows_to_list(session.execute(f"SELECT * FROM {self.ks}.{self.cf};"))
assert not node1.grep_log(f"large_data - Writing large row {self.ks}/{self.cf}")
assert node1.grep_log("commitlog_replayer - Corrupted file:")
assert node1.grep_log("commitlog_replayer - Corrupted file")
assert in_table == []
def test_one_big_mutation_rollback_on_startup(self):

View File

@@ -1,25 +0,0 @@
#
# Copyright (C) 2025-present ScyllaDB
#
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#
import os
import random
def corrupt_file(file_path: str, chunk_size: int = 1024, percentage: float = 0.2) -> None:
"""
Corrupts a file by overwriting random chunks of its content with random data.
Args:
file_path (str): The path to the file to be corrupted.
chunk_size (int, optional): The size of each random data chunk to write, in bytes.
percentage (float, optional): The percentage of the file to corrupt, represented as a float between 0 and 1.
"""
with open(file_path, "r+b") as f:
file_size = os.path.getsize(file_path)
chunks = int(file_size * percentage / chunk_size) or 1
for _ in range(chunks):
random_position = random.randint(0, max(file_size - chunk_size, 0))
f.seek(random_position)
f.write(os.urandom(chunk_size))

View File

@@ -570,6 +570,90 @@ async def test_alternator_enforce_authorization_true(manager: ManagerClient):
# We could further test how GRANT works, but this would be unnecessary
# repeating of the tests in test/alternator/test_cql_rbac.py.
@pytest.mark.asyncio
async def test_index_requires_rf_rack_valid(manager: ManagerClient):
"""
Verify that creating a table with GSI or LSI fails with an appropriate error if
it's tablets based and rf_rack_valid_keyspaces is disabled.
Adding a GSI to an existing table with tablets should fail as well.
"""
servers = await manager.servers_add(1, config=alternator_config | {"rf_rack_valid_keyspaces": False})
alternator = get_alternator(servers[0].ip_addr)
expected_err_create = "GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled."
expected_err_update_add_gsi = "GlobalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled."
def create_table_with_index(alternator, table_name, index_type, initial_tablets):
create_table_args = dict(
TableName=table_name,
BillingMode='PAY_PER_REQUEST',
Tags=[{'Key': 'system:initial_tablets', 'Value': initial_tablets}],
KeySchema=[
{'AttributeName': 'p', 'KeyType': 'HASH'},
{'AttributeName': 'c', 'KeyType': 'RANGE'}
],
AttributeDefinitions=[
{'AttributeName': 'p', 'AttributeType': 'S'},
{'AttributeName': 'c', 'AttributeType': 'S'}
],
)
if index_type == "GSI":
create_table_args["GlobalSecondaryIndexes"] = [
{
'IndexName': 'gsi1',
'KeySchema': [
{'AttributeName': 'c', 'KeyType': 'HASH'},
{'AttributeName': 'p', 'KeyType': 'RANGE'},
],
'Projection': {'ProjectionType': 'ALL'}
}
]
elif index_type == "LSI":
create_table_args["LocalSecondaryIndexes"] = [
{
'IndexName': 'lsi1',
'KeySchema': [
{'AttributeName': 'p', 'KeyType': 'HASH'},
{'AttributeName': 'c', 'KeyType': 'RANGE'},
],
'Projection': {'ProjectionType': 'ALL'}
}
]
alternator.create_table(**create_table_args)
# Create a table with tablets and GSI or LSI - should fail because rf_rack_valid_keyspaces is disabled
for index_type in ["GSI", "LSI"]:
with pytest.raises(ClientError, match=expected_err_create):
create_table_with_index(alternator, unique_table_name(), index_type, initial_tablets='1')
# Now create the table without tablets - should succeed
for index_type in ["GSI", "LSI"]:
create_table_with_index(alternator, unique_table_name(), index_type, initial_tablets='none')
# Create a table with tablets and no indexes, then add a GSI - the update should fail
table_name = unique_table_name()
create_table_with_index(alternator, table_name, index_type=None, initial_tablets='1')
with pytest.raises(ClientError, match=expected_err_update_add_gsi):
alternator.meta.client.update_table(
TableName=table_name,
AttributeDefinitions=[
{'AttributeName': 'c', 'AttributeType': 'S'},
{'AttributeName': 'p', 'AttributeType': 'S'},
],
GlobalSecondaryIndexUpdates=[
{
'Create': {
'IndexName': 'gsi1',
'KeySchema': [
{'AttributeName': 'c', 'KeyType': 'HASH'},
{'AttributeName': 'p', 'KeyType': 'RANGE'},
],
'Projection': {'ProjectionType': 'ALL'}
}
}
]
)
# Unfortunately by default a Python thread print the exception that kills
# it (e.g., pytest assert failures) but it doesn't propagate the exception
# to the join() - so the overall test doesn't fail. The following ThreadWrapper

View File

@@ -56,7 +56,7 @@ async def test_batchlog_replay_while_a_node_is_down(manager: ManagerClient) -> N
await manager.server_stop(servers[1].server_id)
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog", host=hosts[0]))[0].count
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog_v2", host=hosts[0]))[0].count
assert batchlog_row_count > 0
await asyncio.gather(*[manager.api.disable_injection(s.ip_addr, "skip_batch_replay") for s in servers if s != servers[1]])
@@ -72,7 +72,7 @@ async def test_batchlog_replay_while_a_node_is_down(manager: ManagerClient) -> N
await s0_log.wait_for('Finished replayAllFailedBatches', timeout=60, from_mark=s0_mark)
async def batchlog_empty() -> bool:
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog", host=hosts[0]))[0].count
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog_v2", host=hosts[0]))[0].count
if batchlog_row_count == 0:
return True
await wait_for(batchlog_empty, time.time() + 60)
@@ -120,7 +120,7 @@ async def test_batchlog_replay_aborted_on_shutdown(manager: ManagerClient) -> No
await asyncio.gather(*[manager.api.disable_injection(s.ip_addr, "skip_batch_replay") for s in servers if s != servers[1]])
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog", host=hosts[0]))[0].count
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog_v2", host=hosts[0]))[0].count
assert batchlog_row_count > 0
# The batch is replayed while server 1 is down
@@ -136,7 +136,7 @@ async def test_batchlog_replay_aborted_on_shutdown(manager: ManagerClient) -> No
hosts = await wait_for_cql_and_get_hosts(cql, servers, time.time() + 60)
async def batchlog_empty() -> bool:
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog", host=hosts[0]))[0].count
batchlog_row_count = (await cql.run_async("SELECT COUNT(*) FROM system.batchlog_v2", host=hosts[0]))[0].count
if batchlog_row_count == 0:
return True
await wait_for(batchlog_empty, time.time() + 60)
@@ -237,7 +237,7 @@ async def test_drop_mutations_for_dropped_table(manager: ManagerClient) -> None:
host1, _ = hosts
async def get_batchlog_row_count():
rows = await cql.run_async("SELECT COUNT(*) FROM system.batchlog", host=host1)
rows = await cql.run_async("SELECT COUNT(*) FROM system.batchlog_v2", host=host1)
row = rows[0]
assert hasattr(row, "count")
@@ -313,12 +313,14 @@ async def test_batchlog_replay_failure_during_repair(manager: ManagerClient, rep
9. Verify that the row is deleted and there is no data resurrection.
"""
write_timeout_ms = 2000
cmdline=['--enable-cache', '0',
"--hinted-handoff-enabled", "0",
"--logger-log-level", "batchlog_manager=trace:repair=debug",
"--repair-hints-batchlog-flush-cache-time-in-ms", "1000000" if repair_cache else "0"]
config = {"error_injections_at_startup": ["short_batchlog_manager_replay_interval"],
"write_request_timeout_in_ms": 2000,
"write_request_timeout_in_ms": write_timeout_ms,
'group0_tombstone_gc_refresh_interval_in_ms': 1000,
'tablets_mode_for_new_keyspaces': 'disabled'
}
@@ -330,7 +332,7 @@ async def test_batchlog_replay_failure_during_repair(manager: ManagerClient, rep
host1, _ = hosts
async def get_batchlog_row_count():
rows = await cql.run_async("SELECT COUNT(*) FROM system.batchlog", host=host1)
rows = await cql.run_async("SELECT COUNT(*) FROM system.batchlog_v2", host=host1)
row = rows[0]
assert hasattr(row, "count")
@@ -376,7 +378,7 @@ async def test_batchlog_replay_failure_during_repair(manager: ManagerClient, rep
await cql.run_async(f"DELETE FROM {ks}.my_table WHERE pk=0 AND ck=0")
time.sleep(2)
await asyncio.sleep((write_timeout_ms // 1000) * 2 + 1)
batchlog_row_count = await get_batchlog_row_count()
assert batchlog_row_count > 0
@@ -386,7 +388,7 @@ async def test_batchlog_replay_failure_during_repair(manager: ManagerClient, rep
# Once the mutations are in the batchlog, waiting to be replayed, we can disable this.
await disable_injection("storage_proxy_fail_remove_from_batchlog")
await enable_injection("batch_replay_throw")
await enable_injection("storage_proxy_fail_replay_batch")
# Once the table is dropped, we can resume the replay. The bug can
# be triggered from now on (if it's present).
await disable_injection("skip_batch_replay")
@@ -402,7 +404,7 @@ async def test_batchlog_replay_failure_during_repair(manager: ManagerClient, rep
await manager.api.keyspace_compaction(s1.ip_addr, ks)
await manager.api.keyspace_compaction(s2.ip_addr, ks)
await disable_injection("batch_replay_throw")
await disable_injection("storage_proxy_fail_replay_batch")
await s1_log.wait_for("Replaying batch", timeout=60, from_mark=s1_mark)
await s1_log.wait_for("Finished replayAllFailedBatches", timeout=60, from_mark=s1_mark)

View File

@@ -220,14 +220,14 @@ async def test_tablet_repair_sstable_skipped_read_metrics(manager: ManagerClient
await insert_keys(cql, ks, 0, 100)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental')
skipped_bytes = get_incremental_repair_sst_skipped_bytes(servers[0])
read_bytes = get_incremental_repair_sst_read_bytes(servers[0])
# Nothing to skip. Repair all data.
assert skipped_bytes == 0
assert read_bytes > 0
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental')
skipped_bytes2 = get_incremental_repair_sst_skipped_bytes(servers[0])
read_bytes2 = get_incremental_repair_sst_read_bytes(servers[0])
# Skip all. Nothing to repair
@@ -236,7 +236,7 @@ async def test_tablet_repair_sstable_skipped_read_metrics(manager: ManagerClient
await insert_keys(cql, ks, 200, 300)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental')
skipped_bytes3 = get_incremental_repair_sst_skipped_bytes(servers[0])
read_bytes3 = get_incremental_repair_sst_read_bytes(servers[0])
# Both skipped and read bytes should grow
@@ -272,7 +272,7 @@ async def test_tablet_incremental_repair(manager: ManagerClient):
assert get_sstables_repaired_at(map0, token) == sstables_repaired_at
# First repair
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental')
map1 = await load_tablet_sstables_repaired_at(manager, cql, servers[0], hosts[0], table_id)
logging.info(f'map1={map1}')
# Check sstables_repaired_at is increased by 1
@@ -288,7 +288,7 @@ async def test_tablet_incremental_repair(manager: ManagerClient):
assert len(enable) == 1
# Second repair
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental')
map2 = await load_tablet_sstables_repaired_at(manager, cql, servers[0], hosts[0], table_id)
logging.info(f'map2={map2}')
# Check sstables_repaired_at is increased by 1
@@ -313,7 +313,7 @@ async def test_tablet_incremental_repair_error(manager: ManagerClient):
# Repair should not finish with error
await inject_error_on(manager, "repair_tablet_fail_on_rpc_call", servers)
try:
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, timeout=10)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental', timeout=10)
assert False # Check the tablet repair is not supposed to finish
except TimeoutError:
logger.info("Repair timeout as expected")
@@ -329,7 +329,7 @@ async def do_tablet_incremental_repair_and_ops(manager: ManagerClient, ops: str)
servers, cql, hosts, ks, table_id, logs, repaired_keys, unrepaired_keys, current_key, token = await preapre_cluster_for_incremental_repair(manager, nr_keys)
token = -1
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental')
# 1 add 0 skip 1 mark
for log in logs:
sst_add, sst_skip, sst_mark = await get_sst_status("First", log)
@@ -355,7 +355,7 @@ async def do_tablet_incremental_repair_and_ops(manager: ManagerClient, ops: str)
else:
assert False # Wrong ops
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental')
# 1 add 1 skip 1 mark
for log in logs:
@@ -394,7 +394,7 @@ async def test_tablet_incremental_repair_and_minor(manager: ManagerClient):
await manager.api.disable_autocompaction(server.ip_addr, ks, 'test')
# First repair
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental')
# Insert more keys
await insert_keys(cql, ks, current_key, current_key + nr_keys)
@@ -402,7 +402,7 @@ async def test_tablet_incremental_repair_and_minor(manager: ManagerClient):
current_key += nr_keys
# Second repair
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental')
# Insert more keys and flush to get 2 more sstables
for _ in range(2):
@@ -436,7 +436,7 @@ async def do_test_tablet_incremental_repair_with_split_and_merge(manager, do_spl
servers, cql, hosts, ks, table_id, logs, repaired_keys, unrepaired_keys, current_key, token = await preapre_cluster_for_incremental_repair(manager, nr_keys)
# First repair
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token) # sstables_repaired_at 1
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental') # sstables_repaired_at 1
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token) # sstables_repaired_at 2
# Insert more keys
@@ -445,7 +445,7 @@ async def do_test_tablet_incremental_repair_with_split_and_merge(manager, do_spl
current_key += nr_keys
# Second repair
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token) # sstables_repaired_at 3
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental') # sstables_repaired_at 3
# Insert more keys and flush to get 2 more sstables
for _ in range(2):
@@ -505,7 +505,7 @@ async def test_tablet_incremental_repair_existing_and_repair_produced_sstable(ma
await manager.server_start(servers[1].server_id)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental')
scylla_path = get_scylla_path(cql)
@@ -521,8 +521,8 @@ async def test_tablet_incremental_repair_merge_higher_repaired_at_number(manager
servers, cql, hosts, ks, table_id, logs, repaired_keys, unrepaired_keys, current_key, token = await preapre_cluster_for_incremental_repair(manager, nr_keys)
# First repair
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token) # sstables_repaired_at 1
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token) # sstables_repaired_at 2
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental') # sstables_repaired_at 1
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental') # sstables_repaired_at 2
# Insert more keys
await insert_keys(cql, ks, current_key, current_key + nr_keys)
@@ -532,7 +532,7 @@ async def test_tablet_incremental_repair_merge_higher_repaired_at_number(manager
# Second repair
await inject_error_on(manager, "repair_tablet_no_update_sstables_repair_at", servers)
# some sstable repaired_at = 3, but sstables_repaired_at = 2
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token) # sstables_repaired_at 2
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental') # sstables_repaired_at 2
await inject_error_off(manager, "repair_tablet_no_update_sstables_repair_at", servers)
scylla_path = get_scylla_path(cql)
@@ -561,8 +561,8 @@ async def test_tablet_incremental_repair_merge_correct_repaired_at_number_after_
servers, cql, hosts, ks, table_id, logs, repaired_keys, unrepaired_keys, current_key, token = await preapre_cluster_for_incremental_repair(manager, nr_keys)
# First repair
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token) # sstables_repaired_at 1
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token) # sstables_repaired_at 2
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental') # sstables_repaired_at 1
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental') # sstables_repaired_at 2
# Insert more keys
await insert_keys(cql, ks, current_key, current_key + nr_keys)
@@ -574,7 +574,7 @@ async def test_tablet_incremental_repair_merge_correct_repaired_at_number_after_
last_tokens = [t.last_token for t in replicas]
for t in last_tokens[0::2]:
logging.info(f"Start repair for token={t}");
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", t) # sstables_repaired_at 3
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", t, incremental_mode='incremental') # sstables_repaired_at 3
scylla_path = get_scylla_path(cql)
@@ -595,7 +595,7 @@ async def do_test_tablet_incremental_repair_merge_error(manager, error):
servers, cql, hosts, ks, table_id, logs, repaired_keys, unrepaired_keys, current_key, token = await preapre_cluster_for_incremental_repair(manager, nr_keys, cmdline)
# First repair
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token) # sstables_repaired_at 1
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental') # sstables_repaired_at 1
# Insert more keys
await insert_keys(cql, ks, current_key, current_key + nr_keys)
@@ -659,13 +659,18 @@ async def test_tablet_repair_with_incremental_option(manager: ManagerClient):
assert read1 == 0
assert skip2 == 0
assert read2 > 0
await do_repair_and_check(None, 1, rf'Starting tablet repair by API .* incremental_mode=incremental.*', check1)
await do_repair_and_check('incremental', 1, rf'Starting tablet repair by API .* incremental_mode=incremental.*', check1)
def check2(skip1, read1, skip2, read2):
assert skip1 == skip2
assert read1 == read2
await do_repair_and_check('disabled', 0, rf'Starting tablet repair by API .* incremental_mode=disabled.*', check2)
# FIXME: Incremental repair is disabled by default due to
# https://github.com/scylladb/scylladb/issues/26041 and
# https://github.com/scylladb/scylladb/issues/27414
await do_repair_and_check(None, 0, rf'Starting tablet repair by API .* incremental_mode=disabled.*', check2)
def check3(skip1, read1, skip2, read2):
assert skip1 < skip2
assert read1 == read2
@@ -677,14 +682,14 @@ async def test_tablet_repair_with_incremental_option(manager: ManagerClient):
await do_repair_and_check('full', 1, rf'Starting tablet repair by API .* incremental_mode=full.*', check4)
@pytest.mark.asyncio
async def test_tablet_repair_tablet_time_metrics(manager: ManagerClient):
async def test_incremental_repair_tablet_time_metrics(manager: ManagerClient):
servers, _, _, ks, _, _, _, _, _, token = await preapre_cluster_for_incremental_repair(manager)
time1 = 0
time2 = 0
for s in servers:
time1 += get_repair_tablet_time_ms(s)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token)
await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, incremental_mode='incremental')
for s in servers:
time2 += get_repair_tablet_time_ms(s)
@@ -694,7 +699,7 @@ async def test_tablet_repair_tablet_time_metrics(manager: ManagerClient):
# Reproducer for https://github.com/scylladb/scylladb/issues/26346
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_repair_finishes_when_tablet_skips_end_repair_stage(manager):
async def test_incremental_repair_finishes_when_tablet_skips_end_repair_stage(manager):
servers = await manager.servers_add(3, auto_rack_dc="dc1")
async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3} AND tablets = {'initial': 1}") as ks:
@@ -719,7 +724,7 @@ async def test_repair_finishes_when_tablet_skips_end_repair_stage(manager):
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
async def test_repair_rejoin_do_tablet_operation(manager):
async def test_incremental_repair_rejoin_do_tablet_operation(manager):
cmdline = ['--logger-log-level', 'raft_topology=debug']
servers = await manager.servers_add(3, auto_rack_dc="dc1", cmdline=cmdline)

View File

@@ -43,86 +43,6 @@ async def guarantee_repair_time_next_second():
# different than the previous one.
await asyncio.sleep(1)
async def do_test_tablet_repair_progress_split_merge(manager: ManagerClient, do_split=False, do_merge=False):
nr_tablets = 16
servers, cql, hosts, ks, table_id = await create_table_insert_data_for_repair(manager, fast_stats_refresh=True, tablets=nr_tablets)
token = 'all'
logs = []
for s in servers:
logs.append(await manager.server_open_log(s.server_id))
# Skip repair for the listed tablet id
nr_tablets_skipped = 4
nr_tablets_repaired = nr_tablets - nr_tablets_skipped
await inject_error_on(manager, "tablet_repair_skip_sched", servers, params={'value':"0,1,5,8"})
# Request to repair all tablets
repair_res = await manager.api.tablet_repair(servers[0].ip_addr, ks, "test", token, await_completion=False)
logging.info(f'{repair_res=}')
tablet_task_id = repair_res['tablet_task_id']
async def get_task_status(desc):
task_status = await manager.api.get_task_status(servers[0].ip_addr, tablet_task_id)
completed = int(task_status['progress_completed'])
total = int(task_status['progress_total'])
logging.info(f'{desc=} {completed=} {total=} {task_status=}')
return completed, total
async def wait_task_progress(wanted_complete, wanted_total):
while True:
completed, total = await get_task_status("wait_task_progress")
if completed == wanted_complete and total == wanted_total:
break
await asyncio.sleep(1)
async def get_task_status_and_check(desc):
completed, total = await get_task_status(desc)
assert completed == nr_tablets_repaired
assert total == nr_tablets
# 12 out of 16 tablets should finish
await wait_task_progress(nr_tablets_repaired, nr_tablets)
if do_split:
await get_task_status_and_check("before_split")
s1_mark = await logs[0].mark()
await inject_error_on(manager, "tablet_force_tablet_count_increase", servers)
await logs[0].wait_for('Detected tablet split for table', from_mark=s1_mark)
await inject_error_off(manager, "tablet_force_tablet_count_increase", servers)
await get_task_status_and_check("after_split")
if do_merge:
await get_task_status_and_check("before_merge")
s1_mark = await logs[0].mark()
await inject_error_on(manager, "tablet_force_tablet_count_decrease", servers)
await logs[0].wait_for('Detected tablet merge for table', from_mark=s1_mark)
await inject_error_off(manager, "tablet_force_tablet_count_decrease", servers)
await get_task_status_and_check("after_merge")
# Wait for all repair to finish after all tablets can be scheduled to run repair
await inject_error_off(manager, "tablet_repair_skip_sched", servers)
await wait_task_progress(nr_tablets, nr_tablets)
@skip_mode('release', 'error injections are not supported in release mode')
@pytest.mark.asyncio
async def test_tablet_repair_progress(manager: ManagerClient):
await do_test_tablet_repair_progress_split_merge(manager, do_split=False, do_merge=False)
@skip_mode('release', 'error injections are not supported in release mode')
@pytest.mark.asyncio
async def test_tablet_repair_progress_split(manager: ManagerClient):
await do_test_tablet_repair_progress_split_merge(manager, do_split=True)
@pytest.mark.asyncio
@pytest.mark.skip(reason="https://github.com/scylladb/scylladb/issues/26844")
@skip_mode('release', 'error injections are not supported in release mode')
async def test_tablet_repair_progress_merge(manager: ManagerClient):
await do_test_tablet_repair_progress_split_merge(manager, do_merge=True)
@pytest.mark.asyncio
async def test_tablet_manual_repair(manager: ManagerClient):
servers, cql, hosts, ks, table_id = await create_table_insert_data_for_repair(manager, fast_stats_refresh=False, disable_flush_cache_time=True)

View File

@@ -150,6 +150,7 @@ def test_cannot_order_with_ann_on_non_vector_column(cql, test_keyspace):
# assertThatThrownBy(() -> execute("SELECT * FROM %s ORDER BY value ann of [0.0, 0.0] LIMIT 2")).isInstanceOf(InvalidRequestException.class);
# }
@pytest.mark.xfail(reason="As we do not support filtering yet, wrong error is thrown, once VECTOR-374 is done, this should pass")
def test_must_have_limit_specified_and_within_max_allowed(cql, test_keyspace):
with create_table(cql, test_keyspace, "(k int PRIMARY KEY, v vector<float, 1>)") as table:
custom_index = "vector_index" if is_scylla(cql) else "StorageAttachedIndex"
@@ -187,6 +188,7 @@ def test_must_have_limit_specified_and_within_max_allowed(cql, test_keyspace):
# assertEquals(1, result.size());
# }
@pytest.mark.xfail(reason="As we do not support filtering yet, wrong error is thrown, once VECTOR-374 is done, this should pass")
def test_cannot_have_aggregation_on_ann_query(cql, test_keyspace):
with create_table(cql, test_keyspace, "(k int PRIMARY KEY, v vector<float, 1>, c int)") as table:
custom_index = "vector_index" if is_scylla(cql) else "StorageAttachedIndex"

Some files were not shown because too many files have changed in this diff Show More