Compare commits

...

113 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
6851de4899 Fix token kind comparison in decorated_key::tri_compare
When comparing decorated_key with ring_position, we need to account for
the token kind. decorated_key tokens are always token_kind::key, but
ring_position tokens can be before_all_keys or after_all_keys.

The previous version incorrectly compared only _data fields, which would
produce wrong results when the ring_position token had a different kind.

Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
2026-01-29 13:58:03 +00:00
copilot-swe-agent[bot]
79aa26becd Replace dht::token with int64_t in decorated_key
- Change decorated_key._token from dht::token to int64_t _token_data
- Update token() method to return constructed dht::token
- Update all direct field accesses to use token() method
- Update comparisons to use _token_data directly for efficiency
- Remove token's external_memory_usage contribution (was 0)
- Update formatters and hash functions

This eliminates 8 bytes of bloat per decorated_key instance by removing
the redundant token_kind field (always token_kind::key for decorated_key).

Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
2026-01-29 13:54:46 +00:00
copilot-swe-agent[bot]
eb9852499c Initial plan 2026-01-29 13:49:15 +00:00
Botond Dénes
a8767f36da Merge 'Improve load balancer logging and other minor cleanups' from Tomasz Grabiec
Contains various improvements to tablet load balancer. Batched together to save on the bill for CI.

Most notably:
 - Make plan summary more concise, and print info only about present elements.
 - Print rack name in addition to DC name when making a per-rack plan
 - Print "Not possible to achieve balance" only when this is the final plan with no active migrations
 - Print per-node stats when "Not possible to achieve balance" is printed
 - amortize metrics lookup cost
 - avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring"

Backport to 2026.1: since the changes enhance debuggability and are relatively low risk

Fixes #28423
Fixes #28422

Closes scylladb/scylladb#28337

* github.com:scylladb/scylladb:
  tablets: tablet_allocator.cc: Convert tabs to spaces
  tablets: load_balancer: Warn about incomplete stats once for all offending nodes
  tablets: load_balancer: Improve node stats printout
  tablets: load_balancer: Warn about imbalance only when there are no more active migrations
  tablets: load_balancer: Extract print_node_stats()
  tablet: load_balancer: Use empty() instead of size() where applicable
  tablets: Fix redundancy in migration_plan::empty()
  tablets: Cache pointer to stats during plan-making
  tablets: load_balancer: Print rack in addition to DC when giving context
  tablets: load_balancer: Make plan summary concise
  tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
2026-01-29 08:25:17 +02:00
Piotr Dulikowski
ec6a2661de Merge 'Keep view_builder background fiber in maintenance scheduling group' from Pavel Emelyanov
In fact, it's partially there already. When view_builder::start() is called is first calls initialization code (the start_in_background() method), then kicks do_build_step() that runs a background fiber to perform build steps. The starting code inherits scheduling group from main(). And the step fiber code needs to run itself in a maintenance scheduling group, so it explicitly grabs one via database->db_config.

This PR mainly gets rid of the call to database::get_streaming_scheduling_group() from do_build_step() as preparation to splitting the streaming scheduling group into parts (see SCYLLADB-351). To make it happen the do_build_step() is patched to inherit its scheduling group from view_builder::start() and the start() itself is called by main from maintenance scheduling group (like for other view building services).

New feature (nested scheduling group), not backporting

Closes scylladb/scylladb#28386

* github.com:scylladb/scylladb:
  view_builder: Start background in maintenance group
  view_builder: Wake-up step fiber with condition variable
2026-01-28 20:49:19 +01:00
Pavel Emelyanov
cb1d05d65a streaming: Get streaming sched group from debug:: namespace
In a lambda returned from make_streaming_consumer() there's a check for
current scheudling group being streaming one. It came from #17090 where
streaming code was launched in wrong sched group thus affecting user
groups in a bad way.

The check is nice and useful, but it abuses replica::database by getting
unrelated information from it.

To preserve the check and to stop using database as provider of configs,
keep the streaming scheduling group handle in the debug namespace. This
emphasises that this global variable is purely for debugging purposes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28410
2026-01-28 19:14:59 +02:00
Marcin Maliszkiewicz
5d4e2ec522 Merge 'docs: add documentation for automatic repair' from Botond Dénes
Explain what automatic repair is and how to configure it. While at it, improve the existing repair documentation a bit.

Fixes: SCYLLADB-130

This PR missed the 2026.1 branch date, so it needs backport to 2026.1, where the auto repair feature debuts.

Closes scylladb/scylladb#28199

* github.com:scylladb/scylladb:
  docs: add feature page for automatic repair
  docs: inter-link incremental-repair and repair documents
  docs: incremental-repair: fix curl example
2026-01-28 17:46:53 +01:00
Nadav Har'El
1454228a05 test/cqlpy: fix "assertion rewriting" in translated Cassandra tests
One of the best features of the pytest framework is "assertion
rewriting": If your test does for example "assert a + 1 == b", the
assertion is "rewritten" so that if it fails it tells you not only
that "a+1" and "b" are not equal, what the non-equal values are,
how they are not equal (e.g., find different elements of arrays) and
how each side of the equality was calculated.

But pytest can only "rewrite" assertion that it sees. If you call a
utility function checksomething() from another module and that utility
function calls assert, it will not be able to rewrite it, and you'll
get ugly, hard-to-debug, assertion failures.

This problem is especially noticable in tests we translated from
Cassandra, in test/cqlpy/cassandra_tests. Those tests use a bunch of
assertion-performing utility functions like assertRows() et al.
Those utility functions are defined in a separate source file,
porting.py, so by default do not get their assertions rewritten.

We had a solution for this: test/cqlpy/cassandra_test/__init__.py had:

    pytest.register_assert_rewrite("cassandra_tests.porting")

This tells pytest to rewrite assertions in porting.py the first time
that it is imported.

It used to work well, but recently it stopped working. This is because
we change the module paths recently, and it should be written as
test.cqlpy.cassandra_tests.porting.

I verified by editing one of the cassandra_tests to make a bad check
that indeed this statement stopped working, and fixing the module
path in this way solves it, and makes assertion rewriting work
again.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#28411
2026-01-28 18:34:57 +02:00
Pavel Emelyanov
3ebd02513a view_builder: Start background in maintenance group
Currently view_builder::start() is called in default scheduling group.
Once it initializes itself, it wakes up the step fiber that explicitly
switches to maintenance scheduling group.

This explicit switch made sence before previous patch, when the fiber
was implemented as a serialized action. Now the fiber starts directly
from .start() method and can inherit scheduling group from it.

Said that, main code calls view_builder::start() in maintenance
scheduling group killing two birds with one stone. First, the step fiber
no longer needs borrow its scheduling group indirectly via database.
Second, the start_in_background() code itself runs in a more suitable
scheduling group.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:34:59 +03:00
Pavel Emelyanov
2439d27b60 view_builder: Wake-up step fiber with condition variable
View builder runs a background fiber that perform build steps. To kick
the fiber it uses serizlized action, but it's an overkill -- nobody
waits for the action to finish, but on stop, when it's joined.

This patch uses condition variable to kick the fiber, and starts it
instantly, in the place where serialized action was first kicked.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:34:58 +03:00
Botond Dénes
1713d75c0d docs: add feature page for automatic repair
Explain what the feature is and how to confiture it.
Inter-link all the repair related pages, so one can discover all about
repair, regardless of which page they land on.
2026-01-28 16:45:57 +02:00
Tomasz Grabiec
df949dc506 Merge 'topology_coordinator: make cleanup reliable on barrier failures' from Łukasz Paszkowski
Fix a subtle but damaging failure mode in the tablet migration state machine: when a barrier fails, the follow-up barrier is triggered asynchronously, and cleanup can get skipped for that iteration. On the next loop, the original failure may no longer be visible (because the failing node got excluded), so the tablet can incorrectly move forward instead of entering `cleanup_target`.

To make cleanup reliable this PR:

Adds an additional “fallback cleanup” stage

- `write_both_read_old_fallback_cleanup`

that does not modify read/write selectors. This stage is safe to enter immediately after a barrier failure, and it funnels the tablet into cleanup with the required barriers.

Avoids changing both read and write selectors in a single step transitioning from `write_both_read_new` to `cleanup_target`. The fallback path updates selectors in a safe order: read first, then write.

Allows a direct no-barrier transition from `allow_write_both_read_old` to `cleanup_target` after failure, because in that specific case `cleanup_target` doesn’t change selectors and the hop is safe.

No need for backport. It's an improvement. Currently, tablets transition to `cleanup_target` eventually via failed streaming.

Closes scylladb/scylladb#28169

* github.com:scylladb/scylladb:
  topology_coordinator: add write_both_read_old_fallback_cleanup state
  topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier
  topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old
  topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old
2026-01-28 13:33:39 +01:00
Botond Dénes
ee631f31a0 Merge 'Do not export system keyspace from raft_group0_client' from Pavel Emelyanov
There are few places that use raft_group0_client as a way to get to system_keyspace. Mostly they can live without it -- either the needed reference is already at hand, or it's (ab)used to get to the database reference. The only place that really needs the system keyspace is the state merger code that needs last state ID. For that, the explicit helper method is added to group0_client.

Refining API between components, not backporting

Closes scylladb/scylladb#28387

* github.com:scylladb/scylladb:
  raft_group0_client: Dont export system keyspace
  raft_group0_client: Add and use get_last_group0_state_id()
  group0_state_machine: Call ensure_group0_sched() with data_dictionary
  view_building_worker: Use its own system_keyspace& reference
2026-01-28 13:24:32 +02:00
Yaron Kaikov
7c49711906 test/cqlpy: Remove redundant pytest.register_assert_rewrite call
During test.py run, noticed this warning:
```
10:38:22  test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: 32 warnings
10:38:22    /jenkins/workspace/releng-testing/scylla-ci/scylla/test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: test.cqlpy.cassandra_tests.porting
10:38:22      pytest.register_assert_rewrite('test.cqlpy.cassandra_tests.porting')
```

The insert_update_if_condition_test.py was calling
pytest.register_assert_rewrite() for the porting module, but this
registration is already handled by cassandra_tests/__init__.py which
is automatically loaded before any test runs.

Closes scylladb/scylladb#28409
2026-01-28 13:17:05 +02:00
Avi Kivity
42fdea7410 github: fix iwyu workflow permissions
The include-what-you-use workflow fails with

```
Invalid workflow file: .github/workflows/iwyu.yaml#L25
The workflow is not valid. .github/workflows/iwyu.yaml (Line: 25, Col: 3): Error calling workflow 'scylladb/scylladb/.github/workflows/read-toolchain.yaml@257054deffbef0bde95f0428dc01ad10d7b30093'. The nested job 'read-toolchain' is requesting 'contents: read', but is only allowed 'contents: none'.
```

Fix by adding the correct permissions.

Closes scylladb/scylladb#28390
2026-01-28 12:38:54 +02:00
Jakub Smolar
e1f623dd69 skip_mode: Allow multiple build modes in pytest skip_mode marker
Enhance the skip_mode marker to accept either a single mode string
or a list of modes, allowing tests to be skipped across multiple
build configurations with a single marker.

Before:
  @pytest.mark.skip_mode("dev", reason="...")
  @pytest.mark.skip_mode("debug", reason="...")

After:
  @pytest.mark.skip_mode(["dev", "debug"], reason="...")

This reduces duplication when the same skip condition applies
to multiple build modes.

Closes scylladb/scylladb#28406
2026-01-28 12:27:41 +02:00
Patryk Jędrzejczak
a2c1569e04 test: test_gossiper_orphan_remover: get host ID of the bootstrapping node before it crashes
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.

We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await log.wait_for("finished do_send_ack2_msg")` above guarantees that the
node has started the gossip shadow round, which happens after starting the REST
API.

Fixes #28385

Closes scylladb/scylladb#28388
2026-01-28 10:54:22 +02:00
Avi Kivity
8d2689d1b5 build: avoid sccache by default for Rust targets
A bug[1] in sccache prevents correct distributed compilation of wasmtime.

Disable it by default for now, but allow users to enable it.

[1] https://github.com/mozilla/sccache/issues/2575

Closes scylladb/scylladb#28389
2026-01-28 10:36:49 +02:00
Pavel Emelyanov
2ffe5b7d80 tablet_allocator: Have its own explicit background scheduling group
Currently, tablet_allocator switches to streaming scheduling group that
it gets from database. It's not nice to use database as provider of
configs/scheduling_groups.

This patch adds a background scheduling group for tablet allocator
configured via its config and sets it to streaming group in main.cc
code.

This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28356
2026-01-28 10:34:28 +02:00
Avi Kivity
47315c63dc treewide: include Seastar headers with angle brackets
Seastar is a "system" library from our point of view, so
should be included with angle brackets.

Closes scylladb/scylladb#28395
2026-01-28 10:33:06 +02:00
Botond Dénes
b7dccdbe93 Merge 'test/storage: speed up out-of-space prevention tests' from Łukasz Paszkowski
This PR reduces the runtime of `test_out_of_space_prevention.py` by addressing two main sources of overhead: slow “critical utilization” setup and delayed tablet load stats propagation. Combined, these changes cut the module’s total execution time from 324s to 185s.

Improvements. No backup is required.

Closes scylladb/scylladb#28396

* github.com:scylladb/scylladb:
  test/storage: speed up out-of-space prevention tests by using smaller volumes
  test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests
2026-01-28 10:28:20 +02:00
Marcin Maliszkiewicz
931a38de6e service: remove unused has_schema_access
It became unused after we dropped support for thrift
in ad649be1bf

Closes scylladb/scylladb#28341
2026-01-28 10:18:26 +02:00
Pavel Emelyanov
834921251b test: Replace memory_data_source with seastar::util::as_input_stream
The existing test-only implementation is a simplified version of the
generic one.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28339
2026-01-28 10:15:03 +02:00
Andrei Chekun
335e81cdf7 test.py: migrate nodetool to run by pytest
As a next step of migration to the pytest runner, this PR moves
responsibility for nodetool tests execution solely to the pytest.

Closes scylladb/scylladb#28348
2026-01-28 09:49:59 +02:00
Tomasz Grabiec
8e831a7b6d tablets: tablet_allocator.cc: Convert tabs to spaces 2026-01-28 01:32:01 +01:00
Tomasz Grabiec
9715965d0c tablets: load_balancer: Warn about incomplete stats once for all offending nodes
To reduce log spamming when all nodes are missing stats.
2026-01-28 01:32:01 +01:00
Tomasz Grabiec
ef0e9ad34a tablets: load_balancer: Improve node stats printout
Make it more concise:
- reduce precision for load to 6 fractional digits
- reduce precision for tablets/shard to 3 fractional digits
- print "dc1/rack1" instead of "dc=dc1 rack=rack1", like in other places
- print "rd=0 wr=0" instead of "stream_read=0 stream_write=0"

Example:

 load_balancer - Node 477569c0-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=170.666667 tablets=1 shards=12 tablets/shard=0.083 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47678711-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0
 load_balancer - Node 47832560-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0
2026-01-28 01:32:01 +01:00
Tomasz Grabiec
4a161bff2d tablets: load_balancer: Warn about imbalance only when there are no more active migrations
Otherwise, it may be only a temporary situation due to lack of
candidates, and may be unnecessarily alerting.

Also, print node stats to allow assessing how bad the situation is on
the spot. Those stats can hint to a cause of imbalance, if balancing
is per-DC and racks have different capacity.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
7228bd1502 tablets: load_balancer: Extract print_node_stats() 2026-01-28 01:32:00 +01:00
Tomasz Grabiec
615b86e88b tablet: load_balancer: Use empty() instead of size() where applicable 2026-01-28 01:32:00 +01:00
Tomasz Grabiec
12fdd205d6 tablets: Fix redundancy in migration_plan::empty() 2026-01-28 01:32:00 +01:00
Tomasz Grabiec
0d090aa47b tablets: Cache pointer to stats during plan-making
Saves on lookup cost, esp. for candidate evaluation. This showed up in
perf profile in the past.

Also, lays the ground for splitting stats per rack.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
f2b0146f0f tablets: load_balancer: Print rack in addition to DC when giving context
Load-balancing can be now per-rack instead of per-DC. So just printing
"in DC" is confusing. If we're balancing a rack, we should print which
rack is that.
2026-01-28 01:32:00 +01:00
Tomasz Grabiec
df32318f66 tablets: load_balancer: Make plan summary concise
Before:

  load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s)

After:

  load_balancer - Prepared plan: migrations: 1

We print only stats about elements which are present.
2026-01-28 01:32:00 +01:00
Emil Maskovsky
834961c308 db/view: add missing include for coroutine::all to fix build without precompiled headers
When building with `--disable-precompiled-header`, view.cc failed to
compile due to missing <seastar/coroutine/all.hh> include, which provides
`coroutine::all`.

The problem doesn't manifest when precompiled headers are used, which is
the default. So that's likely why it was missed by the CI.

Adding the explicit include fixes the build.

Fixes: scylladb/scylladb#28378
Ref: scylladb/scylladb#28093

No backport: This problem is only present in master.

Closes scylladb/scylladb#28379
2026-01-27 18:56:56 +01:00
Pavel Emelyanov
02af292869 Merge 'Introduce TTL and retries to address resolution' from Ernest Zaslavsky
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.

Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request

Fixes: CUSTOMER-96
Fixes: CUSTOMER-139

Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3

Closes scylladb/scylladb#27891

* github.com:scylladb/scylladb:
  connection_factory: includes cleanup
  dns_connection_factory: refine the move constructor
  connection_factory: retry on failure
  connection_factory: introduce TTL timer
  connection_factory: get rid of shared_future in dns_connection_factory
  connection_factory: extract connection logic into a member
  connection_factory: remove unnecessary `else`
  connection_factory: use all resolved DNS addresses
  s3_test: remove client double-close
2026-01-27 18:45:43 +03:00
Botond Dénes
7ac32097da docs/cql/ddl.rst: Tombstones GC: explain propagation delay
This parameter was not mentioned at all anywhere in the documentation.
Add an explanation of this parameter: why we need it, what is the
default and how it can be changed.

Closes scylladb/scylladb#28132
2026-01-27 16:05:52 +01:00
Tomasz Grabiec
32b336e062 tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()
Just a cleanup. After this, we don't have a new scope in the outmost
make_plan() just for injection handling.
2026-01-27 16:01:36 +01:00
Łukasz Paszkowski
3ef594f9eb test/storage: speed up out-of-space prevention tests by using smaller volumes
Tests in test_out_of_space_prevention.py spend a large fraction of
time creating a random “blob” file to cross the 0.8 critical disk
utilization threshold. With 100MB volumes this requires writing
~70–80MB of data, which is slow inside Docker/Podman-backed volumes.

Most tests only use ~11MB of data, so large volumes are unnecessary.
Reduce the test volume size to 20MB so the critical threshold is
reached at ~16MB and the blob file is much smaller.

This cuts ~5–6s per test.
2026-01-27 15:28:59 +01:00
Łukasz Paszkowski
0f86fc680c test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests
Set `--tablet-load-stats-refresh-interval-in-seconds=1` for this module’s
clusters applicable to all tests. This significantly reduces runtime
for the slowest cases:
- test_reject_split_compaction: 75.62s -> 23.04s
- test_split_compaction_not_triggered: 69.36s -> 22.98s
2026-01-27 15:28:59 +01:00
Pavel Emelyanov
87920d16d8 raft_group0_client: Dont export system keyspace
Now system_keyspace reference is used internally by the client code
itself, no need to encourage other services abuse it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:51:40 +03:00
Pavel Emelyanov
966119ce30 raft_group0_client: Add and use get_last_group0_state_id()
There are several places that want to get last state id and for that
they make raft_group0_client() export system_keyspace reference.

This patch adds a helper method to provide the needed ID.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:50:25 +03:00
Pavel Emelyanov
dded1feeb7 group0_state_machine: Call ensure_group0_sched() with data_dictionary
There's a validation for tables being used by group0 commands are marked
with the respective prop. For it the caller code needs to provide
database reference and it gets one from client -> system_keyspace chain.

There's more explicit way -- get the data_dictionary via proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:48:22 +03:00
Pavel Emelyanov
20a2b944df view_building_worker: Use its own system_keyspace& reference
Some code in the worker need to mess with system_keyspace&. While
there's a reference on it from the worker object, it gets one via
group0 -> group0_client, which is a bit an overkill.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:46:48 +03:00
Avi Kivity
16b56c2451 Merge 'Audit: avoid dynamic_cast on a hot path' from Marcin Maliszkiewicz
This patch set eliminates special audit info guard used before for batch statements
and simplifies audit::inspect function by returning quickly if audit is not needed.
It saves around 300 instructions on a request's hot path.

Related: https://github.com/scylladb/scylladb/issues/27941
Backport: no, not a bug

Closes scylladb/scylladb#28326

* github.com:scylladb/scylladb:
  audit: replace batch dynamic_cast with static_cast
  audit: eliminate dynamic_cast to batch_statement in inspect
  audit: cql: remove create_no_audit_info
  audit: add batch bool to audit_info class
2026-01-27 12:54:16 +02:00
Pavel Emelyanov
c61d855250 hints: Provide explicit scheduling group for hint_sender
Currently it grabs one from database, but it's not nice to use database
as config/sched-groups provider.

This PR passes the scheduling group to use for sending hints via manager
which, in turn, gets one from proxy via its config (proxy config already
carries configuration for hints manager). The group is initialized in
main.cc code and is set to the maintenance one (nowadays it's the same
as streaming group).

This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28358
2026-01-27 12:50:11 +02:00
Gleb Natapov
9daa109d2c test: get rid of consistent_cluster_management usage in test
consistent_cluster_management is deprecated since scylla-5.2 and no
longer used by Scylladb, so it should not be used by test either.

Closes scylladb/scylladb#28340
2026-01-27 11:31:30 +01:00
Avi Kivity
fa5ed619e8 Merge 'test: perf: add perf-cql-raw benchmarking tool' from Marcin Maliszkiewicz
The tool supports:
- auth or no auth modes
- simple read and write workloads
- connection pool or connection per request modes
- in-process or remote modes, remote may be usefull to assess tool's overhead or use it as bigger scale benchmark
- multi table mode
- non superuser mode

It could support in the future:
- TLS mode
- different workloads
- shard awareness

Example usage:
> build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 2
--cpus 0,1 \
--developer-mode 1 --workload read --duration 5 2> /dev/null

> Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0}
Pre-populated 10000 partitions
97438.42 tps (269.2 allocs/op,   1.1 logallocs/op,  35.2 tasks/op,  113325 insns/op,   80572 cycles/op,        0 errors)
102460.77 tps (261.1 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108222 insns/op,   75447 cycles/op,        0 errors)
95707.93 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108443 insns/op,   75320 cycles/op,        0 errors)
102487.87 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  107956 insns/op,   74320 cycles/op,        0 errors)
100409.60 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108337 insns/op,   75262 cycles/op,        0 errors)
throughput:
        mean=   99700.92 standard-deviation=3039.28
        median= 100409.60 median-absolute-deviation=2759.85
        maximum=102487.87 minimum=95707.93
instructions_per_op:
        mean=   109256.53 standard-deviation=2281.39
        median= 108337.37 median-absolute-deviation=1034.83
        maximum=113324.69 minimum=107955.97
cpu_cycles_per_op:
        mean=   76184.36 standard-deviation=2493.46
        median= 75320.20 median-absolute-deviation=922.09
        maximum=80572.19 minimum=74320.00

Backports: no, new tool

Closes scylladb/scylladb#25990

* github.com:scylladb/scylladb:
  test: perf: reuse stream id
  main: test: add future and abort_source to after_init_func
  test: perf: add option to stress multiple tables in perf-cql-raw
  test: perf: add perf-cql-raw benchmarking tool
  test: perf: move cut_arg helper func to common code
2026-01-27 12:23:25 +02:00
Yaron Kaikov
3f10f44232 .github/workflows/backport-pr-fixes-validation: support Atlassian URL format in backport PR fixes validation
Add support for matching full Atlassian JIRA URLs in the format
https://scylladb.atlassian.net/browse/SCYLLADB-400 in addition to
the bare JIRA key format (SCYLLADB-400).

This makes the validation more flexible by accepting both formats
that developers commonly use when referencing JIRA issues.

Fixes: https://github.com/scylladb/scylladb/issues/28373

Closes scylladb/scylladb#28374
2026-01-27 10:59:21 +02:00
Avi Kivity
f1c6094150 Merge 'Remove buffer_input_stream and limiting_input_stream from core code' from Pavel Emelyanov
These two streams mostly play together. The former provides an input_stream from read from in-memory temporary buffers, the latter wraps it to limit the size of provided temporary buffers. Both are used to test contiguous data consumer, also the buffer_input_stream has a caller in sstables reversing reader.

This PR removes the buffer_input_stream in favor of seastar memory_data_source, and moves the limiting_input_stream into test/lib.

Enanching testing code, not backporting

Closes scylladb/scylladb#28352

* github.com:scylladb/scylladb:
  code: Move limiting data source to test/lib
  util: Simplify limiting_data_source API
  util: Remove buffer_input_stream
  test: Use seastar::util::temporary_buffer_data_source in data consumer test
  sstables: Use seastar::util::as_input_stream() in mx reader
2026-01-26 22:05:59 +02:00
Raphael S. Carvalho
0e07c6556d test: Remove useless compaction group testing in database_test
This compaction group testing is useless because the machinery for it
to work was removed. This was useful in the early tablet days, where
we wanted to test compaction groups directly. Today groups are stressed
and tested on every tablet test.

I see a ~40% reduction time after this patch, since database_test is
one of the most (if not the most) time consuming in boost suite.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#28324
2026-01-26 19:16:27 +02:00
Marcin Maliszkiewicz
19af46d83a audit: replace batch dynamic_cast with static_cast
Since we know already it's a batch we can use static
cast now.
2026-01-26 18:14:38 +01:00
Anna Stuchlik
edc291961b doc: update the GPG keys
Update the keys in the installation instructions (linux packages).

Fixes https://github.com/scylladb/scylladb/issues/28330

Closes scylladb/scylladb#28357
2026-01-26 19:13:18 +02:00
Piotr Dulikowski
5d5e829107 Merge 'db: view: refactor usage and building of semaphore in create and drop views plus change continuation to co routine style' from Alex Dathskovsky
db: view: refactor semaphore usage in create/drop view paths
Refactor the construction and usage of semaphore units in the create and drop view flows.
The previous semaphore handling was hard to follow (as noted while working on https://github.com/scylladb/scylladb/pull/27929), so this change restructures unit creation and movement to follow a clearer and symmetric pattern across shards.

The semaphore usage model is now documented with a detailed in-code comment to make the intended behavior and invariants explicit.

As part of the refactor, the control flow is modernized by replacing continuation-based logic with coroutine-style code, improving readability and maintainability.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-250

backport: not required, this is a refactor

Closes scylladb/scylladb#28093

* github.com:scylladb/scylladb:
  db: view: extend try/catch scope in handle_create_view_local The try/catch region is extended to cover step functions and inner helpers, which may throw or abort during view creation. This change is safe because we are just swolowing more parts that may throw due to semaphore abortion or any other abortion request, and doesnt change the logic
  db: view: refine create/drop coroutine signatures Refactor the create/drop coroutine interfaces to accept parameters as const references, enabling a clearer workflow and safer data flow.
  db: view: switch from continuations to coroutines Refactor the flow and style of create and drop view to use coroutines instead of continuations. This simplifies the logic, improves readability, and makes the code easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit. this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
  db: view: introduce helper to acquire or reuse semaphore units Introduce a small helper that acquires semaphore units when needed or reuses units provided by the caller. This centralizes semaphore handling, simplifies the current logic, and enables refactoring the view create/drop path to a coroutine-based implementation instead of continuation-style code.
  db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0
2026-01-26 17:16:01 +01:00
Avi Kivity
32cc593558 Merge 'tools/scylla-sstable: introduce filter command' from Botond Dénes
Filter the content of sstable(s), including or excluding the specified partitions. Partitions can be provided on the command line via `--partition`, or in a file via `--partitions-file`. Produces one output sstable per input sstable -- if the filter selects at least one partition in the respective input sstable. Output sstables are placed in the path provided via `--oputput-dir`. Use `--merge` to filter all input sstables combined, producing one output sstable.

Fixes: #13076

New functionality, no backport.

Closes scylladb/scylladb#27836

* github.com:scylladb/scylladb:
  tools/scylla-sstable: introduce filter command
  tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir
  tools/scylla-sstable: make partition_set ordered
  tools/scylla-stable: remove unused boost/algorithm/string.hpp include
2026-01-26 16:32:38 +02:00
Ernest Zaslavsky
912c48a806 connection_factory: includes cleanup 2026-01-26 15:15:21 +02:00
Ernest Zaslavsky
3a31380b2c dns_connection_factory: refine the move constructor
Clean up the awkward move constructor that was declared in the header
but defaulted in a separate compilation unit, improving clarity and
consistency.
2026-01-26 15:15:15 +02:00
Ernest Zaslavsky
a05a4593a6 connection_factory: retry on failure
If connecting to a provided address throws, renew the address list and
retry once (and only once) before giving up.
2026-01-26 15:14:18 +02:00
Ernest Zaslavsky
6eb7dba352 connection_factory: introduce TTL timer
Add a TTL-based timer to connection_factory to automatically refresh
resolved host name addresses when they expire.
2026-01-26 15:11:49 +02:00
Nadav Har'El
25ff4bec2a Merge 'Refactor streams' from Radosław Cybulski
Refactor streams.cc - turn `.then` calls into coroutines.

Reduces amount of clutter, lambdas and referenced variables.

Fixes #28342

Closes scylladb/scylladb#28185

* github.com:scylladb/scylladb:
  alternator: refactor streams.cc
  alternator: refactor streams.cc
2026-01-26 14:31:15 +02:00
Andrei Chekun
3d3fabf5fb test.py: change the name of the test in failed directory
Generally square brackets are non allowed in URI, while pytest uses it
the test name to show that there were additional parameters for the same
test. When such a test fail it shows the directory correctly in Jenkins,
however attempt to download only this will fail, because of the square
brackets in URI. This change substitute the square brackets with round
brackets.

Closes scylladb/scylladb#28226
2026-01-26 13:29:45 +01:00
Łukasz Paszkowski
f06094aa95 topology_coordinator: add write_both_read_old_fallback_cleanup state
Yet another barrier-failure scenario exists in the `write_both_read_new`
state. When the barrier fails, the tablet is expected to transition
to `cleanup_target`, but because barrier execution is asynchronous,
the cleanup transition can be skipped entirely and the tablet may
continue forward instead.

Both `write_both_read_new` and `cleanup_target` modify read and write
selectors. In this situation, a barrier is required, and transitioning
directly between these states without one is unsafe.

Introduce an intermediate `write_both_read_old_fallback_cleanup`
state that modifies only a read selector and can be entered without
a barrier (there is no need to wait for all nodes to start using the
"new" read selector). From there, the tablet can proceed to `cleanup_target`,
where the required barriers are enforced.

This also avoids changing both selectors in a single step. A direct
transition from `write_both_read_new` to `cleanup_target` updates
both selectors at once, which can leave coordinators using the old
selector for writes and the new selector for reads, causing reads to
miss preceding writes.

By routing through the fallback state, selectors are updated in
order—read first, then write—preserving read-after-write correctness.
2026-01-26 13:14:37 +01:00
Łukasz Paszkowski
0a058e53c7 topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier
In both `streaming` and `rebuild_repair` stages, the read/write
selectors are unchanged compared to the preceding stage. Because
entry into these stages is already fenced by a barrier from
`write_both_read_old`, and the `cleanup_target` itself requires
barrier, rolling back directly to `cleanup_target` is safe without
an additional barrier.
2026-01-26 13:14:36 +01:00
Łukasz Paszkowski
b12f46babd topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old
A similar barrier-failure scenario exists in the `write_both_read_old`
state. If the barrier fails, the tablet is expected to transition to
`cleanup_target`, but due to the barrier being evaluated asynchronously
the cleanup path can be skipped and the tablet may continue forward
instead.

In `write_both_read_old`, we already switched group0 writes from old
to both, while the barrier may not have executed yet. As a result,
nodes can be at most one step apart (some still use old, others use
both).

Transitioning to `cleanup_target` reverts the write selector back to
old. Nodes still differ by at most one step (old vs both), so the
transition is safe without an additional barrier.

This prevents cleanup from being skipped while keeping selector semantics
and barrier guarantees intact.
2026-01-26 13:14:36 +01:00
Łukasz Paszkowski
7c331b7319 topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old
When a tablet is in `allow_write_both_read_old`, progressing normally
requires a barrier. If this first barrier fails, the tablet is supposed
to transition to `cleanup_target` on the next iteration:
```
case locator::tablet_transition_stage::allow_write_both_read_old:
    if (action_failed(tablet_state.barriers[trinfo.stage])) {
        if (check_excluded_replicas()) {
            transition_to_with_barrier(locator::tablet_transition_stage::cleanup_target);
            break;
        }
    }
    if (do_barrier()) {
        ...
    }
    break;
```
That transition itself requires a barrier, which is executed asynchronously.
Because the barrier runs in the background, the cleanup logic is skipped in
that iteration.

On the following iteration, `action_failed(barriers[stage])` no longer
returns true, since the node that caused the original barrier failure
has been excluded. The barrier is therefore observed as successful,
and the tablet incorrectly proceeds to the next stage instead of entering
`cleanup_target`.

Since `cleanup_target` does not modify read/write selectors, the transition
can be done safely without a barrier, simplifying the state machine and
ensuring cleanup is not skipped.

Without it, the tablet would still eventually reach `cleanup_target` via
`write_both_read_old` and `streaming`, but that path is unnecessary.
2026-01-26 13:14:36 +01:00
Anna Stuchlik
84281f900f doc: remove the troubleshooting section on upgrades from OSS
This commit removes a document originally created to troubleshoot
upgrades from Open Source to Enterprise.

Since we no longer support Open Source, this document is now redundant.

Fixes https://github.com/scylladb/scylladb/issues/28246

Closes scylladb/scylladb#28248
2026-01-26 14:02:53 +02:00
Anna Stuchlik
c25b770342 doc: add the version name to the Install pages
This is a follow-up to https://github.com/scylladb/scylladb/pull/28022
It adds the version name to more install pages.

Closes scylladb/scylladb#28289
2026-01-26 13:11:14 +02:00
Alex
954d18903e db: view: extend try/catch scope in handle_create_view_local
The try/catch region is extended to cover step functions and inner helpers,
which may throw or abort during view creation.
This change is safe because we are just swolowing more parts that may throw due to semaphore abortion
or any other abortion request, and doesnt change the logic
2026-01-26 13:10:37 +02:00
Alex
2c3ab8490c db: view: refine create/drop coroutine signatures
Refactor the create/drop coroutine interfaces to accept parameters as
const references, enabling a clearer workflow and safer data flow.
2026-01-26 13:10:34 +02:00
Alex
114f88cb9b db: view: switch from continuations to coroutines
Refactor the flow and style of create and drop view to use coroutines instead of continuations.
This simplifies the logic, improves readability, and makes the code
easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit.
this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
2026-01-26 13:08:24 +02:00
Alex
87c1c6f40f db: view: introduce helper to acquire or reuse semaphore units
Introduce a small helper that acquires semaphore units when needed or
reuses units provided by the caller.
This centralizes semaphore handling, simplifies the current logic, and
enables refactoring the view create/drop path to a coroutine-based
implementation instead of continuation-style code.
2026-01-26 13:03:26 +02:00
Avi Kivity
ec70cea2a1 test/cqlpy: restore LWT tests marked XFAIL for tablets
Commit 0156e97560 ("storage_proxy: cas: reject for
tablets-enabled tables") marked a bunch of LWT tests as
XFAIL with tablets enabled, pending resolution of #18066.
But since that event is now in the past, we undo the XFAIL
markings (or in some cases, use an any-keyspace fixture
instead of a vnodes-only fixture).

Ref #18066.

Closes scylladb/scylladb#28336
2026-01-26 12:27:19 +02:00
Pavel Emelyanov
77435206b9 code: Move limiting data source to test/lib
Only two tests use it now -- the limit-data-source-test iself and a test
that validates continuous_data_consumer template.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:49:42 +03:00
Pavel Emelyanov
111b376d0d util: Simplify limiting_data_source API
The source maintains "limit generator" -- a function that returns the
maximum size of bytes to return from the next buffer.

Currently all callers just return constant numbers from it. Passing a
function that returns non-constant one can, probably, be used for a
fuzzy test, but even the limiting-data-source-test itself doesn't do it,
so what's the point...

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:46:37 +03:00
Pavel Emelyanov
e297ed0b88 util: Remove buffer_input_stream
It's now unused. All the users had been patched to use seastar memory
data source implementation.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:46:10 +03:00
Pavel Emelyanov
4639681907 test: Use seastar::util::temporary_buffer_data_source in data consumer test
The test creates buffer_data_source_impl and wraps it with limiting data
source. The former data_source duplicates the functionality of the
existing seastar temporary_buffer_data_source.

This patch makes the test code use seastar facility. The
buffer_data_source_impl will be removed soon.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:44:33 +03:00
Pavel Emelyanov
2399bb8995 sstables: Use seastar::util::as_input_stream() in mx reader
Right now the code uses make_buffer_input_stream() helper that creates
an input stream with buffer_data_source_impl inside which, in turn,
provides the data_source_impl API over a single temporary_buffer.

Seastar has the very same facility, it's better to use it. Eventually
the buffer_data_source_impl will be removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-26 12:43:17 +03:00
Marcin Maliszkiewicz
6f32290756 audit: eliminate dynamic_cast to batch_statement in inspect
This is costly and not needed we can use a simple
bool flag for such check. It burns around 300 cpu
instructions on a hot request's path.
2026-01-26 10:18:38 +01:00
Marcin Maliszkiewicz
a93ad3838f audit: cql: remove create_no_audit_info
We don't need a special guard value, it's
only being filled for batch statements for
which we can simply ignore the value.

Not having special value allows us to return
fast when audit is not enabled.
2026-01-26 10:18:38 +01:00
Marcin Maliszkiewicz
02d59a0529 audit: add batch bool to audit_info class
In the following commit we'll use this field
instead of costly dynamic_cast when emitting
audit log.
2026-01-26 10:18:38 +01:00
Botond Dénes
57b2cd2c16 docs: inter-link incremental-repair and repair documents
The user can now discover the general explanatio of repair when reading
about incremental repair, useful if they don't know what repair is.
The user can now discover incremental repair while reading the generic
repair procedure document.
2026-01-26 09:55:54 +02:00
Botond Dénes
a84b1b8b78 docs: incremental-repair: fix curl example
Currently it is regular text, make it a code block so it is easier to
read and copy+paste.
2026-01-26 09:55:54 +02:00
Pavel Emelyanov
1796997ace Merge 'restore: Enable direct download of fully contained SSTables' from Ernest Zaslavsky
This PR refactors the streaming subsystem to support direct download of fully contained sstables. Instead of streaming these files, they are downloaded and attached directly to their corresponding tables. This approach reduces overhead, simplifies logic, and improves efficiency. Expected node scope restore performance improvement: ~4 times faster in best case scenario when all sstables are fully contained.

1. Add storage options field to sstable Introduce a data member to store storage options, enabling distinction between local and object storage types.
2. Add method to create component source Extend the storage interface with a public method to create a data_source for any sstable component.
3. Inline streamer instance creation Remove make_sstable_streamer and inline its usage to allow different sets of arguments at call sites.
4. Skip streaming empty sstable sets Avoid unnecessary streaming calls when the sstable set is empty.
5. Enable direct download of contained sstables Replace streaming of fully contained sstables with direct download, attaching them to their corresponding table.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-200
Refs: https://github.com/scylladb/scylladb/issues/23908

No need to backport as this code targets 2026.2 release (for tablet-aware restore)

Closes scylladb/scylladb#26834

* github.com:scylladb/scylladb:
  tests: reuse test_backup_broken_streaming
  streaming: enable direct download of contained sstables
  storage: add method to create component source
  streaming: keep sharded database reference on tablet_sstable_streamer
  streaming: skip streaming empty sstable sets
  streaming: inline streamer instance creation
  tests: fix incorrect backup/restore test flow
2026-01-26 10:22:34 +03:00
Ernest Zaslavsky
cb2aa85cf5 aws_error: handle all restartable nested exception types
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.

Closes scylladb/scylladb#28240
2026-01-26 10:19:57 +03:00
Avi Kivity
55422593a7 Merge 'test/lib: Fix bugs in boost_test_tree_lister.cc' from Dawid Mędrek
In this PR, we fix two bugs present in `boost_test_tree_lister` that
affected the output of `--list_json_content` added in
scylladb/scylladb@afde5f668a:

* The labels test units use were duplicated in the output.
* If a test suite or a test file didn't contain any tests, it wasn't
  listed in the output.

Refs scylladb/scylladb#25415

Backport: not needed. The code hasn't been used anywhere yet.

Closes scylladb/scylladb#28255

* github.com:scylladb/scylladb:
  test/lib/boost_test_tree_lister.cc: Record empty test suites
  test/lib/boost_test_tree_lister.cc: Deduplicate labels
2026-01-25 21:34:32 +02:00
Andrei Chekun
cc5ac75d73 test.py: remove deprecated skip_mode decorator
Finishing the deprecation of the skip_mode function in favor of
pytest.mark.skip_mode. This PR is only cleaning and migrating leftover tests
that are still used and old way of skip_mode.

Closes scylladb/scylladb#28299
2026-01-25 18:17:27 +02:00
Ernest Zaslavsky
66a33619da connection_factory: get rid of shared_future in dns_connection_factory
Move state management from dns_connection_factory into state class
itself to encapsulate its internal state and stop managing it from the
`dns_connection_factory`
2026-01-25 16:12:29 +02:00
Ernest Zaslavsky
5b3e513cba connection_factory: extract connection logic into a member
extract connection logic into a private member function to make it reusable
2026-01-25 15:42:48 +02:00
Ernest Zaslavsky
ce0c7b5896 connection_factory: remove unnecessary else 2026-01-25 15:42:48 +02:00
Ernest Zaslavsky
359d0b7a3e connection_factory: use all resolved DNS addresses
Improve dns_connection_factory to iterate over all resolved
addresses instead of using only the first one.
2026-01-25 15:42:48 +02:00
Ernest Zaslavsky
bd9d5ad75b s3_test: remove client double-close
`test_chunked_download_data_source_with_delays` was calling `close()` on a client twice, remove the unnecessary call
2026-01-25 15:42:48 +02:00
Alex
1aadedc596 db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0 2026-01-25 14:29:09 +02:00
Ernest Zaslavsky
70f5bc1a50 tests: reuse test_backup_broken_streaming
reuse the `test_backup_broken_streaming` test to check for direct
sstable download
2026-01-25 13:27:44 +02:00
Ernest Zaslavsky
13fb605edb streaming: enable direct download of contained sstables
Instead of streaming fully contained sstables, download them directly
and attach them to their corresponding table. This simplifies the
process and avoids unnecessary streaming overhead.
2026-01-25 13:27:44 +02:00
Yaron Kaikov
3dea15bc9d Update ScyllaDB version to: 2026.2.0-dev 2026-01-25 11:09:17 +02:00
Botond Dénes
f375288b58 tools/scylla-sstable: introduce filter command
Filter the content of sstable(s), including or excluding the specified
partitions. Partitions can be provided on the command line via
`--partition`, or in a file via `--partitions-file`.
Produces one output sstable per input sstable -- if the filter selects
at least one partition in the respective input sstable.
Output sstables are placed in the path provided via `--oputput-dir`.
Use `--merge` to filter all input sstables combined, producing one
output sstable.
2026-01-22 17:20:07 +02:00
Botond Dénes
21900c55eb tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir
This flag was added to operations which have an --output-dir
command-line arguments. These operations write sstables and need a
directory where to write them. Back in the numeric-generation world this
posed a problem: if the directory contained any sstable, generation
clash was almost guaranteed, because each scylla-sstable command
invokation would start output generations from 1. To avoid this, empty
output directory was a requirement, with the
--unsafe-accept-nonempty-output-dir allowing for a force-override.

Now in the timeuuid generation days, all this is not necessary anymore:
generations are unique, so it is not a problem if the output directory
already contains sstables: the probability of generation clash is almost
0. Even if it happens, the tool will just simply fail to write the new
sstable with the clashing generation.

Remove this historic relic of a flag and the related logic, it is just a
pointless nuissance nowadays.
2026-01-22 13:55:59 +02:00
Botond Dénes
a1ed73820f tools/scylla-sstable: make partition_set ordered
Next patch will want partitions to be ordered. Remove unused
partition_map type.
2026-01-22 13:55:59 +02:00
Botond Dénes
d228e6eda6 tools/scylla-stable: remove unused boost/algorithm/string.hpp include 2026-01-22 13:55:59 +02:00
Marcin Maliszkiewicz
32543625fc test: perf: reuse stream id
When one request is super slow and req/s high
in theory we have a collision on id, this patch
avoids that by reusing id and aborting when there
is no free one (unlikely).
2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz
7bf7ff785a main: test: add future and abort_source to after_init_func
This commit avoids leaking seastar::async future from two benchmark
tools: perf-alternator and perf-cql-raw. Additionally it adds
abort_source for fast and clean shutdown.
2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz
0d20300313 test: perf: add option to stress multiple tables in perf-cql-raw 2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz
a033b70704 test: perf: add perf-cql-raw benchmarking tool
The tool supports:
- auth or no auth modes
- simple read and write workloads
- connection pool or connection per request modes
- in-process or remote modes, remote may be usefull
to assess tool's overhead or use it as bigger scale benchmark
- uses prepared statements by default
- connection only mode, for testing storms

It could support in the future:
- TLS mode
- different workloads
- shard awareness

Example usage:
> build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 2
--cpus 0,1 \
--developer-mode 1 --workload read --duration 5 2> /dev/null

Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0}
Pre-populated 10000 partitions
97438.42 tps (269.2 allocs/op,   1.1 logallocs/op,  35.2 tasks/op,  113325 insns/op,   80572 cycles/op,        0 errors)
102460.77 tps (261.1 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108222 insns/op,   75447 cycles/op,        0 errors)
95707.93 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108443 insns/op,   75320 cycles/op,        0 errors)
102487.87 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  107956 insns/op,   74320 cycles/op,        0 errors)
100409.60 tps (261.0 allocs/op,   0.0 logallocs/op,  31.7 tasks/op,  108337 insns/op,   75262 cycles/op,        0 errors)
throughput:
	mean=   99700.92 standard-deviation=3039.28
	median= 100409.60 median-absolute-deviation=2759.85
	maximum=102487.87 minimum=95707.93
instructions_per_op:
	mean=   109256.53 standard-deviation=2281.39
	median= 108337.37 median-absolute-deviation=1034.83
	maximum=113324.69 minimum=107955.97
cpu_cycles_per_op:
	mean=   76184.36 standard-deviation=2493.46
	median= 75320.20 median-absolute-deviation=922.09
	maximum=80572.19 minimum=74320.00
2026-01-22 12:26:50 +01:00
Ernest Zaslavsky
51285785fa storage: add method to create component source
Extend the storage interface with a public method to create a
`data_source` for any sstable component.
2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
757e9d0f52 streaming: keep sharded database reference on tablet_sstable_streamer 2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
4ffa070715 streaming: skip streaming empty sstable sets
Avoid invoking streaming when the sstable set is empty. This prevents
unnecessary calls and simplifies the streaming logic.
2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
0fcd369ef2 streaming: inline streamer instance creation
Remove the `make_sstable_streamer` function and inline its usage where
needed. This change allows passing different sets of arguments
directly at the call sites.
2026-01-21 16:40:12 +02:00
Ernest Zaslavsky
32173ccfe1 tests: fix incorrect backup/restore test flow
When working directly with sstable components, the provided name should
be only the file name without path prefixes. Any prefixing tokens
belong in the 'prefix' argument, as the name suggests.
2026-01-21 16:40:12 +02:00
Dawid Mędrek
3b8bf85fbc test/lib/boost_test_tree_lister.cc: Record empty test suites
Before this commit, if a test file or a test suite didn't include
any actual test cases, it was ignored by `boost_test_tree_lister`.

However, this information is useful; for example, it allows us to tell
if the test file the user wants to run doesn't exist or simply doesn't
contain any tests. The kind of error we would return to them should be
different depending on which situation we're dealing with.

We start including those empty suites and files in the output of
`--list_json_content`.

---

Examples (with additional formatting):

* Consider the following test file, `test/boost/dummy_test.cc` [1]:

  ```
  BOOST_AUTO_TEST_SUITE(dummy_suite1)
  BOOST_AUTO_TEST_SUITE(dummy_suite2)
  BOOST_AUTO_TEST_SUITE_END()
  BOOST_AUTO_TEST_SUITE_END()

  BOOST_AUTO_TEST_SUITE(dummy_suite3)
  BOOST_AUTO_TEST_SUITE_END()
  ```

  Before this commit:

  ```
  $ ./build/debug/test/boost/dummy_test -- --list_json_content
  [{"file": "test/boost/dummy_test.cc", "content": {"suites": [], "tests": []}}]
  ```

  After this commit:

  ```
  $ ./build/debug/test/boost/dummy_test -- --list_json_content
  [{"file":"test/boost/dummy_test.cc", "content": {"suites": [
    {"name": "dummy_suite1", "suites": [
       {"name": "dummy_suite2", "suites": [], "tests": []}
    ], "tests": []},
    {"name": "dummy_suite3", "suites": [], "tests": []}
  ], "tests": []}}]
  ```

* Consider the same test file as in Example 1, but also assume it's compiled
  into `test/boost/combined_tests`.

  Before this commit:

  ```
  $ ./build/debug/test/boost/combined_tests -- --list_json_content | grep dummy
  $
  ```

  After this commit:

  ```
  $ ./build/debug/test/boost/combined_tests -- --list_json_content
  [..., {"file": "test/boost/dummy_test.cc", "content": {"suites": [
    {"name": "dummy_suite1", "suites":
      [{"name": "dummy_suite2", "suites": [], "tests": []}],
    "tests": []},
    {"name": "dummy_suite3", "suites": [], "tests": []}],
  "tests":[]}}, ...]
  ```

[1] Note that the example is simplified. As of now, it's not possible to use
    `--list_json_content` with a file without any Boost tests. That will
    result in the following error: `Test setup error: test tree is empty`.

Refs scylladb/scylladb#25415
2026-01-19 18:03:24 +01:00
Dawid Mędrek
1129599df8 test/lib/boost_test_tree_lister.cc: Deduplicate labels
In scylladb/scylladb@afde5f668a, we
implemented custom collection of information about Boost tests
in the repository. The solution boiled down to traversing through
the test tree via callbacks provided by Boost.Test and calling that
code from a global fixture. This way, the code is called automatically
by the framework.

Unfortunately, for an unknown reason, this leads to labels of test units
being duplicated. We haven't found the root cause yet and so we
deduplicate the labels manually.

---

Example (with additional formatting):

Consider the following test in the file `test/boost/dummy_test.cc`:

```
SEASTAR_TEST_CASE(dummy_case, *boost::unit_test::label("mylabel1")) {
    return make_ready_future();
}
```

Before this commit:

```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
  "tests": [{"name": "dummy_case", "labels": "mylabel1,mylabel1"}]}
}]
```

After this commit:

```
$ ./build/dev/test/boost/dummy_test -- --list_json_content
[{"file": "test/boost/dummy_test.cc", "content": {"suites": [],
  "tests": [{"name": "dummy_case", "labels": "mylabel1"}]}
}]
```

Refs scylladb/scylladb#25415
2026-01-19 18:01:14 +01:00
Marcin Maliszkiewicz
1318ff5a0d test: perf: move cut_arg helper func to common code
It will be reused later.
2026-01-19 14:33:10 +01:00
Radosław Cybulski
7b1060fad3 alternator: refactor streams.cc
Fix indentation levels from previous commit.
2026-01-13 12:04:13 +01:00
Radosław Cybulski
ef63fe400a alternator: refactor streams.cc
Refactor streams.cc - turn `.then` calls into coroutines.
Reduces amount of clutter, lambdas and referenced variables.
Note - the code is kept at the same indentation level to ease review,
the next commit will fix this.
2026-01-13 12:03:54 +01:00
210 changed files with 2369 additions and 1396 deletions

View File

@@ -18,7 +18,7 @@ jobs:
// Regular expression pattern to check for "Fixes" prefix
// Adjusted to dynamically insert the repository full name
const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|([A-Z]+-\\d+))`;
const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|(?:https://scylladb\\.atlassian\\.net/browse/)?([A-Z]+-\\d+))`;
const regex = new RegExp(pattern);
if (!regex.test(body)) {

View File

@@ -14,7 +14,8 @@ env:
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
SEASTAR_BAD_INCLUDE_OUTPUT_PATH: build/seastar-bad-include.log
permissions: {}
permissions:
contents: read
# cancel the in-progress run upon a repush
concurrency:

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2026.1.0-dev
VERSION=2026.2.0-dev
if test -f version
then

View File

@@ -491,7 +491,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
if (!opts.enabled()) {
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
co_return rjson::print(std::move(ret));
}
// TODO: label
@@ -502,123 +502,121 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
// filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
return _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners }).then([db, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)] (std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {
std::map<db_clock::time_point, cdc::streams_version> topologies = co_await _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners });
auto e = topologies.end();
auto prev = e;
auto shards = rjson::empty_array();
auto e = topologies.end();
auto prev = e;
auto shards = rjson::empty_array();
std::optional<shard_id> last;
std::optional<shard_id> last;
auto i = topologies.begin();
// if we're a paged query, skip to the generation where we left of.
if (shard_start) {
i = topologies.find(shard_start->time);
}
auto i = topologies.begin();
// if we're a paged query, skip to the generation where we left of.
if (shard_start) {
i = topologies.find(shard_start->time);
}
// for parent-child stuff we need id:s to be sorted by token
// (see explanation above) since we want to find closest
// token boundary when determining parent.
// #7346 - we processed and searched children/parents in
// stored order, which is not necessarily token order,
// so the finding of "closest" token boundary (using upper bound)
// could give somewhat weird results.
static auto token_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return id1.token() < id2.token();
};
// for parent-child stuff we need id:s to be sorted by token
// (see explanation above) since we want to find closest
// token boundary when determining parent.
// #7346 - we processed and searched children/parents in
// stored order, which is not necessarily token order,
// so the finding of "closest" token boundary (using upper bound)
// could give somewhat weird results.
static auto token_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return id1.token() < id2.token();
};
// #7409 - shards must be returned in lexicographical order,
// normal bytes compare is string_traits<int8_t>::compare.
// thus bytes 0x8000 is less than 0x0000. By doing unsigned
// compare instead we inadvertently will sort in string lexical.
static auto id_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return compare_unsigned(id1.to_bytes(), id2.to_bytes()) < 0;
};
// need a prev even if we are skipping stuff
if (i != topologies.begin()) {
prev = std::prev(i);
}
for (; limit > 0 && i != e; prev = i, ++i) {
auto& [ts, sv] = *i;
last = std::nullopt;
auto lo = sv.streams.begin();
auto end = sv.streams.end();
// #7409 - shards must be returned in lexicographical order,
// normal bytes compare is string_traits<int8_t>::compare.
// thus bytes 0x8000 is less than 0x0000. By doing unsigned
// compare instead we inadvertently will sort in string lexical.
static auto id_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return compare_unsigned(id1.to_bytes(), id2.to_bytes()) < 0;
};
std::sort(lo, end, id_cmp);
// need a prev even if we are skipping stuff
if (i != topologies.begin()) {
prev = std::prev(i);
if (shard_start) {
// find next shard position
lo = std::upper_bound(lo, end, shard_start->id, id_cmp);
shard_start = std::nullopt;
}
for (; limit > 0 && i != e; prev = i, ++i) {
auto& [ts, sv] = *i;
if (lo != end && prev != e) {
// We want older stuff sorted in token order so we can find matching
// token range when determining parent shard.
std::stable_sort(prev->second.streams.begin(), prev->second.streams.end(), token_cmp);
}
auto expired = [&]() -> std::optional<db_clock::time_point> {
auto j = std::next(i);
if (j == e) {
return std::nullopt;
}
// add this so we sort of match potential
// sequence numbers in get_records result.
return j->first + confidence_interval(db);
}();
while (lo != end) {
auto& id = *lo++;
auto shard = rjson::empty_object();
if (prev != e) {
auto& pids = prev->second.streams;
auto pid = std::upper_bound(pids.begin(), pids.end(), id.token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
if (pid != pids.begin()) {
pid = std::prev(pid);
}
if (pid != pids.end()) {
rjson::add(shard, "ParentShardId", shard_id(prev->first, *pid));
}
}
last.emplace(ts, id);
rjson::add(shard, "ShardId", *last);
auto range = rjson::empty_object();
rjson::add(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch())));
if (expired) {
rjson::add(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch())));
}
rjson::add(shard, "SequenceNumberRange", std::move(range));
rjson::push_back(shards, std::move(shard));
if (--limit == 0) {
break;
}
last = std::nullopt;
auto lo = sv.streams.begin();
auto end = sv.streams.end();
// #7409 - shards must be returned in lexicographical order,
std::sort(lo, end, id_cmp);
if (shard_start) {
// find next shard position
lo = std::upper_bound(lo, end, shard_start->id, id_cmp);
shard_start = std::nullopt;
}
if (lo != end && prev != e) {
// We want older stuff sorted in token order so we can find matching
// token range when determining parent shard.
std::stable_sort(prev->second.streams.begin(), prev->second.streams.end(), token_cmp);
}
auto expired = [&]() -> std::optional<db_clock::time_point> {
auto j = std::next(i);
if (j == e) {
return std::nullopt;
}
// add this so we sort of match potential
// sequence numbers in get_records result.
return j->first + confidence_interval(db);
}();
while (lo != end) {
auto& id = *lo++;
auto shard = rjson::empty_object();
if (prev != e) {
auto& pids = prev->second.streams;
auto pid = std::upper_bound(pids.begin(), pids.end(), id.token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
if (pid != pids.begin()) {
pid = std::prev(pid);
}
if (pid != pids.end()) {
rjson::add(shard, "ParentShardId", shard_id(prev->first, *pid));
}
}
last.emplace(ts, id);
rjson::add(shard, "ShardId", *last);
auto range = rjson::empty_object();
rjson::add(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch())));
if (expired) {
rjson::add(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch())));
}
rjson::add(shard, "SequenceNumberRange", std::move(range));
rjson::push_back(shards, std::move(shard));
if (--limit == 0) {
break;
}
last = std::nullopt;
}
}
}
if (last) {
rjson::add(stream_desc, "LastEvaluatedShardId", *last);
}
if (last) {
rjson::add(stream_desc, "LastEvaluatedShardId", *last);
}
rjson::add(stream_desc, "Shards", std::move(shards));
rjson::add(ret, "StreamDescription", std::move(stream_desc));
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
rjson::add(stream_desc, "Shards", std::move(shards));
rjson::add(ret, "StreamDescription", std::move(stream_desc));
co_return rjson::print(std::move(ret));
}
enum class shard_iterator_type {
@@ -898,172 +896,169 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
query::tombstone_limit(_proxy.get_tombstone_limit()), query::row_limit(limit * mul));
co_return co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state)).then(
[this, schema, partition_slice = std::move(partition_slice), selection = std::move(selection), start_time = std::move(start_time), limit, key_names = std::move(key_names), attr_names = std::move(attr_names), type, iter, high_ts] (service::storage_proxy::coordinator_query_result qr) mutable {
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
service::storage_proxy::coordinator_query_result qr = co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state));
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();
auto records = rjson::empty_array();
auto result_set = builder.build();
auto records = rjson::empty_array();
auto& metadata = result_set->get_metadata();
auto& metadata = result_set->get_metadata();
auto op_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == op_column_name;
})
);
auto ts_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == timestamp_column_name;
})
);
auto eor_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == eor_column_name;
})
);
auto op_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == op_column_name;
})
);
auto ts_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == timestamp_column_name;
})
);
auto eor_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == eor_column_name;
})
);
std::optional<utils::UUID> timestamp;
auto dynamodb = rjson::empty_object();
auto record = rjson::empty_object();
const auto dc_name = _proxy.get_token_metadata_ptr()->get_topology().get_datacenter();
std::optional<utils::UUID> timestamp;
auto dynamodb = rjson::empty_object();
auto record = rjson::empty_object();
const auto dc_name = _proxy.get_token_metadata_ptr()->get_topology().get_datacenter();
using op_utype = std::underlying_type_t<cdc::operation>;
using op_utype = std::underlying_type_t<cdc::operation>;
auto maybe_add_record = [&] {
if (!dynamodb.ObjectEmpty()) {
rjson::add(record, "dynamodb", std::move(dynamodb));
dynamodb = rjson::empty_object();
}
if (!record.ObjectEmpty()) {
rjson::add(record, "awsRegion", rjson::from_string(dc_name));
rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));
rjson::add(record, "eventSource", "scylladb:alternator");
rjson::add(record, "eventVersion", "1.1");
rjson::push_back(records, std::move(record));
record = rjson::empty_object();
--limit;
}
};
auto maybe_add_record = [&] {
if (!dynamodb.ObjectEmpty()) {
rjson::add(record, "dynamodb", std::move(dynamodb));
dynamodb = rjson::empty_object();
}
if (!record.ObjectEmpty()) {
rjson::add(record, "awsRegion", rjson::from_string(dc_name));
rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));
rjson::add(record, "eventSource", "scylladb:alternator");
rjson::add(record, "eventVersion", "1.1");
rjson::push_back(records, std::move(record));
record = rjson::empty_object();
--limit;
}
};
for (auto& row : result_set->rows()) {
auto op = static_cast<cdc::operation>(value_cast<op_utype>(data_type_for<op_utype>()->deserialize(*row[op_index])));
auto ts = value_cast<utils::UUID>(data_type_for<utils::UUID>()->deserialize(*row[ts_index]));
auto eor = row[eor_index].has_value() ? value_cast<bool>(boolean_type->deserialize(*row[eor_index])) : false;
for (auto& row : result_set->rows()) {
auto op = static_cast<cdc::operation>(value_cast<op_utype>(data_type_for<op_utype>()->deserialize(*row[op_index])));
auto ts = value_cast<utils::UUID>(data_type_for<utils::UUID>()->deserialize(*row[ts_index]));
auto eor = row[eor_index].has_value() ? value_cast<bool>(boolean_type->deserialize(*row[eor_index])) : false;
if (!dynamodb.HasMember("Keys")) {
auto keys = rjson::empty_object();
describe_single_item(*selection, row, key_names, keys);
rjson::add(dynamodb, "Keys", std::move(keys));
rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));
rjson::add(dynamodb, "StreamViewType", type);
// TODO: SizeBytes
}
/**
* We merge rows with same timestamp into a single event.
* This is pretty much needed, because a CDC row typically
* encodes ~half the info of an alternator write.
*
* A big, big downside to how alternator records are written
* (i.e. CQL), is that the distinction between INSERT and UPDATE
* is somewhat lost/unmappable to actual eventName.
* A write (currently) always looks like an insert+modify
* regardless whether we wrote existing record or not.
*
* Maybe RMW ops could be done slightly differently so
* we can distinguish them here...
*
* For now, all writes will become MODIFY.
*
* Note: we do not check the current pre/post
* flags on CDC log, instead we use data to
* drive what is returned. This is (afaict)
* consistent with dynamo streams
*/
switch (op) {
case cdc::operation::pre_image:
case cdc::operation::post_image:
{
auto item = rjson::empty_object();
describe_single_item(*selection, row, attr_names, item, nullptr, true);
describe_single_item(*selection, row, key_names, item);
rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
break;
}
case cdc::operation::update:
rjson::add(record, "eventName", "MODIFY");
break;
case cdc::operation::insert:
rjson::add(record, "eventName", "INSERT");
break;
case cdc::operation::service_row_delete:
case cdc::operation::service_partition_delete:
{
auto user_identity = rjson::empty_object();
rjson::add(user_identity, "Type", "Service");
rjson::add(user_identity, "PrincipalId", "dynamodb.amazonaws.com");
rjson::add(record, "userIdentity", std::move(user_identity));
rjson::add(record, "eventName", "REMOVE");
break;
}
default:
rjson::add(record, "eventName", "REMOVE");
break;
}
if (eor) {
maybe_add_record();
timestamp = ts;
if (limit == 0) {
break;
}
}
if (!dynamodb.HasMember("Keys")) {
auto keys = rjson::empty_object();
describe_single_item(*selection, row, key_names, keys);
rjson::add(dynamodb, "Keys", std::move(keys));
rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());
rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));
rjson::add(dynamodb, "StreamViewType", type);
// TODO: SizeBytes
}
auto ret = rjson::empty_object();
auto nrecords = records.Size();
rjson::add(ret, "Records", std::move(records));
if (nrecords != 0) {
// #9642. Set next iterators threshold to > last
shard_iterator next_iter(iter.table, iter.shard, *timestamp, false);
// Note that here we unconditionally return NextShardIterator,
// without checking if maybe we reached the end-of-shard. If the
// shard did end, then the next read will have nrecords == 0 and
// will notice end end of shard and not return NextShardIterator.
rjson::add(ret, "NextShardIterator", next_iter);
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
/**
* We merge rows with same timestamp into a single event.
* This is pretty much needed, because a CDC row typically
* encodes ~half the info of an alternator write.
*
* A big, big downside to how alternator records are written
* (i.e. CQL), is that the distinction between INSERT and UPDATE
* is somewhat lost/unmappable to actual eventName.
* A write (currently) always looks like an insert+modify
* regardless whether we wrote existing record or not.
*
* Maybe RMW ops could be done slightly differently so
* we can distinguish them here...
*
* For now, all writes will become MODIFY.
*
* Note: we do not check the current pre/post
* flags on CDC log, instead we use data to
* drive what is returned. This is (afaict)
* consistent with dynamo streams
*/
switch (op) {
case cdc::operation::pre_image:
case cdc::operation::post_image:
{
auto item = rjson::empty_object();
describe_single_item(*selection, row, attr_names, item, nullptr, true);
describe_single_item(*selection, row, key_names, item);
rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));
break;
}
// ugh. figure out if we are and end-of-shard
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
return _sdks.cdc_current_generation_timestamp({ normal_token_owners }).then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {
auto& shard = iter.shard;
if (shard.time < ts && ts < high_ts) {
// The DynamoDB documentation states that when a shard is
// closed, reading it until the end has NextShardIterator
// "set to null". Our test test_streams_closed_read
// confirms that by "null" they meant not set at all.
} else {
// We could have return the same iterator again, but we did
// a search from it until high_ts and found nothing, so we
// can also start the next search from high_ts.
// TODO: but why? It's simpler just to leave the iterator be.
shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);
rjson::add(ret, "NextShardIterator", iter);
case cdc::operation::update:
rjson::add(record, "eventName", "MODIFY");
break;
case cdc::operation::insert:
rjson::add(record, "eventName", "INSERT");
break;
case cdc::operation::service_row_delete:
case cdc::operation::service_partition_delete:
{
auto user_identity = rjson::empty_object();
rjson::add(user_identity, "Type", "Service");
rjson::add(user_identity, "PrincipalId", "dynamodb.amazonaws.com");
rjson::add(record, "userIdentity", std::move(user_identity));
rjson::add(record, "eventName", "REMOVE");
break;
}
default:
rjson::add(record, "eventName", "REMOVE");
break;
}
if (eor) {
maybe_add_record();
timestamp = ts;
if (limit == 0) {
break;
}
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
if (is_big(ret)) {
return make_ready_future<executor::request_return_type>(make_streamed(std::move(ret)));
}
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
});
});
}
}
auto ret = rjson::empty_object();
auto nrecords = records.Size();
rjson::add(ret, "Records", std::move(records));
if (nrecords != 0) {
// #9642. Set next iterators threshold to > last
shard_iterator next_iter(iter.table, iter.shard, *timestamp, false);
// Note that here we unconditionally return NextShardIterator,
// without checking if maybe we reached the end-of-shard. If the
// shard did end, then the next read will have nrecords == 0 and
// will notice end end of shard and not return NextShardIterator.
rjson::add(ret, "NextShardIterator", next_iter);
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
co_return rjson::print(std::move(ret));
}
// ugh. figure out if we are and end-of-shard
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
db_clock::time_point ts = co_await _sdks.cdc_current_generation_timestamp({ normal_token_owners });
auto& shard = iter.shard;
if (shard.time < ts && ts < high_ts) {
// The DynamoDB documentation states that when a shard is
// closed, reading it until the end has NextShardIterator
// "set to null". Our test test_streams_closed_read
// confirms that by "null" they meant not set at all.
} else {
// We could have return the same iterator again, but we did
// a search from it until high_ts and found nothing, so we
// can also start the next search from high_ts.
// TODO: but why? It's simpler just to leave the iterator be.
shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);
rjson::add(ret, "NextShardIterator", iter);
}
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
if (is_big(ret)) {
co_return make_streamed(std::move(ret));
}
co_return rjson::print(std::move(ret));
}
bool executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {

View File

@@ -209,15 +209,11 @@ future<> audit::stop_audit() {
});
}
audit_info_ptr audit::create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table) {
audit_info_ptr audit::create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table, bool batch) {
if (!audit_instance().local_is_initialized()) {
return nullptr;
}
return std::make_unique<audit_info>(cat, keyspace, table);
}
audit_info_ptr audit::create_no_audit_info() {
return audit_info_ptr();
return std::make_unique<audit_info>(cat, keyspace, table, batch);
}
future<> audit::start(const db::config& cfg) {
@@ -267,18 +263,21 @@ future<> audit::log_login(const sstring& username, socket_address client_ip, boo
}
future<> inspect(shared_ptr<cql3::cql_statement> statement, service::query_state& query_state, const cql3::query_options& options, bool error) {
cql3::statements::batch_statement* batch = dynamic_cast<cql3::statements::batch_statement*>(statement.get());
if (batch != nullptr) {
auto audit_info = statement->get_audit_info();
if (!audit_info) {
return make_ready_future<>();
}
if (audit_info->batch()) {
cql3::statements::batch_statement* batch = static_cast<cql3::statements::batch_statement*>(statement.get());
return do_for_each(batch->statements().begin(), batch->statements().end(), [&query_state, &options, error] (auto&& m) {
return inspect(m.statement, query_state, options, error);
});
} else {
auto audit_info = statement->get_audit_info();
if (bool(audit_info) && audit::local_audit_instance().should_log(audit_info)) {
if (audit::local_audit_instance().should_log(audit_info)) {
return audit::local_audit_instance().log(audit_info, query_state, options, error);
}
return make_ready_future<>();
}
return make_ready_future<>();
}
future<> inspect_login(const sstring& username, socket_address client_ip, bool error) {

View File

@@ -75,11 +75,13 @@ class audit_info final {
sstring _keyspace;
sstring _table;
sstring _query;
bool _batch;
public:
audit_info(statement_category cat, sstring keyspace, sstring table)
audit_info(statement_category cat, sstring keyspace, sstring table, bool batch)
: _category(cat)
, _keyspace(std::move(keyspace))
, _table(std::move(table))
, _batch(batch)
{ }
void set_query_string(const std::string_view& query_string) {
_query = sstring(query_string);
@@ -89,6 +91,7 @@ public:
const sstring& query() const { return _query; }
sstring category_string() const;
statement_category category() const { return _category; }
bool batch() const { return _batch; }
};
using audit_info_ptr = std::unique_ptr<audit_info>;
@@ -126,8 +129,7 @@ public:
}
static future<> start_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm);
static future<> stop_audit();
static audit_info_ptr create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table);
static audit_info_ptr create_no_audit_info();
static audit_info_ptr create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table, bool batch = false);
audit(locator::shared_token_metadata& stm,
cql3::query_processor& qp,
service::migration_manager& mm,

View File

@@ -329,13 +329,13 @@ public:
auto it = candidates.begin();
auto& first_sstable = *it;
it++;
dht::token first = first_sstable->get_first_decorated_key()._token;
dht::token last = first_sstable->get_last_decorated_key()._token;
dht::token first = first_sstable->get_first_decorated_key().token();
dht::token last = first_sstable->get_last_decorated_key().token();
while (it != candidates.end()) {
auto& candidate_sstable = *it;
it++;
dht::token first_candidate = candidate_sstable->get_first_decorated_key()._token;
dht::token last_candidate = candidate_sstable->get_last_decorated_key()._token;
dht::token first_candidate = candidate_sstable->get_first_decorated_key().token();
dht::token last_candidate = candidate_sstable->get_last_decorated_key().token();
first = first <= first_candidate? first : first_candidate;
last = last >= last_candidate ? last : last_candidate;
@@ -345,7 +345,7 @@ public:
template <typename T>
static std::vector<sstables::shared_sstable> overlapping(const schema& s, const sstables::shared_sstable& sstable, const T& others) {
return overlapping(s, sstable->get_first_decorated_key()._token, sstable->get_last_decorated_key()._token, others);
return overlapping(s, sstable->get_first_decorated_key().token(), sstable->get_last_decorated_key().token(), others);
}
/**
@@ -359,7 +359,7 @@ public:
auto range = ::wrapping_interval<dht::token>::make(start, end);
for (auto& candidate : sstables) {
auto candidate_range = ::wrapping_interval<dht::token>::make(candidate->get_first_decorated_key()._token, candidate->get_last_decorated_key()._token);
auto candidate_range = ::wrapping_interval<dht::token>::make(candidate->get_first_decorated_key().token(), candidate->get_last_decorated_key().token());
if (range.overlaps(candidate_range, dht::token_comparator())) {
overlapped.push_back(candidate);

View File

@@ -817,6 +817,9 @@ arg_parser.add_argument('--c-compiler', action='store', dest='cc', default='clan
help='C compiler path')
arg_parser.add_argument('--compiler-cache', action='store', dest='compiler_cache', default='auto',
help='Compiler cache to use: auto (default, prefers sccache), sccache, ccache, none, or a path to a binary')
# Workaround for https://github.com/mozilla/sccache/issues/2575
arg_parser.add_argument('--sccache-rust', action=argparse.BooleanOptionalAction, default=False,
help='Use sccache for rust code (if sccache is selected as compiler cache). Doesn\'t work with distributed builds.')
add_tristate(arg_parser, name='dpdk', dest='dpdk', default=False,
help='Use dpdk (from seastar dpdk sources)')
arg_parser.add_argument('--dpdk-target', action='store', dest='dpdk_target', default='',
@@ -947,8 +950,7 @@ scylla_core = (['message/messaging_service.cc',
'utils/crypt_sha512.cc',
'utils/logalloc.cc',
'utils/large_bitset.cc',
'utils/buffer_input_stream.cc',
'utils/limiting_data_source.cc',
'test/lib/limiting_data_source.cc',
'utils/updateable_value.cc',
'message/dictionary_service.cc',
'utils/directories.cc',
@@ -1557,6 +1559,7 @@ scylla_perfs = ['test/perf/perf_alternator.cc',
'test/perf/perf_fast_forward.cc',
'test/perf/perf_row_cache_update.cc',
'test/perf/perf_simple_query.cc',
'test/perf/perf_cql_raw.cc',
'test/perf/perf_sstable.cc',
'test/perf/perf_tablets.cc',
'test/perf/tablet_load_balancing.cc',
@@ -2405,7 +2408,7 @@ def write_build_file(f,
# If compiler cache is available, prefix the compiler with it
cxx_with_cache = f'{compiler_cache} {args.cxx}' if compiler_cache else args.cxx
# For Rust, sccache is used via RUSTC_WRAPPER environment variable
rustc_wrapper = f'RUSTC_WRAPPER={compiler_cache} ' if compiler_cache and 'sccache' in compiler_cache else ''
rustc_wrapper = f'RUSTC_WRAPPER={compiler_cache} ' if compiler_cache and 'sccache' in compiler_cache and args.sccache_rust else ''
f.write(textwrap.dedent('''\
configure_args = {configure_args}
builddir = {outdir}
@@ -3149,7 +3152,7 @@ def configure_using_cmake(args):
settings['CMAKE_CXX_COMPILER_LAUNCHER'] = compiler_cache
settings['CMAKE_C_COMPILER_LAUNCHER'] = compiler_cache
# For Rust, sccache is used via RUSTC_WRAPPER
if 'sccache' in compiler_cache:
if 'sccache' in compiler_cache and args.sccache_rust:
settings['Scylla_RUSTC_WRAPPER'] = compiler_cache
if args.date_stamp:

View File

@@ -50,8 +50,8 @@ public:
protected:
virtual audit::statement_category category() const override;
virtual audit::audit_info_ptr audit_info() const override {
// We don't audit batch statements. Instead we audit statements that are inside the batch.
return audit::audit::create_no_audit_info();
constexpr bool batch = true;
return audit::audit::create_audit_info(category(), sstring(), sstring(), batch);
}
};

View File

@@ -158,7 +158,7 @@ void hint_endpoint_manager::cancel_draining() noexcept {
_sender.cancel_draining();
}
hint_endpoint_manager::hint_endpoint_manager(const endpoint_id& key, fs::path hint_directory, manager& shard_manager)
hint_endpoint_manager::hint_endpoint_manager(const endpoint_id& key, fs::path hint_directory, manager& shard_manager, scheduling_group send_sg)
: _key(key)
, _shard_manager(shard_manager)
, _store_gate("hint_endpoint_manager")
@@ -169,7 +169,7 @@ hint_endpoint_manager::hint_endpoint_manager(const endpoint_id& key, fs::path hi
// Approximate the position of the last written hint by using the same formula as for segment id calculation in commitlog
// TODO: Should this logic be deduplicated with what is in the commitlog?
, _last_written_rp(this_shard_id(), std::chrono::duration_cast<std::chrono::milliseconds>(runtime::get_boot_time().time_since_epoch()).count())
, _sender(*this, _shard_manager.local_storage_proxy(), _shard_manager.local_db(), _shard_manager.local_gossiper())
, _sender(*this, _shard_manager.local_storage_proxy(), _shard_manager.local_db(), _shard_manager.local_gossiper(), send_sg)
{}
hint_endpoint_manager::hint_endpoint_manager(hint_endpoint_manager&& other)

View File

@@ -63,7 +63,7 @@ private:
hint_sender _sender;
public:
hint_endpoint_manager(const endpoint_id& key, std::filesystem::path hint_directory, manager& shard_manager);
hint_endpoint_manager(const endpoint_id& key, std::filesystem::path hint_directory, manager& shard_manager, scheduling_group send_sg);
hint_endpoint_manager(hint_endpoint_manager&&);
~hint_endpoint_manager();

View File

@@ -122,7 +122,7 @@ const column_mapping& hint_sender::get_column_mapping(lw_shared_ptr<send_one_fil
return cm_it->second;
}
hint_sender::hint_sender(hint_endpoint_manager& parent, service::storage_proxy& local_storage_proxy,replica::database& local_db, const gms::gossiper& local_gossiper) noexcept
hint_sender::hint_sender(hint_endpoint_manager& parent, service::storage_proxy& local_storage_proxy,replica::database& local_db, const gms::gossiper& local_gossiper, scheduling_group sg) noexcept
: _stopped(make_ready_future<>())
, _ep_key(parent.end_point_key())
, _ep_manager(parent)
@@ -130,7 +130,7 @@ hint_sender::hint_sender(hint_endpoint_manager& parent, service::storage_proxy&
, _resource_manager(_shard_manager._resource_manager)
, _proxy(local_storage_proxy)
, _db(local_db)
, _hints_cpu_sched_group(_db.get_streaming_scheduling_group())
, _hints_cpu_sched_group(sg)
, _gossiper(local_gossiper)
, _file_update_mutex(_ep_manager.file_update_mutex())
{}

View File

@@ -120,7 +120,7 @@ private:
std::multimap<db::replay_position, lw_shared_ptr<std::optional<promise<>>>> _replay_waiters;
public:
hint_sender(hint_endpoint_manager& parent, service::storage_proxy& local_storage_proxy, replica::database& local_db, const gms::gossiper& local_gossiper) noexcept;
hint_sender(hint_endpoint_manager& parent, service::storage_proxy& local_storage_proxy, replica::database& local_db, const gms::gossiper& local_gossiper, scheduling_group sg) noexcept;
~hint_sender();
/// \brief A constructor that should be called from the copy/move-constructor of hint_endpoint_manager.

View File

@@ -142,7 +142,7 @@ future<> directory_initializer::ensure_rebalanced() {
}
manager::manager(service::storage_proxy& proxy, sstring hints_directory, host_filter filter, int64_t max_hint_window_ms,
resource_manager& res_manager, sharded<replica::database>& db)
resource_manager& res_manager, sharded<replica::database>& db, scheduling_group sg)
: _hints_dir(fs::path(hints_directory) / fmt::to_string(this_shard_id()))
, _host_filter(std::move(filter))
, _proxy(proxy)
@@ -150,6 +150,7 @@ manager::manager(service::storage_proxy& proxy, sstring hints_directory, host_fi
, _local_db(db.local())
, _draining_eps_gate(seastar::format("hints::manager::{}", _hints_dir.native()))
, _resource_manager(res_manager)
, _hints_sending_sched_group(sg)
{
if (utils::get_local_injector().enter("decrease_hints_flush_period")) {
hints_flush_period = std::chrono::seconds{1};
@@ -415,7 +416,7 @@ hint_endpoint_manager& manager::get_ep_manager(const endpoint_id& host_id, const
try {
std::filesystem::path hint_directory = hints_dir() / (_uses_host_id ? fmt::to_string(host_id) : fmt::to_string(ip));
auto [it, _] = _ep_managers.emplace(host_id, hint_endpoint_manager{host_id, std::move(hint_directory), *this});
auto [it, _] = _ep_managers.emplace(host_id, hint_endpoint_manager{host_id, std::move(hint_directory), *this, _hints_sending_sched_group});
hint_endpoint_manager& ep_man = it->second;
manager_logger.trace("Created an endpoint manager for {}", host_id);

View File

@@ -133,6 +133,7 @@ private:
hint_stats _stats;
seastar::metrics::metric_groups _metrics;
scheduling_group _hints_sending_sched_group;
// We need to keep a variant here. Before migrating hinted handoff to using host ID, hint directories will
// still represent IP addresses. But after the migration, they will start representing host IDs.
@@ -155,7 +156,7 @@ private:
public:
manager(service::storage_proxy& proxy, sstring hints_directory, host_filter filter,
int64_t max_hint_window_ms, resource_manager& res_manager, sharded<replica::database>& db);
int64_t max_hint_window_ms, resource_manager& res_manager, sharded<replica::database>& db, scheduling_group sg);
manager(const manager&) = delete;
manager& operator=(const manager&) = delete;

View File

@@ -23,6 +23,7 @@
#include <seastar/core/future-util.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/all.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <flat_map>
@@ -65,6 +66,7 @@
#include "mutation/timestamp.hh"
#include "utils/assert.hh"
#include "utils/small_vector.hh"
#include "view_builder.hh"
#include "view_info.hh"
#include "view_update_checks.hh"
#include "types/list.hh"
@@ -2238,12 +2240,20 @@ void view_builder::setup_metrics() {
}
future<> view_builder::start_in_background(service::migration_manager& mm, utils::cross_shard_barrier barrier) {
auto step_fiber = make_ready_future<>();
try {
view_builder_init_state vbi;
auto fail = defer([&barrier] mutable { barrier.abort(); });
// Guard the whole startup routine with a semaphore,
// so that it's not intercepted by `on_drop_view`, `on_create_view`
// or `on_update_view` events.
// Semaphore usage invariants:
// - One unit of _sem serializes all per-shard bookkeeping that mutates view-builder state
// (_base_to_build_step, _built_views, build_status, reader resets).
// - The unit is held for the whole operation, including the async chain, until the state
// is stable for the next operation on that shard.
// - Cross-shard operations acquire _sem on shard 0 for the duration of the broadcast.
// Other shards acquire their own _sem only around their local handling; shard 0 skips
// the local acquire because it already holds the unit from the dispatcher.
// Guard the whole startup routine with a semaphore so that it's not intercepted by
// `on_drop_view`, `on_create_view`, or `on_update_view` events.
auto units = co_await get_units(_sem, view_builder_semaphore_units);
// Wait for schema agreement even if we're a seed node.
co_await mm.wait_for_schema_agreement(_db, db::timeout_clock::time_point::max(), &_as);
@@ -2264,8 +2274,10 @@ future<> view_builder::start_in_background(service::migration_manager& mm, utils
_mnotifier.register_listener(this);
co_await calculate_shard_build_step(vbi);
_current_step = _base_to_build_step.begin();
// Waited on indirectly in stop().
(void)_build_step.trigger();
// If preparation above fails, run_in_background() is not invoked, just
// the start_in_background() emits a warning into logs and resolves
step_fiber = run_in_background();
} catch (...) {
auto ex = std::current_exception();
auto ll = log_level::error;
@@ -2280,10 +2292,12 @@ future<> view_builder::start_in_background(service::migration_manager& mm, utils
}
vlogger.log(ll, "start aborted: {}", ex);
}
co_await std::move(step_fiber);
}
future<> view_builder::start(service::migration_manager& mm, utils::cross_shard_barrier barrier) {
_started = start_in_background(mm, std::move(barrier));
_step_fiber = start_in_background(mm, std::move(barrier));
return make_ready_future<>();
}
@@ -2293,12 +2307,12 @@ future<> view_builder::drain() {
}
vlogger.info("Draining view builder");
_as.request_abort();
co_await std::move(_started);
co_await _mnotifier.unregister_listener(this);
co_await _vug.drain();
co_await _sem.wait();
_sem.broken();
co_await _build_step.join();
_build_step.broken();
co_await std::move(_step_fiber);
co_await coroutine::parallel_for_each(_base_to_build_step, [] (std::pair<const table_id, build_step>& p) {
return p.second.reader.close();
});
@@ -2667,63 +2681,59 @@ static bool should_ignore_tablet_keyspace(const replica::database& db, const sst
return db.features().view_building_coordinator && db.has_keyspace(ks_name) && db.find_keyspace(ks_name).uses_tablets();
}
future<> view_builder::dispatch_create_view(sstring ks_name, sstring view_name) {
if (should_ignore_tablet_keyspace(_db, ks_name)) {
return make_ready_future<>();
}
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
// This runs on shard 0 only; seed the global rows before broadcasting.
return handle_seed_view_build_progress(ks_name, view_name).then([this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return container().invoke_on_all([ks_name = std::move(ks_name), view_name = std::move(view_name)] (view_builder& vb) mutable {
return vb.handle_create_view_local(std::move(ks_name), std::move(view_name));
});
});
});
future<view_builder::view_builder_units> view_builder::get_or_adopt_view_builder_lock(view_builder_units_opt units) {
co_return units ? std::move(*units) : co_await get_units(_sem, view_builder_semaphore_units);
}
future<> view_builder::handle_seed_view_build_progress(sstring ks_name, sstring view_name) {
future<> view_builder::dispatch_create_view(sstring ks_name, sstring view_name) {
if (should_ignore_tablet_keyspace(_db, ks_name)) {
co_return;
}
auto units = co_await get_or_adopt_view_builder_lock(std::nullopt);
co_await handle_seed_view_build_progress(ks_name, view_name);
co_await coroutine::all(
[this, ks_name, view_name, units = std::move(units)] mutable -> future<> {
co_await handle_create_view_local(ks_name, view_name, std::move(units)); },
[this, ks_name, view_name] mutable -> future<> {
co_await container().invoke_on_others([ks_name = std::move(ks_name), view_name = std::move(view_name)] (view_builder& vb) mutable -> future<> {
return vb.handle_create_view_local(ks_name, view_name, std::nullopt); }); });
}
future<> view_builder::handle_seed_view_build_progress(const sstring& ks_name, const sstring& view_name) {
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto& step = get_or_create_build_step(view->view_info()->base_id());
return _sys_ks.register_view_for_building_for_all_shards(view->ks_name(), view->cf_name(), step.current_token());
}
future<> view_builder::handle_create_view_local(sstring ks_name, sstring view_name){
if (this_shard_id() == 0) {
return handle_create_view_local_impl(std::move(ks_name), std::move(view_name));
} else {
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return handle_create_view_local_impl(std::move(ks_name), std::move(view_name));
});
}
}
future<> view_builder::handle_create_view_local_impl(sstring ks_name, sstring view_name) {
future<> view_builder::handle_create_view_local(const sstring& ks_name, const sstring& view_name, view_builder_units_opt units) {
[[maybe_unused]] auto sem_units = co_await get_or_adopt_view_builder_lock(std::move(units));
auto view = view_ptr(_db.find_schema(ks_name, view_name));
auto& step = get_or_create_build_step(view->view_info()->base_id());
return when_all(step.base->await_pending_writes(), step.base->await_pending_streams()).discard_result().then([this, &step] {
return flush_base(step.base, _as);
}).then([this, view, &step] () {
try {
co_await coroutine::all(
[&step] -> future<> {
co_await step.base->await_pending_writes(); },
[&step] -> future<> {
co_await step.base->await_pending_streams(); });
co_await flush_base(step.base, _as);
// This resets the build step to the current token. It may result in views currently
// being built to receive duplicate updates, but it simplifies things as we don't have
// to keep around a list of new views to build the next time the reader crosses a token
// threshold.
return initialize_reader_at_current_token(step).then([this, view, &step] () mutable {
return add_new_view(view, step);
}).then_wrapped([this, view] (future<>&& f) {
try {
f.get();
} catch (abort_requested_exception&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (raft::request_aborted&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (...) {
vlogger.error("Error setting up view for building {}.{}: {}", view->ks_name(), view->cf_name(), std::current_exception());
}
co_await initialize_reader_at_current_token(step);
co_await add_new_view(view, step);
} catch (abort_requested_exception&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (raft::request_aborted&) {
vlogger.debug("Aborted while setting up view for building {}.{}", view->ks_name(), view->cf_name());
} catch (...) {
vlogger.error("Error setting up view for building {}.{}: {}", view->ks_name(), view->cf_name(), std::current_exception());
}
// Waited on indirectly in stop().
static_cast<void>(_build_step.trigger());
});
});
_build_step.signal();
}
void view_builder::on_create_view(const sstring& ks_name, const sstring& view_name) {
@@ -2760,62 +2770,55 @@ void view_builder::on_update_view(const sstring& ks_name, const sstring& view_na
future<> view_builder::dispatch_drop_view(sstring ks_name, sstring view_name) {
if (should_ignore_tablet_keyspace(_db, ks_name)) {
return make_ready_future<>();
co_return;
}
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
// This runs on shard 0 only; broadcast local cleanup before global cleanup.
return container().invoke_on_all([ks_name, view_name] (view_builder& vb) mutable {
return vb.handle_drop_view_local(std::move(ks_name), std::move(view_name));
}).then([this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return handle_drop_view_global_cleanup(std::move(ks_name), std::move(view_name));
});
});
auto units = co_await get_or_adopt_view_builder_lock(std::nullopt);
co_await coroutine::all(
[this, ks_name, view_name, units = std::move(units)] mutable -> future<> {
co_await handle_drop_view_local(ks_name, view_name, std::move(units)); },
[this, ks_name, view_name] mutable -> future<> {
co_await container().invoke_on_others([ks_name = std::move(ks_name), view_name = std::move(view_name)] (view_builder& vb) mutable -> future<> {
return vb.handle_drop_view_local(ks_name, view_name, std::nullopt); });});
co_await handle_drop_view_global_cleanup(ks_name, view_name);
}
future<> view_builder::handle_drop_view_local(sstring ks_name, sstring view_name) {
if (this_shard_id() == 0) {
return handle_drop_view_local_impl(std::move(ks_name), std::move(view_name));
} else {
return with_semaphore(_sem, view_builder_semaphore_units, [this, ks_name = std::move(ks_name), view_name = std::move(view_name)] () mutable {
return handle_drop_view_local_impl(std::move(ks_name), std::move(view_name));
});
}
}
future<> view_builder::handle_drop_view_local_impl(sstring ks_name, sstring view_name) {
future<> view_builder::handle_drop_view_local(const sstring& ks_name, const sstring& view_name, view_builder_units_opt units) {
[[maybe_unused]] auto sem_units = co_await get_or_adopt_view_builder_lock(std::move(units));
vlogger.info0("Stopping to build view {}.{}", ks_name, view_name);
// The view is absent from the database at this point, so find it by brute force.
([&, this] {
for (auto& [_, step] : _base_to_build_step) {
if (step.build_status.empty() || step.build_status.front().view->ks_name() != ks_name) {
continue;
}
for (auto it = step.build_status.begin(); it != step.build_status.end(); ++it) {
if (it->view->cf_name() == view_name) {
_built_views.erase(it->view->id());
step.build_status.erase(it);
return;
}
for (auto& [_, step] : _base_to_build_step) {
if (step.build_status.empty() || step.build_status.front().view->ks_name() != ks_name) {
continue;
}
for (auto it = step.build_status.begin(); it != step.build_status.end(); ++it) {
if (it->view->cf_name() == view_name) {
_built_views.erase(it->view->id());
step.build_status.erase(it);
co_return;
}
}
})();
return make_ready_future<>();
}
}
future<> view_builder::handle_drop_view_global_cleanup(sstring ks_name, sstring view_name) {
future<> view_builder::handle_drop_view_global_cleanup(const sstring& ks_name, const sstring& view_name) {
if (this_shard_id() != 0) {
return make_ready_future<>();
co_return;
}
vlogger.info0("Starting view global cleanup {}.{}", ks_name, view_name);
return when_all_succeed(
_sys_ks.remove_view_build_progress_across_all_shards(ks_name, view_name),
_sys_ks.remove_built_view(ks_name, view_name),
remove_view_build_status(ks_name, view_name))
.discard_result()
.handle_exception([ks_name, view_name] (std::exception_ptr ep) {
vlogger.warn("Failed to cleanup view {}.{}: {}", ks_name, view_name, ep);
});
try {
co_await coroutine::all(
[this, &ks_name, &view_name] -> future<> {
co_await _sys_ks.remove_view_build_progress_across_all_shards(ks_name, view_name); },
[this, &ks_name, &view_name] -> future<> {
co_await _sys_ks.remove_built_view(ks_name, view_name); },
[this, &ks_name, &view_name] -> future<> {
co_await remove_view_build_status(ks_name, view_name); });
} catch (...) {
vlogger.warn("Failed to cleanup view {}.{}: {}", ks_name, view_name, std::current_exception());
}
}
void view_builder::on_drop_view(const sstring& ks_name, const sstring& view_name) {
@@ -2829,14 +2832,15 @@ void view_builder::on_drop_view(const sstring& ks_name, const sstring& view_name
}));
}
future<> view_builder::do_build_step() {
// Run the view building in the streaming scheduling group
// so that it doesn't impact other tasks with higher priority.
seastar::thread_attributes attr;
attr.sched_group = _db.get_streaming_scheduling_group();
return seastar::async(std::move(attr), [this] {
future<> view_builder::run_in_background() {
return seastar::async([this] {
exponential_backoff_retry r(1s, 1min);
while (!_base_to_build_step.empty() && !_as.abort_requested()) {
while (!_as.abort_requested()) {
try {
_build_step.wait([this] { return !_base_to_build_step.empty(); }).get();
} catch (const seastar::broken_condition_variable&) {
return;
}
auto units = get_units(_sem, view_builder_semaphore_units).get();
++_stats.steps_performed;
try {

View File

@@ -11,13 +11,13 @@
#include "query/query-request.hh"
#include "service/migration_listener.hh"
#include "service/raft/raft_group0_client.hh"
#include "utils/serialized_action.hh"
#include "utils/cross-shard-barrier.hh"
#include "replica/database.hh"
#include <seastar/core/abort_source.hh>
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/condition-variable.hh>
#include <seastar/core/sharded.hh>
#include <seastar/core/shared_future.hh>
#include <seastar/core/shared_ptr.hh>
@@ -104,6 +104,12 @@ class view_update_generator;
* redo the missing step, for simplicity.
*/
class view_builder final : public service::migration_listener::only_view_notifications, public seastar::peering_sharded_service<view_builder> {
//aliasing for semaphore units that will be used throughout the class
using view_builder_units = semaphore_units<named_semaphore_exception_factory>;
//aliasing for optional semaphore units that will be used throughout the class
using view_builder_units_opt = std::optional<view_builder_units>;
/**
* Keeps track of the build progress for a particular view.
* When the view is built, next_token == first_token.
@@ -168,14 +174,24 @@ class view_builder final : public service::migration_listener::only_view_notific
reader_permit _permit;
base_to_build_step_type _base_to_build_step;
base_to_build_step_type::iterator _current_step = _base_to_build_step.end();
serialized_action _build_step{std::bind(&view_builder::do_build_step, this)};
condition_variable _build_step;
static constexpr size_t view_builder_semaphore_units = 1;
// Ensures bookkeeping operations are serialized, meaning that while we execute
// a build step we don't consider newly added or removed views. This simplifies
// the algorithms. Also synchronizes an operation wrt. a call to stop().
// Semaphore usage invariants:
// - One unit of _sem serializes all per-shard bookkeeping that mutates view-builder state
// (_base_to_build_step, _built_views, build_status, reader resets).
// - The unit is held for the whole operation, including the async chain, until the state
// is stable for the next operation on that shard.
// - Cross-shard operations acquire _sem on shard 0 for the duration of the broadcast.
// Other shards acquire their own _sem only around their local handling; shard 0 skips
// the local acquire because it already holds the unit from the dispatcher.
// Guard the whole startup routine with a semaphore so that it's not intercepted by
// `on_drop_view`, `on_create_view`, or `on_update_view` events.
seastar::named_semaphore _sem{view_builder_semaphore_units, named_semaphore_exception_factory{"view builder"}};
seastar::abort_source _as;
future<> _started = make_ready_future<>();
future<> _step_fiber = make_ready_future<>();
// Used to coordinate between shards the conclusion of the build process for a particular view.
std::unordered_set<table_id> _built_views;
// Used for testing.
@@ -262,19 +278,18 @@ private:
void setup_shard_build_step(view_builder_init_state& vbi, std::vector<system_keyspace_view_name>, std::vector<system_keyspace_view_build_progress>);
future<> calculate_shard_build_step(view_builder_init_state& vbi);
future<> add_new_view(view_ptr, build_step&);
future<> do_build_step();
future<> run_in_background();
void execute(build_step&, exponential_backoff_retry);
future<> maybe_mark_view_as_built(view_ptr, dht::token);
future<> mark_as_built(view_ptr);
void setup_metrics();
future<> dispatch_create_view(sstring ks_name, sstring view_name);
future<> dispatch_drop_view(sstring ks_name, sstring view_name);
future<> handle_seed_view_build_progress(sstring ks_name, sstring view_name);
future<> handle_create_view_local(sstring ks_name, sstring view_name);
future<> handle_drop_view_local(sstring ks_name, sstring view_name);
future<> handle_create_view_local_impl(sstring ks_name, sstring view_name);
future<> handle_drop_view_local_impl(sstring ks_name, sstring view_name);
future<> handle_drop_view_global_cleanup(sstring ks_name, sstring view_name);
future<> handle_seed_view_build_progress(const sstring& ks_name, const sstring& view_name);
future<> handle_create_view_local(const sstring& ks_name, const sstring& view_name, view_builder_units_opt units);
future<> handle_drop_view_local(const sstring& ks_name, const sstring& view_name, view_builder_units_opt units);
future<> handle_drop_view_global_cleanup(const sstring& ks_name, const sstring& view_name);
future<view_builder_units> get_or_adopt_view_builder_lock(view_builder_units_opt units);
template <typename Func1, typename Func2>
future<> write_view_build_status(Func1&& fn_group0, Func2&& fn_sys_dist) {

View File

@@ -242,7 +242,7 @@ future<> view_building_worker::create_staging_sstable_tasks() {
utils::UUID_gen::get_time_UUID(), view_building_task::task_type::process_staging, false,
table_id, ::table_id{}, {my_host_id, sst_info.shard}, sst_info.last_token
};
auto mut = co_await _group0.client().sys_ks().make_view_building_task_mutation(guard.write_timestamp(), task);
auto mut = co_await _sys_ks.make_view_building_task_mutation(guard.write_timestamp(), task);
cmuts.emplace_back(std::move(mut));
}
}
@@ -386,7 +386,6 @@ future<> view_building_worker::update_built_views() {
auto schema = _db.find_schema(table_id);
return std::make_pair(schema->ks_name(), schema->cf_name());
};
auto& sys_ks = _group0.client().sys_ks();
std::set<std::pair<sstring, sstring>> built_views;
for (auto& [id, statuses]: _vb_state_machine.views_state.status_map) {
@@ -395,22 +394,22 @@ future<> view_building_worker::update_built_views() {
}
}
auto local_built = co_await sys_ks.load_built_views() | std::views::filter([&] (auto& v) {
auto local_built = co_await _sys_ks.load_built_views() | std::views::filter([&] (auto& v) {
return !_db.has_keyspace(v.first) || _db.find_keyspace(v.first).uses_tablets();
}) | std::ranges::to<std::set>();
// Remove dead entries
for (auto& view: local_built) {
if (!built_views.contains(view)) {
co_await sys_ks.remove_built_view(view.first, view.second);
co_await _sys_ks.remove_built_view(view.first, view.second);
}
}
// Add new entries
for (auto& view: built_views) {
if (!local_built.contains(view)) {
co_await sys_ks.mark_view_as_built(view.first, view.second);
co_await sys_ks.remove_view_build_progress_across_all_shards(view.first, view.second);
co_await _sys_ks.mark_view_as_built(view.first, view.second);
co_await _sys_ks.remove_view_build_progress_across_all_shards(view.first, view.second);
}
}
}

View File

@@ -11,5 +11,6 @@
namespace debug {
seastar::sharded<replica::database>* volatile the_database = nullptr;
seastar::scheduling_group streaming_scheduling_group;
}

View File

@@ -17,7 +17,7 @@ class database;
namespace debug {
extern seastar::sharded<replica::database>* volatile the_database;
extern seastar::scheduling_group streaming_scheduling_group;
}

View File

@@ -30,11 +30,13 @@ namespace dht {
// Total ordering defined by comparators is compatible with Origin's ordering.
class decorated_key {
public:
dht::token _token;
// Store only the token data as int64_t to avoid the bloat of storing
// token_kind, which is always token_kind::key for decorated_key.
int64_t _token_data;
partition_key _key;
decorated_key(dht::token t, partition_key k)
: _token(std::move(t))
: _token_data(t._data)
, _key(std::move(k)) {
}
@@ -56,8 +58,8 @@ public:
std::strong_ordering tri_compare(const schema& s, const decorated_key& other) const;
std::strong_ordering tri_compare(const schema& s, const ring_position& other) const;
const dht::token& token() const noexcept {
return _token;
dht::token token() const noexcept {
return dht::token(_token_data);
}
const partition_key& key() const {
@@ -65,7 +67,7 @@ public:
}
size_t external_memory_usage() const {
return _key.external_memory_usage() + _token.external_memory_usage();
return _key.external_memory_usage();
}
size_t memory_usage() const {
@@ -102,6 +104,6 @@ template <> struct fmt::formatter<dht::decorated_key> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
template <typename FormatContext>
auto format(const dht::decorated_key& dk, FormatContext& ctx) const {
return fmt::format_to(ctx.out(), "{{key: {}, token: {}}}", dk._key, dk._token);
return fmt::format_to(ctx.out(), "{{key: {}, token: {}}}", dk._key, dk.token());
}
};

View File

@@ -95,7 +95,7 @@ std::unique_ptr<dht::i_partitioner> make_partitioner(sstring partitioner_name) {
bool
decorated_key::equal(const schema& s, const decorated_key& other) const {
if (_token == other._token) {
if (_token_data == other._token_data) {
return _key.legacy_equal(s, other._key);
}
return false;
@@ -103,7 +103,7 @@ decorated_key::equal(const schema& s, const decorated_key& other) const {
std::strong_ordering
decorated_key::tri_compare(const schema& s, const decorated_key& other) const {
auto r = _token <=> other._token;
auto r = _token_data <=> other._token_data;
if (r != 0) {
return r;
} else {
@@ -113,13 +113,24 @@ decorated_key::tri_compare(const schema& s, const decorated_key& other) const {
std::strong_ordering
decorated_key::tri_compare(const schema& s, const ring_position& other) const {
auto r = _token <=> other.token();
if (r != 0) {
return r;
} else if (other.has_key()) {
return _key.legacy_tri_compare(s, *other.key());
// decorated_key tokens are always of token_kind::key, so we need to
// account for ring_position tokens that might be before_all_keys or after_all_keys
const auto& other_token = other.token();
if (other_token._kind == token_kind::key) [[likely]] {
auto r = _token_data <=> other_token._data;
if (r != 0) {
return r;
} else if (other.has_key()) {
return _key.legacy_tri_compare(s, *other.key());
}
return 0 <=> other.relation_to_keys();
} else if (other_token._kind == token_kind::before_all_keys) {
// decorated_key (token_kind::key) > before_all_keys
return std::strong_ordering::greater;
} else {
// decorated_key (token_kind::key) < after_all_keys
return std::strong_ordering::less;
}
return 0 <=> other.relation_to_keys();
}
bool

View File

@@ -93,12 +93,12 @@ public:
{ }
ring_position(const dht::decorated_key& dk)
: _token(dk._token)
: _token(dk.token())
, _key(std::make_optional(dk._key))
{ }
ring_position(dht::decorated_key&& dk)
: _token(std::move(dk._token))
: _token(dk.token())
, _key(std::make_optional(std::move(dk._key)))
{ }

View File

@@ -5,6 +5,10 @@
/stable/kb/perftune-modes-sync.html: /stable/kb/index.html
# Remove the troubleshooting page relevant for Open Source only
/stable/troubleshooting/missing-dotmount-files.html: /troubleshooting/index.html
# Move the diver information to another project
/stable/using-scylla/drivers/index.html: https://docs.scylladb.com/stable/drivers/index.html

View File

@@ -1026,7 +1026,29 @@ You can enable the after-repair tombstone GC by setting the ``repair`` mode usin
ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'} ;
The following modes are available:
To support writes arriving out-of-order -- either due to natural delays, or user provided timestamps -- the repair mode has a propagation delay.
Out-of-order writes present a problem for repair mode tombstone gc. Consider the following example sequence of events:
1) Write ``DELETE FROM table WHERE key = K1`` arrives at the node.
2) Repair is run.
3) Compaction runs and garbage collects the tombstone for ``key = K1``.
4) Write ``INSERT INTO table (key, ...) VALUES (K1, ...)`` arrives at the node with timestamp smaller than that of the delete. The tombstone for ``key = K1`` should apply to this write, but it is already garbage collected, so this data is resurrected.
Propagation delay solves this problem by establishing a window before repair, where tombstones are not yet garbage collectible: a tombstone is garbage collectible if it was written before the last repair by at least the propagation delay.
The value of the propagation delay can be set via the ``propagation_delay_in_seconds`` parameter:
.. code-block:: cql
CREATE TABLE ks.cf (key blob PRIMARY KEY, val blob) WITH tombstone_gc = {'mode':'repair', 'propagation_delay_in_seconds': 120};
.. code-block:: cql
ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair', 'propagation_delay_in_seconds': 120};
The default value of the propagation delay is 1 hour. This parameter should only be changed if your application uses user provided timestamps and writes and deletes can arrive out-of-order by more than the default 1 hour.
The following tombstone gc modes are available:
.. list-table::
:widths: 20 80

View File

@@ -0,0 +1,23 @@
.. _automatic-repair:
Automatic Repair
================
Traditionally, launching `repairs </operating-scylla/procedures/maintenance/repair>`_ in a ScyllaDB cluster is left to an external process, typically done via `Scylla Manager <https://manager.docs.scylladb.com/stable/repair/index.html>`_.
Automatic repair offers built-in scheduling in ScyllaDB itself. If the time since the last repair is greater than the configured repair interval, ScyllaDB will start a repair for the tablet `tablet </architecture/tablets>`_ automatically.
Repairs are spread over time and among nodes and shards, to avoid load spikes or any adverse effects on user workloads.
To enable automatic repair, add this to the configuration (``scylla.yaml``):
.. code-block:: yaml
auto_repair_enabled_default: true
auto_repair_threshold_default_in_seconds: 86400
This will enable automatic repair for all tables with a repair period of 1 day. This configuration has to be set on each node, to an identical value.
More featureful configuration methods will be implemented in the future.
To disable, set ``auto_repair_enabled_default: false``.
Automatic repair relies on `Incremental Repair </features/incremental-repair>`_ and as such it only works with `tablet </architecture/tablets>`_ tables.

View File

@@ -3,7 +3,7 @@
Incremental Repair
==================
ScyllaDB's standard repair process scans and processes all the data on a node, regardless of whether it has changed since the last repair. This operation can be resource-intensive and time-consuming. The Incremental Repair feature provides a much more efficient and lightweight alternative for maintaining data consistency.
ScyllaDB's standard `repair </operating-scylla/procedures/maintenance/repair>`_ process scans and processes all the data on a node, regardless of whether it has changed since the last repair. This operation can be resource-intensive and time-consuming. The Incremental Repair feature provides a much more efficient and lightweight alternative for maintaining data consistency.
The core idea of incremental repair is to repair only the data that has been written or changed since the last repair was run. It intelligently skips data that has already been verified, dramatically reducing the time, I/O, and CPU resources required for the repair operation.
@@ -37,7 +37,12 @@ The available modes are:
* ``disabled``: Completely disables the incremental repair logic for the current operation. The repair behaves like a classic, non-incremental repair, and it does not read or update any incremental repair status markers.
The incremental_mode parameter can be specified using nodetool cluster repair, e.g., nodetool cluster repair --incremental-mode incremental. It can also be specified with the REST API, e.g., curl -X POST "http://127.0.0.1:10000/storage_service/tablets/repair?ks=ks1&table=tb1&tokens=all&incremental_mode=incremental"
The incremental_mode parameter can be specified using nodetool cluster repair, e.g., nodetool cluster repair --incremental-mode incremental.
It can also be specified with the REST API, e.g.:
.. code::
curl -X POST "http://127.0.0.1:10000/storage_service/tablets/repair?ks=ks1&table=tb1&tokens=all&incremental_mode=incremental"
Benefits of Incremental Repair
------------------------------
@@ -46,6 +51,8 @@ Benefits of Incremental Repair
* **Reduced Resource Usage:** Consumes significantly less CPU, I/O, and network bandwidth compared to a full repair.
* **More Frequent Repairs:** The efficiency of incremental repair allows you to run it more frequently, ensuring a higher level of data consistency across your cluster at all times.
Tables using Incremental Repair can schedule repairs in ScyllaDB itself, with `Automatic Repair </features/automatic-repair>`_.
Notes
-----

View File

@@ -17,6 +17,7 @@ This document highlights ScyllaDB's key data modeling features.
Workload Prioritization </features/workload-prioritization>
Backup and Restore </features/backup-and-restore>
Incremental Repair </features/incremental-repair/>
Automatic Repair </features/automatic-repair/>
Vector Search </features/vector-search/>
.. panel-box::
@@ -44,5 +45,7 @@ This document highlights ScyllaDB's key data modeling features.
* :doc:`Incremental Repair </features/incremental-repair/>` provides a much more
efficient and lightweight approach to maintaining data consistency by
repairing only the data that has changed since the last repair.
* :doc:`Automatic Repair </features/automatic-repair/>` schedules and runs repairs
directly in ScyllaDB, without external schedulers.
* :doc:`Vector Search in ScyllaDB </features/vector-search/>` enables
similarity-based queries on vector embeddings.

View File

@@ -24,9 +24,9 @@ Keep your versions up-to-date. The two latest versions are supported. Also, alwa
:id: "getting-started"
:class: my-panel
* :doc:`Launch ScyllaDB on AWS </getting-started/install-scylla/launch-on-aws>`
* :doc:`Launch ScyllaDB on GCP </getting-started/install-scylla/launch-on-gcp>`
* :doc:`Launch ScyllaDB on Azure </getting-started/install-scylla/launch-on-azure>`
* :doc:`Launch ScyllaDB |CURRENT_VERSION| on AWS </getting-started/install-scylla/launch-on-aws>`
* :doc:`Launch ScyllaDB |CURRENT_VERSION| on GCP </getting-started/install-scylla/launch-on-gcp>`
* :doc:`Launch ScyllaDB |CURRENT_VERSION| on Azure </getting-started/install-scylla/launch-on-azure>`
.. panel-box::
@@ -35,7 +35,7 @@ Keep your versions up-to-date. The two latest versions are supported. Also, alwa
:class: my-panel
* :doc:`Install ScyllaDB with Web Installer (recommended) </getting-started/installation-common/scylla-web-installer>`
* :doc:`Install ScyllaDB Linux Packages </getting-started/install-scylla/install-on-linux>`
* :doc:`Install ScyllaDB |CURRENT_VERSION| Linux Packages </getting-started/install-scylla/install-on-linux>`
* :doc:`Install scylla-jmx Package </getting-started/installation-common/install-jmx>`
* :doc:`Install ScyllaDB Without root Privileges </getting-started/installation-common/unified-installer>`
* :doc:`Air-gapped Server Installation </getting-started/installation-common/air-gapped-install>`

View File

@@ -4,9 +4,9 @@
.. |RHEL_EPEL_8| replace:: https://dl.fedoraproject.org/pub/epel/epel-release-latest-8.noarch.rpm
.. |RHEL_EPEL_9| replace:: https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm
======================================
Install ScyllaDB Linux Packages
======================================
========================================================
Install ScyllaDB |CURRENT_VERSION| Linux Packages
========================================================
We recommend installing ScyllaDB using :doc:`ScyllaDB Web Installer for Linux </getting-started/installation-common/scylla-web-installer/>`,
a platform-agnostic installation script, to install ScyllaDB on any supported Linux platform.
@@ -46,8 +46,8 @@ Install ScyllaDB
.. code-block:: console
sudo gpg --homedir /tmp --no-default-keyring --keyring /tmp/temp.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys a43e06657bac99e3
sudo gpg --homedir /tmp --no-default-keyring --keyring /tmp/temp.gpg --export --armor a43e06657bac99e3 | gpg --dearmor > /etc/apt/keyrings/scylladb.gpg
sudo gpg --homedir /tmp --no-default-keyring --keyring /tmp/temp.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys c503c686b007f39e
sudo gpg --homedir /tmp --no-default-keyring --keyring /tmp/temp.gpg --export --armor c503c686b007f39e | gpg --dearmor > /etc/apt/keyrings/scylladb.gpg
.. code-block:: console
:substitutions:

View File

@@ -1,6 +1,6 @@
==========================
Launch ScyllaDB on AWS
==========================
===============================================
Launch ScyllaDB |CURRENT_VERSION| on AWS
===============================================
This article will guide you through self-managed ScyllaDB deployment on AWS. For a fully-managed deployment of ScyllaDB
as-a-service, see `ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/>`_.

View File

@@ -1,6 +1,6 @@
==========================
Launch ScyllaDB on Azure
==========================
===============================================
Launch ScyllaDB |CURRENT_VERSION| on Azure
===============================================
This article will guide you through self-managed ScyllaDB deployment on Azure. For a fully-managed deployment of ScyllaDB
as-a-service, see `ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/>`_.

View File

@@ -1,6 +1,6 @@
==========================
Launch ScyllaDB on GCP
==========================
=============================================
Launch ScyllaDB |CURRENT_VERSION| on GCP
=============================================
This article will guide you through self-managed ScyllaDB deployment on GCP. For a fully-managed deployment of ScyllaDB
as-a-service, see `ScyllaDB Cloud documentation <https://cloud.docs.scylladb.com/>`_.

View File

@@ -601,11 +601,7 @@ Scrub has several modes:
* **segregate** - Fixes partition/row/mutation-fragment out-of-order errors by segregating the output into as many SStables as required so that the content of each output SStable is properly ordered.
* **validate** - Validates the content of the SStable, reporting any corruptions found. Writes no output SStables. In this mode, scrub has the same outcome as the `validate operation <scylla-sstable-validate-operation_>`_ - and the validate operation is recommended over scrub.
Output SStables are written to the directory specified via ``--output-directory``. They will be written with the ``BIG`` format and the highest supported SStable format, with generations chosen by scylla-sstable. Generations are chosen such
that they are unique among the SStables written by the current scrub.
The output directory must be empty; otherwise, scylla-sstable will abort scrub. You can allow writing to a non-empty directory by setting the ``--unsafe-accept-nonempty-output-dir`` command line flag.
Note that scrub will be aborted if an SStable cannot be written because its generation clashes with a pre-existing SStable in the output directory.
Output SStables are written to the directory specified via ``--output-dir``. They will be written with the ``BIG`` format and the highest supported SStable format, with random generation.
validate-checksums
^^^^^^^^^^^^^^^^^^
@@ -870,7 +866,7 @@ The SSTable version to be used can be overridden with the ``--version`` flag, al
SSTables which are already on the designated version are skipped. To force rewriting *all* SSTables, use the ``--all`` flag.
Output SSTables are written to the path provided by the ``--output-dir`` flag, or to the current directory if not specified.
This directory is expected to exist and be empty. If not empty the tool will refuse to run. This can be overridden with the ``--unsafe-accept-nonempty-output-dir`` flag.
This directory is expected to exist.
It is strongly recommended to use the system schema tables as the schema source for this command, see the :ref:`schema options <scylla-sstable-schema>` for more details.
A schema which is good enough to read the SSTable and dump its content, may not be good enough to write its content back verbatim.
@@ -882,6 +878,25 @@ But even an altered schema which changed only the table options can lead to data
The mapping of input SSTables to output SSTables is printed to ``stdout``.
filter
^^^^^^
Filter the SSTable(s), including/excluding specified partitions.
Similar to ``scylla sstable dump-data --partition|--partition-file``, with some notable differences:
* Instead of dumping the content to stdout, the filtered content is written back to SSTable(s) on disk.
* Also supports negative filters (keep all partitions except the those specified).
The partition list can be provided either via the ``--partition`` command line argument, or via a file path passed to the the ``--partitions-file`` argument. The file should contain one partition key per line.
Partition keys should be provided in the hex format, as produced by `scylla types serialize </operating-scylla/admin-tools/scylla-types/>`_.
With ``--include``, only the specified partitions are kept from the input SSTable(s). With ``--exclude``, the specified partitions are discarded and won't be written to the output SSTable(s).
It is possible that certain input SSTable(s) won't have any content left after the filtering. These input SSTable(s) will not have a matching output SSTable.
By default, each input sstable is filtered individually. Use ``--merge`` to filter the combined content of all input sstables, producing a single output SSTable.
Output sstables use the latest supported sstable format (can be changed with ``--sstable-version``).
Examples
--------

View File

@@ -58,4 +58,12 @@ See also
* `Blog: ScyllaDB Open Source 3.1: Efficiently Maintaining Consistency with Row-Level Repair <https://www.scylladb.com/2019/08/13/scylla-open-source-3-1-efficiently-maintaining-consistency-with-row-level-repair/>`_
Incremental Repair
------------------
Built on top of `Row-level Repair <row-level-repair_>`_ and `Tablets </architecture/tablets>`_, Incremental Repair enables frequent and quick repairs. For more details, see `Incremental Repair </features/incremental-repair>`_.
Automatic Repair
----------------
Built on top of `Incremental Repair </features/incremental-repair>`_, `Automatic Repair </features/automatic-repair>`_ offers repair scheduling and execution directly in ScyllaDB, without external processes.

View File

@@ -8,7 +8,6 @@ Troubleshooting ScyllaDB
support/index
startup/index
upgrade/index
cluster/index
modeling/index
storage/index
@@ -29,7 +28,6 @@ Keep your versions up-to-date. The two latest versions are supported. Also, alwa
* :doc:`Errors and ScyllaDB Customer Support <support/index>`
* :doc:`ScyllaDB Startup <startup/index>`
* :doc:`ScyllaDB Cluster and Node <cluster/index>`
* :doc:`ScyllaDB Upgrade <upgrade/index>`
* :doc:`Data Modeling <modeling/index>`
* :doc:`Data Storage and SSTables <storage/index>`
* :doc:`CQL errors <CQL/index>`

View File

@@ -1,79 +0,0 @@
Inaccessible "/var/lib/scylla" and "/var/lib/systemd/coredump" after ScyllaDB upgrade
======================================================================================
Problem
^^^^^^^
When you reboot the machine after a ScyllaDB upgrade, you cannot access data directories under ``/var/lib/scylla``, and
coredump saves to ``rootfs``.
The problem may occur when you upgrade ScylaDB Open Source 4.6 or later to a version of ScyllaDB Enterprise if
the ``/etc/systemd/system/var-lib-scylla.mount`` and ``/etc/systemd/system/var-lib-systemd-coredump.mount`` are
deleted by RPM.
To avoid losing the files, the upgrade procedure includes a step to backup the .mount files. The following
example shows the command to backup the files before the upgrade from version 5.0:
.. code-block:: console
for conf in $( rpm -qc $(rpm -qa | grep scylla) | grep -v contains ) /etc/systemd/system/{var-lib-scylla,var-lib-systemd-coredump}.mount; do sudo cp -v $conf $conf.backup-5.0; done
If you don't backup the .mount files before the upgrade, the files may be lost.
Solution
^^^^^^^^
If you didn't backup the .mount files before the upgrade and the files were deleted during the upgrade,
you need to restore them manually.
To restore ``/etc/systemd/system/var-lib-systemd-coredump.mount``, run the following:
.. code-block:: console
$ cat << EOS | sudo tee /etc/systemd/system/var-lib-systemd-coredump.mount
[Unit]
Description=Save coredump to scylla data directory
Conflicts=umount.target
Before=scylla-server.service
After=local-fs.target
DefaultDependencies=no
[Mount]
What=/var/lib/scylla/coredump
Where=/var/lib/systemd/coredump
Type=none
Options=bind
[Install]
WantedBy=multi-user.target
EOS
To restore ``/etc/systemd/system/var-lib-scylla.mount``, run the following (specifying your data disk):
.. code-block:: console
$ UUID=`blkid -s UUID -o value <specify your data disk, eg: /dev/md0>`
$ cat << EOS | sudo tee /etc/systemd/system/var-lib-scylla.mount
[Unit]
Description=ScyllaDB data directory
Before=scylla-server.service
After=local-fs.target
DefaultDependencies=no
[Mount]
What=/dev/disk/by-uuid/$UUID
Where=/var/lib/scylla
Type=xfs
Options=noatime
[Install]
WantedBy=multi-user.target
EOS
After restoring .mount files, you need to enable them:
.. code-block:: console
$ sudo systemctl daemon-reload
$ sudo systemctl enable --now var-lib-scylla.mount
$ sudo systemctl enable --now var-lib-systemd-coredump.mount
.. include:: /troubleshooting/_common/ts-return.rst

View File

@@ -1,16 +0,0 @@
Upgrade
=================
.. toctree::
:hidden:
:maxdepth: 2
Inaccessible configuration files after ScyllaDB upgrade </troubleshooting/missing-dotmount-files>
.. panel-box::
:title: Upgrade Issues
:id: "getting-started"
:class: my-panel
* :doc:`Inaccessible "/var/lib/scylla" and "/var/lib/systemd/coredump" after ScyllaDB upgrade </troubleshooting//missing-dotmount-files>`

View File

@@ -182,6 +182,7 @@ public:
gms::feature removenode_with_left_token_ring { *this, "REMOVENODE_WITH_LEFT_TOKEN_RING"sv };
gms::feature size_based_load_balancing { *this, "SIZE_BASED_LOAD_BALANCING"sv };
gms::feature topology_noop_request { *this, "TOPOLOGY_NOOP_REQUEST"sv };
gms::feature tablets_intermediate_fallback_cleanup { *this, "TABLETS_INTERMEDIATE_FALLBACK_CLEANUP"sv };
public:
const std::unordered_map<sstring, std::reference_wrapper<feature>>& registered_features() const;

View File

@@ -50,6 +50,8 @@ write_replica_set_selector get_selector_for_writes(tablet_transition_stage stage
return write_replica_set_selector::previous;
case tablet_transition_stage::write_both_read_old:
return write_replica_set_selector::both;
case tablet_transition_stage::write_both_read_old_fallback_cleanup:
return write_replica_set_selector::both;
case tablet_transition_stage::streaming:
return write_replica_set_selector::both;
case tablet_transition_stage::rebuild_repair:
@@ -81,6 +83,8 @@ read_replica_set_selector get_selector_for_reads(tablet_transition_stage stage)
return read_replica_set_selector::previous;
case tablet_transition_stage::write_both_read_old:
return read_replica_set_selector::previous;
case tablet_transition_stage::write_both_read_old_fallback_cleanup:
return read_replica_set_selector::previous;
case tablet_transition_stage::streaming:
return read_replica_set_selector::previous;
case tablet_transition_stage::rebuild_repair:
@@ -741,6 +745,7 @@ void tablet_map::set_tablet_raft_info(tablet_id id, tablet_raft_info raft_info)
static const std::unordered_map<tablet_transition_stage, sstring> tablet_transition_stage_to_name = {
{tablet_transition_stage::allow_write_both_read_old, "allow_write_both_read_old"},
{tablet_transition_stage::write_both_read_old, "write_both_read_old"},
{tablet_transition_stage::write_both_read_old_fallback_cleanup, "write_both_read_old_fallback_cleanup"},
{tablet_transition_stage::write_both_read_new, "write_both_read_new"},
{tablet_transition_stage::streaming, "streaming"},
{tablet_transition_stage::rebuild_repair, "rebuild_repair"},

View File

@@ -277,6 +277,7 @@ std::optional<tablet_info> merge_tablet_info(tablet_info a, tablet_info b);
enum class tablet_transition_stage {
allow_write_both_read_old,
write_both_read_old,
write_both_read_old_fallback_cleanup,
streaming,
rebuild_repair,
write_both_read_new,

19
main.cc
View File

@@ -571,7 +571,7 @@ sharded<service::storage_proxy> *the_storage_proxy;
// This is used by perf-alternator to allow running scylla together with the tool
// in a single process. So that it's easier to measure internals. It's not added
// to main_func_type to not complicate common flow as no other tool needs such logic.
std::function<void(lw_shared_ptr<db::config>)> after_init_func;
std::function<future<>(lw_shared_ptr<db::config>, sharded<abort_source>&)> after_init_func;
static locator::host_id initialize_local_info_thread(sharded<db::system_keyspace>& sys_ks,
sharded<locator::snitch_ptr>& snitch,
@@ -906,6 +906,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
auto background_reclaim_scheduling_group = create_scheduling_group("background_reclaim", "bgre", 50).get();
auto maintenance_scheduling_group = create_scheduling_group("streaming", "strm", 200).get();
debug::streaming_scheduling_group = maintenance_scheduling_group;
smp::invoke_on_all([&cfg, background_reclaim_scheduling_group] {
logalloc::tracker::config st_cfg;
@@ -1306,6 +1307,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
checkpoint(stop_signal, "starting storage proxy");
service::storage_proxy::config spcfg {
.hints_directory_initializer = hints_dir_initializer,
.hints_sched_group = maintenance_scheduling_group,
};
spcfg.hinted_handoff_enabled = hinted_handoff_enabled;
spcfg.available_memory = memory::stats().total_memory();
@@ -1677,7 +1679,9 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
gossiper.local(), feature_service.local(), sys_ks.local(), group0_client, dbcfg.gossip_scheduling_group};
checkpoint(stop_signal, "starting tablet allocator");
service::tablet_allocator::config tacfg;
service::tablet_allocator::config tacfg {
.background_sg = maintenance_scheduling_group,
};
sharded<service::tablet_allocator> tablet_allocator;
tablet_allocator.start(tacfg, std::ref(mm_notifier), std::ref(db)).get();
auto stop_tablet_allocator = defer_verbose_shutdown("tablet allocator", [&tablet_allocator] {
@@ -2490,7 +2494,9 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
if (cfg->view_building()) {
checkpoint(stop_signal, "starting view builders");
view_builder.invoke_on_all(&db::view::view_builder::start, std::ref(mm), utils::cross_shard_barrier()).get();
with_scheduling_group(maintenance_scheduling_group, [&mm] {
return view_builder.invoke_on_all(&db::view::view_builder::start, std::ref(mm), utils::cross_shard_barrier());
}).get();
}
auto drain_view_builder = defer_verbose_shutdown("draining view builders", [&] {
view_builder.invoke_on_all(&db::view::view_builder::drain).get();
@@ -2576,11 +2582,13 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
supervisor::notify("serving");
startlog.info("Scylla version {} initialization completed.", scylla_version());
future<> after_init_fut = make_ready_future<>();
if (after_init_func) {
after_init_func(cfg);
after_init_fut = after_init_func(cfg, stop_signal.as_sharded_abort_source());
}
stop_signal.wait().get();
startlog.info("Signal received; shutting down");
std::move(after_init_fut).get();
// At this point, all objects destructors and all shutdown hooks registered with defer() are executed
} catch (const sleep_aborted&) {
startlog.info("Startup interrupted");
@@ -2650,7 +2658,8 @@ int main(int ac, char** av) {
{"perf-load-balancing", perf::scylla_tablet_load_balancing_main, "run tablet load balancer tests"},
{"perf-simple-query", perf::scylla_simple_query_main, "run performance tests by sending simple queries to this server"},
{"perf-sstable", perf::scylla_sstable_main, "run performance tests by exercising sstable related operations on this server"},
{"perf-alternator", perf::alternator(scylla_main, &after_init_func), "run performance tests on full alternator stack"}
{"perf-alternator", perf::alternator(scylla_main, &after_init_func), "run performance tests on full alternator stack"},
{"perf-cql-raw", perf::perf_cql_raw(scylla_main, &after_init_func), "run performance tests using raw CQL protocol frames"}
};
main_func_type main_func;

View File

@@ -316,7 +316,7 @@ auto fmt::formatter<mutation>::format(const mutation& m, fmt::format_context& ct
++column_iterator;
}
return fmt::format_to(out, "token: {}}}, {}\n}}", dk._token, mutation_partition::printer(s, m.partition()));
return fmt::format_to(out, "token: {}}}, {}\n}}", dk.token(), mutation_partition::printer(s, m.partition()));
}
namespace mutation_json {

View File

@@ -126,7 +126,7 @@ public:
const partition_key& key() const { return _ptr->_dk._key; };
const dht::decorated_key& decorated_key() const { return _ptr->_dk; };
dht::ring_position ring_position() const { return { decorated_key() }; }
const dht::token& token() const { return _ptr->_dk._token; }
dht::token token() const { return _ptr->_dk.token(); }
const schema_ptr& schema() const { return _ptr->_schema; }
const mutation_partition& partition() const { return _ptr->_p; }
mutation_partition& partition() { return _ptr->_p; }

View File

@@ -7,6 +7,7 @@
*/
#include <seastar/core/seastar.hh>
#include <seastar/core/shard_id.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/with_scheduling_group.hh>
#include <seastar/coroutine/maybe_yield.hh>
@@ -23,7 +24,6 @@
#include "replica/data_dictionary_impl.hh"
#include "replica/compaction_group.hh"
#include "replica/query_state.hh"
#include "seastar/core/shard_id.hh"
#include "sstables/shared_sstable.hh"
#include "sstables/sstable_set.hh"
#include "sstables/sstables.hh"

View File

@@ -98,16 +98,6 @@ future<> service::client_state::has_column_family_access(const sstring& ks,
co_return co_await has_access(ks, {p, r, t}, is_vector_indexed);
}
future<> service::client_state::has_schema_access(const schema& s, auth::permission p) const {
auth::resource r = auth::make_data_resource(s.ks_name(), s.cf_name());
co_return co_await has_access(s.ks_name(), {p, r});
}
future<> service::client_state::has_schema_access(const sstring& ks_name, const sstring& cf_name, auth::permission p) const {
auth::resource r = auth::make_data_resource(ks_name, cf_name);
co_return co_await has_access(ks_name, {p, r});
}
future<> service::client_state::check_internal_table_permissions(std::string_view ks, std::string_view table_name, const auth::command_desc& cmd) const {
// 1. CDC and $paxos tables are managed internally by Scylla. Users are prohibited
// from running ALTER or DROP commands on them.
@@ -363,4 +353,4 @@ future<> service::client_state::set_client_options(
});
_client_options.emplace_back(std::move(cached_key), std::move(cached_value));
}
}
}

View File

@@ -359,8 +359,6 @@ public:
future<> has_keyspace_access(const sstring&, auth::permission) const;
future<> has_column_family_access(const sstring&, const sstring&, auth::permission,
auth::command_desc::type = auth::command_desc::type::OTHER, std::optional<bool> is_vector_indexed = std::nullopt) const;
future<> has_schema_access(const schema& s, auth::permission p) const;
future<> has_schema_access(const sstring&, const sstring&, auth::permission p) const;
future<> has_functions_access(auth::permission p) const;
future<> has_functions_access(const sstring& ks, auth::permission p) const;

View File

@@ -338,7 +338,7 @@ future<> group0_state_machine::merge_and_apply(group0_state_machine_merger& merg
}
#ifndef SCYLLA_BUILD_MODE_RELEASE
static void ensure_group0_schema(const group0_command& cmd, const replica::database& db) {
static void ensure_group0_schema(const group0_command& cmd, data_dictionary::database db) {
auto validate_schema = [&db](const utils::chunked_vector<canonical_mutation>& mutations) {
for (const auto& mut : mutations) {
// Get the schema for the column family
@@ -382,7 +382,7 @@ future<> group0_state_machine::apply(std::vector<raft::command_cref> command) {
// max_mutation_size = 1/2 of commitlog segment size, thus max_command_size is set 1/3 of commitlog segment size to leave space for metadata.
size_t max_command_size = _sp.data_dictionary().get_config().commitlog_segment_size_in_mb() * 1024 * 1024 / 3;
group0_state_machine_merger m(co_await _client.sys_ks().get_last_group0_state_id(), std::move(read_apply_mutex_holder),
group0_state_machine_merger m(co_await _client.get_last_group0_state_id(), std::move(read_apply_mutex_holder),
max_command_size, _sp.data_dictionary());
for (auto&& c : command) {
@@ -392,7 +392,7 @@ future<> group0_state_machine::apply(std::vector<raft::command_cref> command) {
#ifndef SCYLLA_BUILD_MODE_RELEASE
// Ensure that the schema of the mutations is a group0 schema.
// This validation is supposed to be only performed in tests, so it is skipped in the release mode.
ensure_group0_schema(cmd, _client.sys_ks().local_db());
ensure_group0_schema(cmd, _sp.data_dictionary());
#endif
slogger.trace("cmd: prev_state_id: {}, new_state_id: {}, creator_addr: {}, creator_id: {}",

View File

@@ -245,6 +245,10 @@ utils::UUID raft_group0_client::generate_group0_state_id(utils::UUID prev_state_
return utils::UUID_gen::get_random_time_UUID_from_micros(std::chrono::microseconds{ts});
}
future<utils::UUID> raft_group0_client::get_last_group0_state_id() {
return _sys_ks.get_last_group0_state_id();
}
future<group0_guard> raft_group0_client::start_operation(seastar::abort_source& as, std::optional<raft_timeout> timeout) {
if (this_shard_id() != 0) {
on_internal_error(logger, "start_group0_operation: must run on shard 0");
@@ -282,7 +286,7 @@ future<group0_guard> raft_group0_client::start_operation(seastar::abort_source&
// Read barrier may wait for `group0_state_machine::apply` which also takes this mutex.
auto read_apply_holder = co_await hold_read_apply_mutex(as);
auto observed_group0_state_id = co_await _sys_ks.get_last_group0_state_id();
auto observed_group0_state_id = co_await get_last_group0_state_id();
auto new_group0_state_id = generate_group0_state_id(observed_group0_state_id);
co_return group0_guard {
@@ -467,10 +471,6 @@ future<semaphore_units<>> raft_group0_client::hold_read_apply_mutex(abort_source
return get_units(_read_apply_mutex, 1, as);
}
db::system_keyspace& raft_group0_client::sys_ks() {
return _sys_ks;
}
bool raft_group0_client::in_recovery() const {
return _upgrade_state == group0_upgrade_state::recovery;
}

View File

@@ -200,8 +200,6 @@ public:
future<semaphore_units<>> hold_read_apply_mutex(abort_source&);
db::system_keyspace& sys_ks();
bool in_recovery() const;
gc_clock::duration get_history_gc_duration() const;
@@ -212,6 +210,7 @@ public:
query_result_guard create_result_guard(utils::UUID query_id);
void set_query_result(utils::UUID query_id, service::broadcast_tables::query_result qr);
static utils::UUID generate_group0_state_id(utils::UUID prev_state_id);
future<utils::UUID> get_last_group0_state_id();
};
using mutations_generator = coroutine::experimental::generator<mutation>;

View File

@@ -3221,9 +3221,9 @@ storage_proxy::storage_proxy(sharded<replica::database>& db, storage_proxy::conf
, _write_ack_smp_service_group(cfg.write_ack_smp_service_group)
, _next_response_id(std::chrono::system_clock::now().time_since_epoch()/1ms)
, _hints_resource_manager(*this, cfg.available_memory / 10, _db.local().get_config().max_hinted_handoff_concurrency)
, _hints_manager(*this, _db.local().get_config().hints_directory(), cfg.hinted_handoff_enabled, _db.local().get_config().max_hint_window_in_ms(), _hints_resource_manager, _db)
, _hints_manager(*this, _db.local().get_config().hints_directory(), cfg.hinted_handoff_enabled, _db.local().get_config().max_hint_window_in_ms(), _hints_resource_manager, _db, cfg.hints_sched_group)
, _hints_directory_initializer(std::move(cfg.hints_directory_initializer))
, _hints_for_views_manager(*this, _db.local().get_config().view_hints_directory(), {}, _db.local().get_config().max_hint_window_in_ms(), _hints_resource_manager, _db)
, _hints_for_views_manager(*this, _db.local().get_config().view_hints_directory(), {}, _db.local().get_config().max_hint_window_in_ms(), _hints_resource_manager, _db, cfg.hints_sched_group)
, _stats_key(stats_key)
, _features(feat)
, _background_write_throttle_threahsold(cfg.available_memory / 10)

View File

@@ -195,6 +195,7 @@ public:
// they need a separate smp_service_group to prevent an ABBA deadlock
// with writes.
smp_service_group write_ack_smp_service_group = default_smp_service_group();
scheduling_group hints_sched_group;
};
private:

View File

@@ -90,14 +90,14 @@ load_balancer_stats_manager::load_balancer_stats_manager(sstring group_name):
setup_metrics(_cluster_stats);
}
load_balancer_dc_stats& load_balancer_stats_manager::for_dc(const dc_name& dc) {
const lw_shared_ptr<load_balancer_dc_stats>& load_balancer_stats_manager::for_dc(const dc_name& dc) {
auto it = _dc_stats.find(dc);
if (it == _dc_stats.end()) {
auto stats = std::make_unique<load_balancer_dc_stats>();
auto stats = make_lw_shared<load_balancer_dc_stats>();
setup_metrics(dc, *stats);
it = _dc_stats.emplace(dc, std::move(stats)).first;
}
return *it->second;
return it->second;
}
load_balancer_node_stats& load_balancer_stats_manager::for_node(const dc_name& dc, host_id node) {
@@ -149,22 +149,22 @@ db::tablet_options combine_tablet_options(R&& opts) {
static std::unordered_set<locator::tablet_id> split_string_to_tablet_id(std::string_view s, char delimiter) {
auto tokens_view = s | std::views::split(delimiter)
| std::views::transform([](auto&& range) {
return std::string_view(&*range.begin(), std::ranges::distance(range));
})
| std::views::transform([](std::string_view sv) {
return locator::tablet_id(std::stoul(std::string(sv)));
});
| std::views::transform([](auto&& range) {
return std::string_view(&*range.begin(), std::ranges::distance(range));
})
| std::views::transform([](std::string_view sv) {
return locator::tablet_id(std::stoul(std::string(sv)));
});
return std::unordered_set<locator::tablet_id>{tokens_view.begin(), tokens_view.end()};
}
struct repair_plan {
locator::global_tablet_id gid;
locator::tablet_info tinfo;
dht::token_range range;
dht::token last_token;
db_clock::duration repair_time_diff;
bool is_user_reuqest;
locator::global_tablet_id gid;
locator::tablet_info tinfo;
dht::token_range range;
dht::token last_token;
db_clock::duration repair_time_diff;
bool is_user_reuqest;
};
// Used to compare different migration choices in regard to impact on load imbalance.
@@ -291,6 +291,12 @@ struct rack_list_colocation_state {
}
};
/// Formattable wrapper for migration_plan, whose formatter prints a short summary of the plan.
struct plan_summary {
migration_plan& plan;
explicit plan_summary(migration_plan& plan) : plan(plan) {}
};
future<rack_list_colocation_state> find_required_rack_list_colocations(
replica::database& db,
token_metadata_ptr tmptr,
@@ -452,7 +458,36 @@ struct fmt::formatter<service::repair_plan> : fmt::formatter<std::string_view> {
template <typename FormatContext>
auto format(const service::repair_plan& p, FormatContext& ctx) const {
auto diff_seconds = std::chrono::duration<float>(p.repair_time_diff).count();
fmt::format_to(ctx.out(), "{{tablet={} last_token={} is_user_req={} diff_seconds={}}}", p.gid, p.last_token, p.is_user_reuqest, diff_seconds);
fmt::format_to(ctx.out(), "{{tablet={} last_token={} is_user_req={} diff_seconds={}}}", p.gid, p.last_token, p.is_user_reuqest, diff_seconds);
return ctx.out();
}
};
template<>
struct fmt::formatter<service::plan_summary> : fmt::formatter<std::string_view> {
template <typename FormatContext>
auto format(const service::plan_summary& p, FormatContext& ctx) const {
auto& plan = p.plan;
std::string_view delim = "";
auto get_delim = [&] { return std::exchange(delim, ", "); };
if (plan.migrations().size()) {
fmt::format_to(ctx.out(), "{}migrations: {}", get_delim(), plan.migrations().size());
}
if (plan.repair_plan().repairs().size()) {
fmt::format_to(ctx.out(), "{}repairs: {}", get_delim(), plan.repair_plan().repairs().size());
}
if (plan.resize_plan().resize.size()) {
fmt::format_to(ctx.out(), "{}resize: {}", get_delim(), plan.resize_plan().resize.size());
}
if (plan.resize_plan().finalize_resize.size()) {
fmt::format_to(ctx.out(), "{}resize-ready: {}", get_delim(), plan.resize_plan().finalize_resize.size());
}
if (plan.rack_list_colocation_plan().size()) {
fmt::format_to(ctx.out(), "{}rack-list colocation ready: {}", get_delim(), plan.rack_list_colocation_plan().request_to_resume());
}
if (delim.empty()) {
fmt::format_to(ctx.out(), "empty");
}
return ctx.out();
}
};
@@ -868,9 +903,12 @@ class load_balancer {
absl::flat_hash_map<table_id, uint64_t> _disk_used_per_table;
dc_name _dc;
std::optional<sstring> _rack; // Set when plan making is limited to a single rack.
sstring _location; // Name of the current scope of plan making. DC or DC+rack.
lw_shared_ptr<load_balancer_dc_stats> _current_stats; // Stats for current scope of plan making.
size_t _total_capacity_shards; // Total number of non-drained shards in the balanced node set.
size_t _total_capacity_nodes; // Total number of non-drained nodes in the balanced node set.
uint64_t _total_capacity_storage; // Total storage of non-drained nodes in the balanced node set.
size_t _migrating_candidates; // Number of candidate replicas skipped because tablet is migrating.
locator::load_stats_ptr _table_load_stats;
load_balancer_stats_manager& _stats;
std::unordered_set<host_id> _skiplist;
@@ -913,6 +951,8 @@ private:
return true;
case tablet_transition_stage::write_both_read_old:
return true;
case tablet_transition_stage::write_both_read_old_fallback_cleanup:
return false;
case tablet_transition_stage::streaming:
return true;
case tablet_transition_stage::rebuild_repair:
@@ -995,22 +1035,21 @@ public:
migration_plan plan;
auto rack_list_colocation = ongoing_rack_list_colocation();
if (!utils::get_local_injector().enter("tablet_migration_bypass")) {
// Prepare plans for each DC separately and combine them to be executed in parallel.
for (auto&& dc : topo.get_datacenters()) {
if (_db.get_config().rf_rack_valid_keyspaces() || _db.get_config().enforce_rack_list() || rack_list_colocation) {
for (auto rack : topo.get_datacenter_racks().at(dc) | std::views::keys) {
auto rack_plan = co_await make_plan(dc, rack);
auto level = rack_plan.size() > 0 ? seastar::log_level::info : seastar::log_level::debug;
lblogger.log(level, "Prepared {} migrations in rack {} in DC {}", rack_plan.size(), rack, dc);
plan.merge(std::move(rack_plan));
}
} else {
auto dc_plan = co_await make_plan(dc);
auto level = dc_plan.size() > 0 ? seastar::log_level::info : seastar::log_level::debug;
lblogger.log(level, "Prepared {} migrations in DC {}", dc_plan.size(), dc);
plan.merge(std::move(dc_plan));
// Prepare plans for each DC separately and combine them to be executed in parallel.
for (auto&& dc : topo.get_datacenters()) {
if (_db.get_config().rf_rack_valid_keyspaces() || _db.get_config().enforce_rack_list() || rack_list_colocation) {
for (auto rack : topo.get_datacenter_racks().at(dc) | std::views::keys) {
auto rack_plan = co_await make_plan(dc, rack);
auto level = rack_plan.empty() ? seastar::log_level::debug : seastar::log_level::info;
lblogger.log(level, "Plan for {}/{}: {}", dc, rack, plan_summary(rack_plan));
plan.merge(std::move(rack_plan));
}
} else {
auto dc_plan = co_await make_plan(dc);
auto level = dc_plan.empty() ? seastar::log_level::debug : seastar::log_level::info;
lblogger.log(level, "Plan for {}: {}", dc, plan_summary(dc_plan));
plan.merge(std::move(dc_plan));
}
}
@@ -1027,9 +1066,8 @@ public:
plan.set_repair_plan(co_await make_repair_plan(plan));
}
auto level = plan.size() > 0 ? seastar::log_level::info : seastar::log_level::debug;
lblogger.log(level, "Prepared {} migration plans, out of which there were {} tablet migration(s) and {} resize decision(s) and {} tablet repair(s) and {} rack-list colocation(s)",
plan.size(), plan.tablet_migration_count(), plan.resize_decision_count(), plan.tablet_repair_count(), plan.tablet_rack_list_colocation_count());
auto level = plan.empty() ? seastar::log_level::debug : seastar::log_level::info;
lblogger.log(level, "Prepared plan: {}", plan_summary(plan));
co_return std::move(plan);
}
@@ -1408,7 +1446,7 @@ public:
co_return all_colocated;
}
future<migration_plan> make_merge_colocation_plan(const dc_name& dc, node_load_map& nodes) {
future<migration_plan> make_merge_colocation_plan(node_load_map& nodes) {
migration_plan plan;
table_resize_plan resize_plan;
@@ -1565,7 +1603,7 @@ public:
if (cross_rack_migration(src, dst)) {
// FIXME: This is illegal if table has views, as it breaks base-view pairing.
// Can happen when RF!=#racks.
_stats.for_dc(_dc).cross_rack_collocations++;
_current_stats->cross_rack_collocations++;
lblogger.debug("Cross-rack co-location migration for {}@{} (rack: {}) to co-habit {}@{} (rack: {})",
t2_id, src, rack_of(src), t1_id, dst, rack_of(dst));
utils::get_local_injector().inject("forbid_cross_rack_migration_attempt", [&] {
@@ -2215,7 +2253,7 @@ public:
// Evaluates impact on load balance of migrating a tablet set of a given table to dst.
migration_badness evaluate_dst_badness(node_load_map& nodes, table_id table, tablet_replica dst, uint64_t tablet_set_disk_size) {
_stats.for_dc(_dc).candidates_evaluated++;
_current_stats->candidates_evaluated++;
auto& node_info = nodes[dst.host];
@@ -2254,7 +2292,7 @@ public:
// Evaluates impact on load balance of migrating a tablet set of a given table from src.
migration_badness evaluate_src_badness(node_load_map& nodes, table_id table, tablet_replica src, uint64_t tablet_set_disk_size) {
_stats.for_dc(_dc).candidates_evaluated++;
_current_stats->candidates_evaluated++;
auto& node_info = nodes[src.host];
@@ -2603,15 +2641,15 @@ public:
auto mig_streaming_info = get_migration_streaming_infos(_tm->get_topology(), tmap, mig);
if (!can_accept_load(nodes, mig_streaming_info)) {
_stats.for_dc(node_load.dc()).migrations_skipped++;
_current_stats->migrations_skipped++;
lblogger.debug("Unable to balance {}: load limit reached", host);
break;
}
apply_load(nodes, mig_streaming_info);
lblogger.debug("Adding migration: {} size: {}", mig, tablets.tablet_set_disk_size);
_stats.for_dc(node_load.dc()).migrations_produced++;
_stats.for_dc(node_load.dc()).intranode_migrations_produced++;
_current_stats->migrations_produced++;
_current_stats->intranode_migrations_produced++;
mark_as_scheduled(mig);
plan.add(std::move(mig));
@@ -2718,21 +2756,21 @@ public:
auto targets = get_viable_targets();
if (rs->is_rack_based(_dc)) {
lblogger.debug("candidate tablet {} skipped because RF is rack-based and it's in a different rack", tablet);
_stats.for_dc(src_info.dc()).tablets_skipped_rack++;
_current_stats->tablets_skipped_rack++;
return skip_info{std::move(targets)};
}
if (!targets.contains(dst_info.id)) {
auto new_rack_load = rack_load[dst_info.rack()] + 1;
lblogger.debug("candidate tablet {} skipped because it would increase load on rack {} to {}, max={}",
tablet, dst_info.rack(), new_rack_load, max_rack_load);
_stats.for_dc(src_info.dc()).tablets_skipped_rack++;
_current_stats->tablets_skipped_rack++;
return skip_info{std::move(targets)};
}
}
for (auto&& r : tmap.get_tablet_info(tablet.tablet).replicas) {
if (r.host == dst_info.id) {
_stats.for_dc(src_info.dc()).tablets_skipped_node++;
_current_stats->tablets_skipped_node++;
lblogger.debug("candidate tablet {} skipped because it has a replica on target node", tablet);
if (need_viable_targets) {
return skip_info{get_viable_targets()};
@@ -2939,7 +2977,7 @@ public:
};
if (min_candidate.badness.is_bad() && _use_table_aware_balancing) {
_stats.for_dc(_dc).bad_first_candidates++;
_current_stats->bad_first_candidates++;
// Consider better alternatives.
if (drain_skipped) {
@@ -3060,7 +3098,7 @@ public:
lblogger.debug("Table {} shard overcommit: {}", table, overcommit);
}
future<migration_plan> make_internode_plan(const dc_name& dc, node_load_map& nodes,
future<migration_plan> make_internode_plan(node_load_map& nodes,
const std::unordered_set<host_id>& nodes_to_drain,
host_id target) {
migration_plan plan;
@@ -3120,7 +3158,7 @@ public:
if (nodes_by_load.empty()) {
lblogger.debug("No more candidate nodes");
_stats.for_dc(dc).stop_no_candidates++;
_current_stats->stop_no_candidates++;
break;
}
@@ -3191,7 +3229,7 @@ public:
if (nodes_by_load_dst.empty()) {
lblogger.debug("No more target nodes");
_stats.for_dc(dc).stop_no_candidates++;
_current_stats->stop_no_candidates++;
break;
}
@@ -3221,7 +3259,7 @@ public:
const load_type max_load = std::max(max_off_candidate_load, src_node_info.avg_load);
if (is_balanced(target_info.avg_load, max_load)) {
lblogger.debug("Balance achieved.");
_stats.for_dc(dc).stop_balance++;
_current_stats->stop_balance++;
break;
}
}
@@ -3255,7 +3293,7 @@ public:
auto& tmap = tmeta.get_tablet_map(source_tablets.table());
if (can_check_convergence && !check_convergence(src_node_info, target_info, source_tablets)) {
lblogger.debug("No more candidates. Load would be inverted.");
_stats.for_dc(dc).stop_load_inversion++;
_current_stats->stop_load_inversion++;
break;
}
@@ -3289,11 +3327,11 @@ public:
}
}
if (candidate.badness.is_bad()) {
_stats.for_dc(_dc).bad_migrations++;
_current_stats->bad_migrations++;
}
if (drain_skipped) {
_stats.for_dc(_dc).migrations_from_skiplist++;
_current_stats->migrations_from_skiplist++;
}
if (src_node_info.req && *src_node_info.req == topology_request::leave && src_node_info.excluded) {
@@ -3313,7 +3351,7 @@ public:
if (can_accept_load(nodes, mig_streaming_info)) {
apply_load(nodes, mig_streaming_info);
lblogger.debug("Adding migration: {} size: {}", mig, source_tablets.tablet_set_disk_size);
_stats.for_dc(dc).migrations_produced++;
_current_stats->migrations_produced++;
mark_as_scheduled(mig);
plan.add(std::move(mig));
} else {
@@ -3324,10 +3362,10 @@ public:
// Just because the next migration is blocked doesn't mean we could not proceed with migrations
// for other shards which are produced by the planner subsequently.
skipped_migrations++;
_stats.for_dc(dc).migrations_skipped++;
_current_stats->migrations_skipped++;
if (skipped_migrations >= max_skipped_migrations) {
lblogger.debug("Too many migrations skipped, aborting balancing");
_stats.for_dc(dc).stop_skip_limit++;
_current_stats->stop_skip_limit++;
break;
}
}
@@ -3346,7 +3384,7 @@ public:
}
if (plan.size() == batch_size) {
_stats.for_dc(dc).stop_batch_size++;
_current_stats->stop_batch_size++;
}
if (plan.empty()) {
@@ -3363,7 +3401,13 @@ public:
// If there are 7 tablets and RF=3, each node must have 1 tablet replica.
// So node3 will have average load of 1, and node1 and node2 will have
// average shard load of 7.
lblogger.info("Not possible to achieve balance.");
// Show when this is the final plan with no active migrations left to execute,
// otherwise it may just be a temporary situation due to lack of candidates.
if (_migrating_candidates == 0) {
lblogger.info("Not possible to achieve balance in {}", _location);
print_node_stats(nodes, only_active::no);
}
}
co_return std::move(plan);
@@ -3420,11 +3464,37 @@ public:
}
};
using only_active = bool_class<struct only_active_tag>;
void print_node_stats(node_load_map& nodes, only_active only_active_) {
for (auto&& [host, load] : nodes) {
size_t read = 0;
size_t write = 0;
for (auto& shard_load : load.shards) {
read += shard_load.streaming_read_load;
write += shard_load.streaming_write_load;
}
auto level = !only_active_ || (read + write) > 0 ? seastar::log_level::info : seastar::log_level::debug;
lblogger.log(level, "Node {}: {}/{} load={:.6f} tablets={} shards={} tablets/shard={:.3f} state={} cap={}"
" rd={} wr={}",
host, load.dc(), load.rack(), load.avg_load, load.tablet_count, load.shard_count,
load.tablets_per_shard(), load.state(), load.dusage->capacity, read, write);
}
}
future<migration_plan> make_plan(dc_name dc, std::optional<sstring> rack = std::nullopt) {
migration_plan plan;
if (utils::get_local_injector().enter("tablet_migration_bypass")) {
co_return std::move(plan);
}
_dc = dc;
_rack = rack;
_location = fmt::format("{}{}", dc, rack ? fmt::format("/{}", *rack) : "");
_current_stats = _stats.for_dc(dc);
auto _ = seastar::defer([&] { _current_stats = nullptr; });
_migrating_candidates = 0;
auto node_filter = [&] (const locator::node& node) {
return node.dc_rack().dc == dc && (!rack || node.dc_rack().rack == *rack);
@@ -3433,7 +3503,7 @@ public:
// Causes load balancer to move some tablet even though load is balanced.
auto shuffle = in_shuffle_mode();
_stats.for_dc(dc).calls++;
_current_stats->calls++;
lblogger.debug("Examining DC {} rack {} (shuffle={}, balancing={}, tablets_per_shard_goal={}, force_capacity_based_balancing={})",
dc, rack, shuffle, _tm->tablets().balancing_enabled(), _tablets_per_shard_goal, _force_capacity_based_balancing);
@@ -3529,7 +3599,7 @@ public:
if (nodes.empty()) {
lblogger.debug("No nodes to balance.");
_stats.for_dc(dc).stop_balance++;
_current_stats->stop_balance++;
co_return plan;
}
@@ -3552,15 +3622,23 @@ public:
// If we don't have nodes to drain, remove nodes which don't have complete tablet sizes
if (nodes_to_drain.empty()) {
std::optional<host_id> incomplete_host;
size_t incomplete_count = 0;
for (auto nodes_i = nodes.begin(); nodes_i != nodes.end();) {
host_id host = nodes_i->first;
if (!_load_sketch->has_complete_data(host)) {
lblogger.info("Node {} does not have complete tablet stats, ignoring", nodes_i->first);
incomplete_host.emplace(host);
incomplete_count++;
nodes_i = nodes.erase(nodes_i);
} else {
++nodes_i;
}
}
if (incomplete_host) {
lblogger.info("Ignoring {} node(s) with incomplete tablet stats, e.g. {}", incomplete_count, *incomplete_host);
}
}
plan.set_has_nodes_to_drain(!nodes_to_drain.empty());
@@ -3594,11 +3672,11 @@ public:
});
if (!has_dest_nodes) {
for (auto host : nodes_to_drain) {
plan.add(drain_failure(host, format("No candidate nodes in DC {} to drain {}."
" Consider adding new nodes or reducing replication factor.", dc, host)));
plan.add(drain_failure(host, format("No candidate nodes in {} to drain {}."
" Consider adding new nodes or reducing replication factor.", _location, host)));
}
lblogger.debug("No candidate nodes");
_stats.for_dc(dc).stop_no_candidates++;
_current_stats->stop_no_candidates++;
co_return plan;
}
@@ -3704,6 +3782,8 @@ public:
if (!migrating(t1) && !migrating(t2)) {
auto candidate = colocated_tablets{global_tablet_id{table, t1.tid}, global_tablet_id{table, t2->tid}};
add_candidate(shard_load_info, migration_tablet_set{std::move(candidate), tablet_sizes_sum});
} else {
_migrating_candidates++;
}
} else {
if (tids.size() != tablet_sizes.size()) {
@@ -3712,6 +3792,8 @@ public:
for (size_t i = 0; i < tids.size(); i++) {
if (!migrating(get_table_desc(tids[i]))) { // migrating tablets are not candidates
add_candidate(shard_load_info, migration_tablet_set{global_tablet_id{table, tids[i]}, tablet_sizes[i]});
} else {
_migrating_candidates++;
}
}
}
@@ -3749,26 +3831,14 @@ public:
}
}
for (auto&& [host, load] : nodes) {
size_t read = 0;
size_t write = 0;
for (auto& shard_load : load.shards) {
read += shard_load.streaming_read_load;
write += shard_load.streaming_write_load;
}
auto level = (read + write) > 0 ? seastar::log_level::info : seastar::log_level::debug;
lblogger.log(level, "Node {}: dc={} rack={} load={} tablets={} shards={} tablets/shard={} state={} cap={}"
" stream_read={} stream_write={}",
host, dc, load.rack(), load.avg_load, load.tablet_count, load.shard_count,
load.tablets_per_shard(), load.state(), load.dusage->capacity, read, write);
}
print_node_stats(nodes, only_active::yes);
if (!nodes_to_drain.empty() || (_tm->tablets().balancing_enabled() && (shuffle || !is_balanced(min_load, max_load)))) {
host_id target = *min_load_node;
lblogger.info("target node: {}, avg_load: {}, max: {}", target, min_load, max_load);
plan.merge(co_await make_internode_plan(dc, nodes, nodes_to_drain, target));
plan.merge(co_await make_internode_plan(nodes, nodes_to_drain, target));
} else {
_stats.for_dc(dc).stop_balance++;
_current_stats->stop_balance++;
}
if (_tm->tablets().balancing_enabled()) {
@@ -3776,9 +3846,9 @@ public:
}
if (_tm->tablets().balancing_enabled() && plan.empty() && !ongoing_rack_list_colocation()) {
auto dc_merge_plan = co_await make_merge_colocation_plan(dc, nodes);
auto dc_merge_plan = co_await make_merge_colocation_plan(nodes);
auto level = dc_merge_plan.tablet_migration_count() > 0 ? seastar::log_level::info : seastar::log_level::debug;
lblogger.log(level, "Prepared {} migrations for co-locating sibling tablets in DC {}", dc_merge_plan.tablet_migration_count(), dc);
lblogger.log(level, "Prepared {} migrations for co-locating sibling tablets in {}", dc_merge_plan.tablet_migration_count(), _location);
plan.merge(std::move(dc_merge_plan));
}
@@ -3792,6 +3862,7 @@ class tablet_allocator_impl : public tablet_allocator::impl
service::migration_notifier& _migration_notifier;
replica::database& _db;
load_balancer_stats_manager _load_balancer_stats;
scheduling_group _background;
bool _stopped = false;
bool _use_tablet_aware_balancing = true;
locator::load_stats_ptr _load_stats;
@@ -3813,7 +3884,9 @@ public:
tablet_allocator_impl(tablet_allocator::config cfg, service::migration_notifier& mn, replica::database& db)
: _migration_notifier(mn)
, _db(db)
, _load_balancer_stats("load_balancer") {
, _load_balancer_stats("load_balancer")
, _background(cfg.background_sg)
{
_migration_notifier.register_listener(this);
}
@@ -3830,7 +3903,7 @@ public:
future<migration_plan> balance_tablets(token_metadata_ptr tm, service::topology* topology, db::system_keyspace* sys_ks, locator::load_stats_ptr table_load_stats, std::unordered_set<host_id> skiplist) {
auto lb = make_load_balancer(tm, topology, sys_ks, table_load_stats ? table_load_stats : _load_stats, std::move(skiplist));
co_await coroutine::switch_to(_db.get_streaming_scheduling_group());
co_await coroutine::switch_to(_background);
co_return co_await lb.make_plan();
}

View File

@@ -100,7 +100,7 @@ class load_balancer_stats_manager {
using host_id = locator::host_id;
sstring group_name;
std::unordered_map<dc_name, std::unique_ptr<load_balancer_dc_stats>> _dc_stats;
std::unordered_map<dc_name, lw_shared_ptr<load_balancer_dc_stats>> _dc_stats;
std::unordered_map<host_id, std::unique_ptr<load_balancer_node_stats>> _node_stats;
load_balancer_cluster_stats _cluster_stats;
seastar::metrics::label dc_label{"target_dc"};
@@ -113,7 +113,7 @@ class load_balancer_stats_manager {
public:
load_balancer_stats_manager(sstring group_name);
load_balancer_dc_stats& for_dc(const dc_name& dc);
const lw_shared_ptr<load_balancer_dc_stats>& for_dc(const dc_name& dc);
load_balancer_node_stats& for_node(const dc_name& dc, host_id node);
load_balancer_cluster_stats& for_cluster();
@@ -196,7 +196,7 @@ public:
bool has_nodes_to_drain() const { return _has_nodes_to_drain; }
const migrations_vector& migrations() const { return _migrations; }
bool empty() const { return _migrations.empty() && !_resize_plan.size() && !_repair_plan.size() && !_rack_list_colocation_plan.size() && _drain_failures.empty(); }
bool empty() const { return !size(); }
size_t size() const { return _migrations.size() + _resize_plan.size() + _repair_plan.size() + _rack_list_colocation_plan.size() + _drain_failures.size(); }
size_t tablet_migration_count() const { return _migrations.size(); }
size_t resize_decision_count() const { return _resize_plan.size(); }
@@ -257,6 +257,7 @@ class migration_notifier;
class tablet_allocator {
public:
struct config {
scheduling_group background_sg;
};
class impl {
public:

View File

@@ -1539,7 +1539,7 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
case locator::tablet_transition_stage::allow_write_both_read_old:
if (action_failed(tablet_state.barriers[trinfo.stage])) {
if (check_excluded_replicas()) {
transition_to_with_barrier(locator::tablet_transition_stage::cleanup_target);
transition_to(locator::tablet_transition_stage::cleanup_target);
break;
}
}
@@ -1560,7 +1560,7 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
case locator::tablet_transition_stage::write_both_read_old:
if (action_failed(tablet_state.barriers[trinfo.stage])) {
if (check_excluded_replicas()) {
transition_to_with_barrier(locator::tablet_transition_stage::cleanup_target);
transition_to(locator::tablet_transition_stage::cleanup_target);
break;
}
}
@@ -1570,17 +1570,18 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
transition_to_with_barrier(locator::tablet_transition_stage::streaming);
}
break;
case locator::tablet_transition_stage::write_both_read_old_fallback_cleanup:
transition_to_with_barrier(locator::tablet_transition_stage::cleanup_target);
break;
case locator::tablet_transition_stage::rebuild_repair: {
if (action_failed(tablet_state.rebuild_repair)) {
bool fail = utils::get_local_injector().enter("rebuild_repair_stage_fail");
if (fail || check_excluded_replicas()) {
if (do_barrier()) {
rtlogger.debug("Will set tablet {} stage to {}", gid, locator::tablet_transition_stage::cleanup_target);
updates.emplace_back(get_mutation_builder()
.set_stage(last_token, locator::tablet_transition_stage::cleanup_target)
.del_session(last_token)
.build());
}
rtlogger.debug("Will set tablet {} stage to {}", gid, locator::tablet_transition_stage::cleanup_target);
updates.emplace_back(get_mutation_builder()
.set_stage(last_token, locator::tablet_transition_stage::cleanup_target)
.del_session(last_token)
.build());
break;
}
}
@@ -1653,13 +1654,11 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
}
if (rollback) {
if (do_barrier()) {
rtlogger.debug("Will set tablet {} stage to {}: {}", gid, locator::tablet_transition_stage::cleanup_target, *rollback);
updates.emplace_back(get_mutation_builder()
.set_stage(last_token, locator::tablet_transition_stage::cleanup_target)
.del_session(last_token)
.build());
}
rtlogger.debug("Will set tablet {} stage to {}: {}", gid, locator::tablet_transition_stage::cleanup_target, *rollback);
updates.emplace_back(get_mutation_builder()
.set_stage(last_token, locator::tablet_transition_stage::cleanup_target)
.del_session(last_token)
.build());
break;
}
}
@@ -1696,7 +1695,6 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
_exit(1);
});
auto next_stage = locator::tablet_transition_stage::use_new;
if (action_failed(tablet_state.barriers[trinfo.stage])) {
auto& tinfo = tmap.get_tablet_info(gid.tablet);
unsigned excluded_old = 0;
@@ -1718,10 +1716,15 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
// than excluded_old for intra-node migration.
if (excluded_new > excluded_old && trinfo.transition != locator::tablet_transition_kind::intranode_migration) {
rtlogger.debug("During {} stage of {} {} new nodes and {} old nodes were excluded", trinfo.stage, gid, excluded_new, excluded_old);
next_stage = locator::tablet_transition_stage::cleanup_target;
if (_feature_service.tablets_intermediate_fallback_cleanup) {
transition_to(locator::tablet_transition_stage::write_both_read_old_fallback_cleanup);
} else {
transition_to_with_barrier(locator::tablet_transition_stage::cleanup_target);
}
break;
}
}
transition_to_with_barrier(next_stage);
transition_to_with_barrier(locator::tablet_transition_stage::use_new);
}
break;
case locator::tablet_transition_stage::use_new:

View File

@@ -8,6 +8,7 @@
#include <seastar/core/coroutine.hh>
#include <seastar/core/iostream.hh>
#include <seastar/util/memory-data-source.hh>
#include "partition_reversing_data_source.hh"
#include "reader_permit.hh"
#include "sstables/consumer.hh"
@@ -15,7 +16,6 @@
#include "sstables/shared_sstable.hh"
#include "sstables/sstables.hh"
#include "sstables/types.hh"
#include "utils/buffer_input_stream.hh"
#include "utils/to_string.hh"
namespace sstables {
@@ -417,7 +417,7 @@ private:
}
_current_read_size = std::min(max_read_size, _current_read_size * 2);
}
co_return make_buffer_input_stream(_cached_read.share(_cached_read.size() - row_size, row_size));
co_return seastar::util::as_input_stream(_cached_read.share(_cached_read.size() - row_size, row_size));
}
temporary_buffer<char> last_row(size_t row_size) {
auto tmp = _cached_read.share(_cached_read.size() - row_size, row_size);

View File

@@ -95,6 +95,7 @@ public:
virtual future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity) override;
virtual future<data_sink> make_data_or_index_sink(sstable& sst, component_type type) override;
future<data_source> make_data_or_index_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const override;
future<data_source> make_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const override;
virtual future<data_sink> make_component_sink(sstable& sst, component_type type, open_flags oflags, file_output_stream_options options) override;
virtual future<> destroy(const sstable& sst) override { return make_ready_future<>(); }
virtual future<atomic_delete_context> atomic_delete_prepare(const std::vector<shared_sstable>&) const override;
@@ -132,8 +133,12 @@ future<data_sink> filesystem_storage::make_data_or_index_sink(sstable& sst, comp
}
}
future<data_source> filesystem_storage::make_data_or_index_source(sstable&, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const {
future<data_source> filesystem_storage::make_data_or_index_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const {
SCYLLA_ASSERT(type == component_type::Data || type == component_type::Index);
co_return co_await make_source(sst, type, std::move(f), offset, len, std::move(opt));
}
future<data_source> filesystem_storage::make_source(sstable&, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const {
co_return make_file_data_source(std::move(f), offset, len, std::move(opt));
}
@@ -615,6 +620,7 @@ public:
future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity) override;
future<data_sink> make_data_or_index_sink(sstable& sst, component_type type) override;
future<data_source> make_data_or_index_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const override;
future<data_source> make_source(sstable& sst, component_type type, file, uint64_t offset, uint64_t len, file_input_stream_options) const override;
future<data_sink> make_component_sink(sstable& sst, component_type type, open_flags oflags, file_output_stream_options options) override;
future<> destroy(const sstable& sst) override {
@@ -657,6 +663,7 @@ public:
{}
future<data_source> make_data_or_index_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const override;
future<data_source> make_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const override;
};
object_name object_storage_base::make_object_name(const sstable& sst, component_type type) const {
@@ -742,13 +749,23 @@ future<data_sink> object_storage_base::make_data_or_index_sink(sstable& sst, com
future<data_source>
object_storage_base::make_data_or_index_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options options) const {
co_return co_await make_source(sst, type, f, offset, len, options);
}
future<data_source>
object_storage_base::make_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options) const {
co_return co_await maybe_wrap_source(sst, type, _client->make_download_source(make_object_name(sst, type), abort_source()), offset, len);
}
future<data_source>
s3_storage::make_data_or_index_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options options) const {
co_return co_await make_source(sst, type, std::move(f), offset, len, std::move(options));
}
future<data_source>
s3_storage::make_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options options) const {
if (offset == 0) {
co_return co_await object_storage_base::make_data_or_index_source(sst, type, std::move(f), offset, len, std::move(options));
co_return co_await object_storage_base::make_source(sst, type, std::move(f), offset, len, std::move(options));
}
co_return make_file_data_source(
co_await maybe_wrap_file(sst, type, open_flags::ro, _client->make_readable_file(make_object_name(sst, type), abort_source())), offset, len, std::move(options));

View File

@@ -107,6 +107,7 @@ public:
virtual future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity) = 0;
virtual future<data_sink> make_data_or_index_sink(sstable& sst, component_type type) = 0;
virtual future<data_source> make_data_or_index_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const = 0;
virtual future<data_source> make_source(sstable& sst, component_type type, file f, uint64_t offset, uint64_t len, file_input_stream_options opt) const = 0;
virtual future<data_sink> make_component_sink(sstable& sst, component_type type, open_flags oflags, file_output_stream_options options) = 0;
virtual future<> destroy(const sstable& sst) = 0;
virtual future<atomic_delete_context> atomic_delete_prepare(const std::vector<shared_sstable>&) const = 0;

View File

@@ -43,7 +43,7 @@ lazy_comparable_bytes_from_ring_position::lazy_comparable_bytes_from_ring_positi
, _weight(bound_weight::equal)
, _pk(std::move(dk._key))
{
init_first_fragment(dk._token);
init_first_fragment(dk.token());
}
void lazy_comparable_bytes_from_ring_position::init_first_fragment(dht::token dht_token) {

View File

@@ -11,11 +11,13 @@
#include <seastar/core/map_reduce.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/shared_mutex.hh>
#include <seastar/core/units.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/coroutine/switch_to.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/rpc/rpc.hh>
#include "sstables_loader.hh"
#include "dht/auto_refreshing_sharder.hh"
#include "replica/distributed_loader.hh"
#include "replica/database.hh"
#include "sstables/sstables_manager.hh"
@@ -178,11 +180,13 @@ private:
};
class tablet_sstable_streamer : public sstable_streamer {
sharded<replica::database>& _db;
const locator::tablet_map& _tablet_map;
public:
tablet_sstable_streamer(netw::messaging_service& ms, replica::database& db, ::table_id table_id, locator::effective_replication_map_ptr erm,
tablet_sstable_streamer(netw::messaging_service& ms, sharded<replica::database>& db, ::table_id table_id, locator::effective_replication_map_ptr erm,
std::vector<sstables::shared_sstable> sstables, primary_replica_only primary, unlink_sstables unlink, stream_scope scope)
: sstable_streamer(ms, db, table_id, std::move(erm), std::move(sstables), primary, unlink, scope)
: sstable_streamer(ms, db.local(), table_id, std::move(erm), std::move(sstables), primary, unlink, scope)
, _db(db)
, _tablet_map(_erm->get_token_metadata().tablets().get_tablet_map(table_id)) {
}
@@ -199,9 +203,139 @@ private:
return result;
}
future<> stream_fully_contained_sstables(const dht::partition_range& pr, std::vector<sstables::shared_sstable> sstables, shared_ptr<stream_progress> progress) {
// FIXME: fully contained sstables can be optimized.
return stream_sstables(pr, std::move(sstables), std::move(progress));
struct minimal_sst_info {
sstables::generation_type _generation;
sstables::sstable_version_types _version;
sstables::sstable_format_types _format;
};
using sst_classification_info = std::vector<std::vector<minimal_sst_info>>;
future<> attach_sstable(shard_id from_shard, const sstring& ks, const sstring& cf, const minimal_sst_info& min_info) const {
llog.debug("Adding downloaded SSTables to the table {} on shard {}, submitted from shard {}", _table.schema()->cf_name(), this_shard_id(), from_shard);
auto& db = _db.local();
auto& table = db.find_column_family(ks, cf);
auto& sst_manager = table.get_sstables_manager();
auto sst = sst_manager.make_sstable(
table.schema(), table.get_storage_options(), min_info._generation, sstables::sstable_state::normal, min_info._version, min_info._format);
sst->set_sstable_level(0);
auto units = co_await sst_manager.dir_semaphore().get_units(1);
co_await sst->load(table.get_effective_replication_map()->get_sharder(*table.schema()));
co_await table.add_sstable_and_update_cache(sst);
}
future<>
stream_fully_contained_sstables(const dht::partition_range& pr, std::vector<sstables::shared_sstable> sstables, shared_ptr<stream_progress> progress) {
if (_stream_scope != stream_scope::node) {
co_return co_await stream_sstables(pr, std::move(sstables), std::move(progress));
}
llog.debug("Directly downloading {} fully contained SSTables to local node from object storage.", sstables.size());
auto downloaded_ssts = co_await download_fully_contained_sstables(std::move(sstables));
co_await smp::invoke_on_all(
[this, &downloaded_ssts, from = this_shard_id(), ks = _table.schema()->ks_name(), cf = _table.schema()->cf_name()] -> future<> {
auto shard_ssts = std::move(downloaded_ssts[this_shard_id()]);
for (const auto& min_info : shard_ssts) {
co_await attach_sstable(from, ks, cf, min_info);
}
});
if (progress) {
progress->advance(std::accumulate(downloaded_ssts.cbegin(), downloaded_ssts.cend(), 0., [](float acc, const auto& v) { return acc + v.size(); }));
}
}
future<sst_classification_info> download_fully_contained_sstables(std::vector<sstables::shared_sstable> sstables) const {
constexpr auto foptions = file_open_options{.extent_allocation_size_hint = 32_MiB, .sloppy_size = true};
constexpr auto stream_options = file_output_stream_options{.buffer_size = 128_KiB, .write_behind = 10};
sst_classification_info downloaded_sstables(smp::count);
for (const auto& sstable : sstables) {
auto components = sstable->all_components();
// Move the TOC to the front to be processed first since `sstables::create_stream_sink` takes care
// of creating behind the scene TemporaryTOC instead of usual one. This assures that in case of failure
// this partially created SSTable will be cleaned up properly at some point.
auto toc_it = std::ranges::find_if(components, [](const auto& component) { return component.first == component_type::TOC; });
if (toc_it != components.begin()) {
swap(*toc_it, components.front());
}
// Ensure the Scylla component is processed second.
//
// The sstable_sink->output() call for each component may invoke load_metadata()
// and save_metadata(), but these functions only operate correctly if the Scylla
// component file already exists on disk. If the Scylla component is written first,
// load_metadata()/save_metadata() become no-ops, leaving the original Scylla
// component (with outdated metadata) untouched.
//
// By placing the Scylla component second, we guarantee that:
// 1) The first component (TOC) is written and the Scylla component file already
// exists on disk when subsequent output() calls happen.
// 2) Later output() calls will overwrite the Scylla component with the correct,
// updated metadata.
//
// In short: Scylla must be written second so that all following output() calls
// can properly update its metadata instead of silently skipping it.
auto scylla_it = std::ranges::find_if(components, [](const auto& component) { return component.first == component_type::Scylla; });
if (scylla_it != std::next(components.begin())) {
swap(*scylla_it, *std::next(components.begin()));
}
auto gen = _table.get_sstable_generation_generator()();
auto files = co_await sstable->readable_file_for_all_components();
for (auto it = components.cbegin(); it != components.cend(); ++it) {
try {
auto descriptor = sstable->get_descriptor(it->first);
auto sstable_sink = sstables::create_stream_sink(
_table.schema(),
_table.get_sstables_manager(),
_table.get_storage_options(),
sstables::sstable_state::normal,
sstables::sstable::component_basename(
_table.schema()->ks_name(), _table.schema()->cf_name(), descriptor.version, gen, descriptor.format, it->first),
sstables::sstable_stream_sink_cfg{.last_component = std::next(it) == components.cend()});
auto out = co_await sstable_sink->output(foptions, stream_options);
input_stream src(co_await [this, &it, sstable, f = files.at(it->first)]() -> future<input_stream<char>> {
const auto fis_options = file_input_stream_options{.buffer_size = 128_KiB, .read_ahead = 2};
if (it->first != sstables::component_type::Data) {
co_return input_stream<char>(
co_await sstable->get_storage().make_source(*sstable, it->first, f, 0, std::numeric_limits<size_t>::max(), fis_options));
}
auto permit = co_await _db.local().obtain_reader_permit(_table, "download_fully_contained_sstables", db::no_timeout, {});
co_return co_await (
sstable->get_compression()
? sstable->data_stream(0, sstable->ondisk_data_size(), std::move(permit), nullptr, nullptr, sstables::sstable::raw_stream::yes)
: sstable->data_stream(0, sstable->data_size(), std::move(permit), nullptr, nullptr, sstables::sstable::raw_stream::no));
}());
std::exception_ptr eptr;
try {
co_await seastar::copy(src, out);
} catch (...) {
eptr = std::current_exception();
llog.info("Error downloading SSTable component {}. Reason: {}", it->first, eptr);
}
co_await src.close();
co_await out.close();
if (eptr) {
co_await sstable_sink->abort();
std::rethrow_exception(eptr);
}
if (auto sst = co_await sstable_sink->close()) {
const auto& shards = sstable->get_shards_for_this_sstable();
if (shards.size() != 1) {
on_internal_error(llog, "Fully-contained sstable must belong to one shard only");
}
llog.debug("SSTable shards {}", fmt::join(shards, ", "));
downloaded_sstables[shards.front()].emplace_back(gen, descriptor.version, descriptor.format);
}
} catch (...) {
llog.info("Error downloading SSTable component {}. Reason: {}", it->first, std::current_exception());
throw;
}
}
}
co_return downloaded_sstables;
}
bool tablet_in_scope(locator::tablet_id) const;
@@ -400,8 +534,14 @@ future<> tablet_sstable_streamer::stream(shared_ptr<stream_progress> progress) {
progress,
sstables_fully_contained.size() + sstables_partially_contained.size());
auto tablet_pr = dht::to_partition_range(tablet_range);
co_await stream_sstables(tablet_pr, std::move(sstables_partially_contained), per_tablet_progress);
co_await stream_fully_contained_sstables(tablet_pr, std::move(sstables_fully_contained), per_tablet_progress);
if (!sstables_partially_contained.empty()) {
llog.debug("Streaming {} partially contained SSTables.",sstables_partially_contained.size());
co_await stream_sstables(tablet_pr, std::move(sstables_partially_contained), per_tablet_progress);
}
if (!sstables_fully_contained.empty()) {
llog.debug("Streaming {} fully contained SSTables.",sstables_fully_contained.size());
co_await stream_fully_contained_sstables(tablet_pr, std::move(sstables_fully_contained), per_tablet_progress);
}
}
}
@@ -536,14 +676,6 @@ future<> sstable_streamer::stream_sstable_mutations(streaming::plan_id ops_uuid,
}
}
template <typename... Args>
static std::unique_ptr<sstable_streamer> make_sstable_streamer(bool uses_tablets, Args&&... args) {
if (uses_tablets) {
return std::make_unique<tablet_sstable_streamer>(std::forward<Args>(args)...);
}
return std::make_unique<sstable_streamer>(std::forward<Args>(args)...);
}
future<locator::effective_replication_map_ptr> sstables_loader::await_topology_quiesced_and_get_erm(::table_id table_id) {
// By waiting for topology to quiesce, we guarantee load-and-stream will not start in the middle
// of a topology operation that changes the token range boundaries, e.g. split or merge.
@@ -581,9 +713,14 @@ future<> sstables_loader::load_and_stream(sstring ks_name, sstring cf_name,
// throughout its lifetime.
auto erm = co_await await_topology_quiesced_and_get_erm(table_id);
auto streamer = make_sstable_streamer(_db.local().find_column_family(table_id).uses_tablets(),
_messaging, _db.local(), table_id, std::move(erm), std::move(sstables),
primary, unlink_sstables(unlink), scope);
std::unique_ptr<sstable_streamer> streamer;
if (_db.local().find_column_family(table_id).uses_tablets()) {
streamer =
std::make_unique<tablet_sstable_streamer>(_messaging, _db, table_id, std::move(erm), std::move(sstables), primary, unlink_sstables(unlink), scope);
} else {
streamer =
std::make_unique<sstable_streamer>(_messaging, _db.local(), table_id, std::move(erm), std::move(sstables), primary, unlink_sstables(unlink), scope);
}
co_await streamer->stream(progress);
}

View File

@@ -18,6 +18,7 @@
#include "db/view/view_update_checks.hh"
#include "sstables/sstables.hh"
#include "sstables/sstables_manager.hh"
#include "debug.hh"
namespace streaming {
@@ -33,7 +34,7 @@ mutation_reader_consumer make_streaming_consumer(sstring origin,
return [&db, &vb = vb.container(), &vbw, estimated_partitions, reason, offstrategy, origin = std::move(origin), frozen_guard, on_sstable_written] (mutation_reader reader) -> future<> {
std::exception_ptr ex;
try {
if (current_scheduling_group() != db.local().get_streaming_scheduling_group()) {
if (current_scheduling_group() != debug::streaming_scheduling_group) {
on_internal_error(sstables::sstlog, format("The stream consumer is not running in streaming group current_scheduling_group={}",
current_scheduling_group().name()));
}

View File

@@ -67,6 +67,7 @@ PYTEST_RUNNER_DIRECTORIES = [
TEST_DIR / 'cql',
TEST_DIR / 'cqlpy',
TEST_DIR / 'rest_api',
TEST_DIR / 'nodetool',
]
launch_time = time.monotonic()
@@ -252,8 +253,6 @@ def parse_cmd_line() -> argparse.Namespace:
default=None, dest="pytest_arg",
help="Additional command line arguments to pass to pytest, for example ./test.py --pytest-arg=\"-v -x\"")
scylla_additional_options = parser.add_argument_group('Additional options for Scylla tests')
scylla_additional_options.add_argument('--x-log2-compaction-groups', action="store", default="0", type=int,
help="Controls number of compaction groups to be used by Scylla tests. Value of 3 implies 8 groups.")
scylla_additional_options.add_argument('--extra-scylla-cmdline-options', action="store", default="", type=str,
help="Passing extra scylla cmdline options for all tests. Options should be space separated:"
"'--logger-log-level raft=trace --default-log-level error'")
@@ -364,8 +363,6 @@ def run_pytest(options: argparse.Namespace) -> tuple[int, list[SimpleNamespace]]
args.extend(shlex.split(options.pytest_arg))
if options.random_seed:
args.append(f'--random-seed={options.random_seed}')
if options.x_log2_compaction_groups:
args.append(f'--x-log2-compaction-groups={options.x_log2_compaction_groups}')
if options.gather_metrics:
args.append('--gather-metrics')
if options.timeout:

View File

@@ -451,24 +451,6 @@ def test_update_table_non_existent(dynamodb, test_table):
with pytest.raises(ClientError, match='ResourceNotFoundException'):
client.update_table(TableName=random_string(20), BillingMode='PAY_PER_REQUEST')
# Consistent schema change feature is optionally enabled and
# some tests are expected to fail on Scylla without this
# option enabled, and pass with it enabled (and also pass on Cassandra).
# These tests should use the "fails_without_consistent_cluster_management"
# fixture. When consistent mode becomes the default, this fixture can be removed.
@pytest.fixture(scope="module")
def check_pre_consistent_cluster_management(dynamodb):
# If not running on Scylla, return false.
if is_aws(dynamodb):
return False
consistent = scylla_config_read(dynamodb, 'consistent_cluster_management')
return consistent is None or consistent == 'false'
@pytest.fixture(scope="function")
def fails_without_consistent_cluster_management(request, check_pre_consistent_cluster_management):
if check_pre_consistent_cluster_management:
request.node.add_marker(pytest.mark.xfail(reason='Test expected to fail without consistent cluster management feature on'))
# Test for reproducing issues #6391 and #9868 - where CreateTable did not
# *atomically* perform all the schema modifications - creating a keyspace,
# a table, secondary indexes and tags - and instead it created the different
@@ -526,7 +508,7 @@ def fails_without_consistent_cluster_management(request, check_pre_consistent_cl
'Tags': [{'Key': 'k1', 'Value': 'v1'}]
}
])
def test_concurrent_create_and_delete_table(dynamodb, table_def, fails_without_consistent_cluster_management):
def test_concurrent_create_and_delete_table(dynamodb, table_def):
# According to boto3 documentation, "Unlike Resources and Sessions,
# clients are generally thread-safe.". So because we have two threads
# in this test, we must not use "dynamodb" (containing the boto3

View File

@@ -11,6 +11,7 @@
#include "utils/s3/aws_error.hh"
#include <boost/test/unit_test.hpp>
#include <seastar/core/sstring.hh>
#include <seastar/http/exception.hh>
enum class message_style : uint8_t { singular = 1, plural = 2 };
@@ -122,7 +123,7 @@ BOOST_AUTO_TEST_CASE(TestNestedException) {
std::throw_with_nested(std::logic_error("Higher level logic_error"));
}
} catch (...) {
auto error = aws::aws_error::from_maybe_nested_exception(std::current_exception());
auto error = aws::aws_error::from_exception_ptr(std::current_exception());
BOOST_REQUIRE_EQUAL(aws::aws_error_type::NETWORK_CONNECTION, error.get_error_type());
BOOST_REQUIRE_EQUAL("Software caused connection abort", error.get_error_message());
BOOST_REQUIRE_EQUAL(error.is_retryable(), aws::retryable::yes);
@@ -136,7 +137,7 @@ BOOST_AUTO_TEST_CASE(TestNestedException) {
std::throw_with_nested(std::runtime_error("Higher level runtime_error"));
}
} catch (...) {
auto error = aws::aws_error::from_maybe_nested_exception(std::current_exception());
auto error = aws::aws_error::from_exception_ptr(std::current_exception());
BOOST_REQUIRE_EQUAL(aws::aws_error_type::UNKNOWN, error.get_error_type());
BOOST_REQUIRE_EQUAL("Higher level runtime_error", error.get_error_message());
BOOST_REQUIRE_EQUAL(error.is_retryable(), aws::retryable::no);
@@ -146,7 +147,7 @@ BOOST_AUTO_TEST_CASE(TestNestedException) {
try {
throw std::runtime_error("Something bad happened");
} catch (...) {
auto error = aws::aws_error::from_maybe_nested_exception(std::current_exception());
auto error = aws::aws_error::from_exception_ptr(std::current_exception());
BOOST_REQUIRE_EQUAL(aws::aws_error_type::UNKNOWN, error.get_error_type());
BOOST_REQUIRE_EQUAL("Something bad happened", error.get_error_message());
BOOST_REQUIRE_EQUAL(error.is_retryable(), aws::retryable::no);
@@ -156,9 +157,39 @@ BOOST_AUTO_TEST_CASE(TestNestedException) {
try {
throw "foo";
} catch (...) {
auto error = aws::aws_error::from_maybe_nested_exception(std::current_exception());
auto error = aws::aws_error::from_exception_ptr(std::current_exception());
BOOST_REQUIRE_EQUAL(aws::aws_error_type::UNKNOWN, error.get_error_type());
BOOST_REQUIRE_EQUAL("No error message was provided, exception content: char const*", error.get_error_message());
BOOST_REQUIRE_EQUAL(error.is_retryable(), aws::retryable::no);
}
// Test system_error
try {
throw std::system_error(std::error_code(ECONNABORTED, std::system_category()));
} catch (...) {
auto error = aws::aws_error::from_exception_ptr(std::current_exception());
BOOST_REQUIRE_EQUAL(aws::aws_error_type::NETWORK_CONNECTION, error.get_error_type());
BOOST_REQUIRE_EQUAL("Software caused connection abort", error.get_error_message());
BOOST_REQUIRE_EQUAL(error.is_retryable(), aws::retryable::yes);
}
// Test aws_exception
try {
throw aws::aws_exception(aws::aws_error::get_errors().at("HTTP_TOO_MANY_REQUESTS"));
} catch (...) {
auto error = aws::aws_error::from_exception_ptr(std::current_exception());
BOOST_REQUIRE_EQUAL(aws::aws_error_type::HTTP_TOO_MANY_REQUESTS, error.get_error_type());
BOOST_REQUIRE_EQUAL("", error.get_error_message());
BOOST_REQUIRE_EQUAL(error.is_retryable(), aws::retryable::yes);
}
// Test httpd::unexpected_status_error
try {
throw seastar::httpd::unexpected_status_error(seastar::http::reply::status_type::network_connect_timeout);
} catch (...) {
auto error = aws::aws_error::from_exception_ptr(std::current_exception());
BOOST_REQUIRE_EQUAL(aws::aws_error_type::HTTP_NETWORK_CONNECT_TIMEOUT, error.get_error_type());
BOOST_REQUIRE_EQUAL(" HTTP code: 599 Network Connect Timeout", error.get_error_message());
BOOST_REQUIRE_EQUAL(error.is_retryable(), aws::retryable::yes);
}
}

View File

@@ -201,7 +201,7 @@ SEASTAR_THREAD_TEST_CASE(test_conversion_to_legacy_form_same_token_singular) {
auto key1 = partition_key::from_single_value(*s1, b);
auto dk1 = partitioner.decorate_key(*s1, key1);
BOOST_REQUIRE_EQUAL(dk._token, dk1._token);
BOOST_REQUIRE_EQUAL(dk.token(), dk1.token());
}
SEASTAR_THREAD_TEST_CASE(test_conversion_to_legacy_form_same_token_two_components) {
@@ -223,7 +223,7 @@ SEASTAR_THREAD_TEST_CASE(test_conversion_to_legacy_form_same_token_two_component
auto key1 = partition_key::from_single_value(*s1, b);
auto dk1 = partitioner.decorate_key(*s1, key1);
BOOST_REQUIRE_EQUAL(dk._token, dk1._token);
BOOST_REQUIRE_EQUAL(dk.token(), dk1.token());
}
SEASTAR_THREAD_TEST_CASE(test_legacy_ordering_of_singular) {

View File

@@ -10,7 +10,7 @@
#include "sstables/consumer.hh"
#include "bytes.hh"
#include "utils/buffer_input_stream.hh"
#include "test/lib/limiting_data_source.hh"
#include "test/lib/reader_concurrency_semaphore.hh"
#include "test/lib/random_utils.hh"
#include "schema/schema.hh"
@@ -20,12 +20,18 @@
#include <seastar/core/iostream.hh>
#include <seastar/core/temporary_buffer.hh>
#include <seastar/core/thread.hh>
#include <seastar/util/memory-data-source.hh>
#include "test/lib/scylla_test_case.hh"
#include <seastar/testing/thread_test_case.hh>
#include <random>
namespace {
input_stream<char> make_buffer_input_stream(temporary_buffer<char>&& buf, size_t limit) {
auto res = data_source{std::make_unique<seastar::util::temporary_buffer_data_source>(std::move(buf))};
return input_stream < char > { make_limiting_data_source(std::move(res), limit) };
}
class test_consumer final : public data_consumer::continuous_data_consumer<test_consumer> {
static const int MULTIPLIER = 10;
uint64_t _tested_value;
@@ -48,7 +54,7 @@ class test_consumer final : public data_consumer::continuous_data_consumer<test_
for (int i = 0; i < MULTIPLIER; ++i) {
pos += unsigned_vint::serialize(tested_value, out + pos);
}
return make_buffer_input_stream(std::move(buf), [] {return 1;});
return make_buffer_input_stream(std::move(buf), 1);
}
public:
@@ -121,7 +127,7 @@ class skipping_consumer final : public data_consumer::continuous_data_consumer<s
std::memset(buf.get_write(), 'a', initial_data_size);
std::memset(buf.get_write() + initial_data_size, 'b', to_skip);
std::memset(buf.get_write() + initial_data_size + to_skip, 'a', next_data_size);
return make_buffer_input_stream(std::move(buf), [] {return 1;});
return make_buffer_input_stream(std::move(buf), 1);
}
static size_t prepare_initial_consumer_length(int initial_data_size, int to_skip) {
// some bytes that we want to skip may end up even after the initial consumer range

View File

@@ -7,9 +7,9 @@
*/
#include "seastar/core/shard_id.hh"
#include <boost/test/tools/old/interface.hpp>
#include <seastar/core/seastar.hh>
#include <seastar/core/shard_id.hh>
#include <seastar/core/smp.hh>
#include <seastar/core/thread.hh>
#include <seastar/core/coroutine.hh>
@@ -97,29 +97,12 @@ static future<> apply_mutation(sharded<replica::database>& sharded_db, table_id
});
}
future<> do_with_cql_env_and_compaction_groups_cgs(unsigned cgs, std::function<void(cql_test_env&)> func, cql_test_config cfg = {}, thread_attributes thread_attr = {}) {
// clean the dir before running
if (cfg.db_config->data_file_directories.is_set() && cfg.clean_data_dir_before_test) {
co_await recursive_remove_directory(fs::path(cfg.db_config->data_file_directories()[0]));
co_await recursive_touch_directory(cfg.db_config->data_file_directories()[0]);
}
// TODO: perhaps map log2_compaction_groups into initial_tablets when creating the testing keyspace.
co_await do_with_cql_env_thread(func, cfg, thread_attr);
}
future<> do_with_cql_env_and_compaction_groups(std::function<void(cql_test_env&)> func, cql_test_config cfg = {}, thread_attributes thread_attr = {}) {
std::vector<unsigned> x_log2_compaction_group_values = { 0 /* 1 CG */ };
for (auto x_log2_compaction_groups : x_log2_compaction_group_values) {
co_await do_with_cql_env_and_compaction_groups_cgs(x_log2_compaction_groups, func, cfg, thread_attr);
}
}
BOOST_AUTO_TEST_SUITE(database_test)
SEASTAR_TEST_CASE(test_safety_after_truncate) {
auto cfg = make_shared<db::config>();
cfg->auto_snapshot.set(false);
return do_with_cql_env_and_compaction_groups([](cql_test_env& e) {
return do_with_cql_env_thread([](cql_test_env& e) {
e.execute_cql("create table ks.cf (k text, v int, primary key (k));").get();
auto& db = e.local_db();
sstring ks_name = "ks";
@@ -180,7 +163,7 @@ SEASTAR_TEST_CASE(test_safety_after_truncate) {
SEASTAR_TEST_CASE(test_truncate_without_snapshot_during_writes) {
auto cfg = make_shared<db::config>();
cfg->auto_snapshot.set(false);
return do_with_cql_env_and_compaction_groups([] (cql_test_env& e) {
return do_with_cql_env_thread([] (cql_test_env& e) {
sstring ks_name = "ks";
sstring cf_name = "cf";
e.execute_cql(fmt::format("create table {}.{} (k text, v int, primary key (k));", ks_name, cf_name)).get();
@@ -218,7 +201,7 @@ SEASTAR_TEST_CASE(test_truncate_without_snapshot_during_writes) {
SEASTAR_TEST_CASE(test_truncate_saves_replay_position) {
auto cfg = make_shared<db::config>();
cfg->auto_snapshot.set(false);
return do_with_cql_env_and_compaction_groups([] (cql_test_env& e) {
return do_with_cql_env_thread([] (cql_test_env& e) {
BOOST_REQUIRE_GT(smp::count, 1);
const sstring ks_name = "ks";
const sstring cf_name = "cf";
@@ -236,7 +219,7 @@ SEASTAR_TEST_CASE(test_truncate_saves_replay_position) {
}
SEASTAR_TEST_CASE(test_querying_with_limits) {
return do_with_cql_env_and_compaction_groups([](cql_test_env& e) {
return do_with_cql_env_thread([](cql_test_env& e) {
// FIXME: restore indent.
e.execute_cql("create table ks.cf (k text, v int, primary key (k));").get();
auto& db = e.local_db();
@@ -304,8 +287,8 @@ SEASTAR_TEST_CASE(test_querying_with_limits) {
});
}
static void test_database(void (*run_tests)(populate_fn_ex, bool), unsigned cgs) {
do_with_cql_env_and_compaction_groups_cgs(cgs, [run_tests] (cql_test_env& e) {
static void test_database(void (*run_tests)(populate_fn_ex, bool)) {
do_with_cql_env_thread([run_tests] (cql_test_env& e) {
run_tests([&] (schema_ptr s, const utils::chunked_vector<mutation>& partitions, gc_clock::time_point) -> mutation_source {
auto& mm = e.migration_manager().local();
try {
@@ -339,72 +322,36 @@ static void test_database(void (*run_tests)(populate_fn_ex, bool), unsigned cgs)
}).get();
}
// plain cg0
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_basic_cg0) {
test_database(run_mutation_source_tests_plain_basic, 0);
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_basic) {
test_database(run_mutation_source_tests_plain_basic);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_reader_conversion_cg0) {
test_database(run_mutation_source_tests_plain_reader_conversion, 0);
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_reader_conversion) {
test_database(run_mutation_source_tests_plain_reader_conversion);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_fragments_monotonic_cg0) {
test_database(run_mutation_source_tests_plain_fragments_monotonic, 0);
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_fragments_monotonic) {
test_database(run_mutation_source_tests_plain_fragments_monotonic);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_read_back_cg0) {
test_database(run_mutation_source_tests_plain_read_back, 0);
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_read_back) {
test_database(run_mutation_source_tests_plain_read_back);
}
// plain cg1
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_basic_cg1) {
test_database(run_mutation_source_tests_plain_basic, 1);
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_basic) {
test_database(run_mutation_source_tests_reverse_basic);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_reader_conversion_cg1) {
test_database(run_mutation_source_tests_plain_reader_conversion, 1);
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_reader_conversion) {
test_database(run_mutation_source_tests_reverse_reader_conversion);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_fragments_monotonic_cg1) {
test_database(run_mutation_source_tests_plain_fragments_monotonic, 1);
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_fragments_monotonic) {
test_database(run_mutation_source_tests_reverse_fragments_monotonic);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_plain_read_back_cg1) {
test_database(run_mutation_source_tests_plain_read_back, 1);
}
// reverse cg0
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_basic_cg0) {
test_database(run_mutation_source_tests_reverse_basic, 0);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_reader_conversion_cg0) {
test_database(run_mutation_source_tests_reverse_reader_conversion, 0);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_fragments_monotonic_cg0) {
test_database(run_mutation_source_tests_reverse_fragments_monotonic, 0);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_read_back_cg0) {
test_database(run_mutation_source_tests_reverse_read_back, 0);
}
// reverse cg1
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_basic_cg1) {
test_database(run_mutation_source_tests_reverse_basic, 1);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_reader_conversion_cg1) {
test_database(run_mutation_source_tests_reverse_reader_conversion, 1);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_fragments_monotonic_cg1) {
test_database(run_mutation_source_tests_reverse_fragments_monotonic, 1);
}
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_read_back_cg1) {
test_database(run_mutation_source_tests_reverse_read_back, 1);
SEASTAR_THREAD_TEST_CASE(test_database_with_data_in_sstables_is_a_mutation_source_reverse_read_back) {
test_database(run_mutation_source_tests_reverse_read_back);
}
static void require_exist(const sstring& filename, bool should) {
@@ -455,7 +402,7 @@ SEASTAR_THREAD_TEST_CASE(test_distributed_loader_with_incomplete_sstables) {
auto test_config = cql_test_config(db_cfg_ptr);
test_config.clean_data_dir_before_test = false;
do_with_cql_env_and_compaction_groups([&sst_dir, &ks, &cf, &temp_sst_dir_2, &temp_sst_dir_3] (cql_test_env& e) {
do_with_cql_env_thread([&sst_dir, &ks, &cf, &temp_sst_dir_2, &temp_sst_dir_3] (cql_test_env& e) {
require_exist(temp_sst_dir_2, false);
require_exist(temp_sst_dir_3, false);
@@ -544,7 +491,7 @@ SEASTAR_THREAD_TEST_CASE(test_distributed_loader_with_pending_delete) {
auto test_config = cql_test_config(db_cfg_ptr);
test_config.clean_data_dir_before_test = false;
do_with_cql_env_and_compaction_groups([&] (cql_test_env& e) {
do_with_cql_env_thread([&] (cql_test_env& e) {
// Empty log filesst_dir
// Empty temporary log file
require_exist(pending_delete_dir + "/sstables-1-1.log.tmp", false);
@@ -582,7 +529,7 @@ future<> do_with_some_data_in_thread(std::vector<sstring> cf_names, std::functio
db_cfg_ptr = make_shared<db::config>();
db_cfg_ptr->data_file_directories(std::vector<sstring>({ tmpdir_for_data->path().string() }));
}
do_with_cql_env_and_compaction_groups([cf_names = std::move(cf_names), func = std::move(func), create_mvs, num_keys] (cql_test_env& e) {
do_with_cql_env_thread([cf_names = std::move(cf_names), func = std::move(func), create_mvs, num_keys] (cql_test_env& e) {
for (const auto& cf_name : cf_names) {
e.create_table([&cf_name] (std::string_view ks_name) {
return *schema_builder(ks_name, cf_name)
@@ -1113,7 +1060,7 @@ SEASTAR_TEST_CASE(test_snapshot_ctl_true_snapshots_size) {
// toppartitions_query caused a lw_shared_ptr to cross shards when moving results, #5104
SEASTAR_TEST_CASE(toppartitions_cross_shard_schema_ptr) {
return do_with_cql_env_and_compaction_groups([] (cql_test_env& e) {
return do_with_cql_env_thread([] (cql_test_env& e) {
e.execute_cql("CREATE TABLE ks.tab (id int PRIMARY KEY)").get();
db::toppartitions_query tq(e.db(), {{"ks", "tab"}}, {}, 1s, 100, 100);
tq.scatter().get();
@@ -1128,7 +1075,7 @@ SEASTAR_TEST_CASE(toppartitions_cross_shard_schema_ptr) {
}
SEASTAR_THREAD_TEST_CASE(read_max_size) {
do_with_cql_env_and_compaction_groups([] (cql_test_env& e) {
do_with_cql_env_thread([] (cql_test_env& e) {
e.execute_cql("CREATE TABLE test (pk text, ck int, v text, PRIMARY KEY (pk, ck));").get();
auto id = e.prepare("INSERT INTO test (pk, ck, v) VALUES (?, ?, ?);").get();
@@ -1217,7 +1164,7 @@ SEASTAR_THREAD_TEST_CASE(unpaged_mutation_read_global_limit) {
// configured based on the available memory, so give a small amount to
// the "node", so we don't have to work with large amount of data.
cfg.dbcfg->available_memory = 2 * 1024 * 1024;
do_with_cql_env_and_compaction_groups([] (cql_test_env& e) {
do_with_cql_env_thread([] (cql_test_env& e) {
e.execute_cql("CREATE TABLE test (pk text, ck int, v text, PRIMARY KEY (pk, ck));").get();
auto id = e.prepare("INSERT INTO test (pk, ck, v) VALUES (?, ?, ?);").get();
@@ -1301,7 +1248,7 @@ SEASTAR_THREAD_TEST_CASE(reader_concurrency_semaphore_selection_test) {
scheduling_group_and_expected_semaphore.emplace_back(sched_groups.gossip_scheduling_group, system_semaphore);
scheduling_group_and_expected_semaphore.emplace_back(unknown_scheduling_group, user_semaphore);
do_with_cql_env_and_compaction_groups([&scheduling_group_and_expected_semaphore] (cql_test_env& e) {
do_with_cql_env_thread([&scheduling_group_and_expected_semaphore] (cql_test_env& e) {
auto& db = e.local_db();
database_test_wrapper tdb(db);
for (const auto& [sched_group, expected_sem_getter] : scheduling_group_and_expected_semaphore) {
@@ -1352,7 +1299,7 @@ SEASTAR_THREAD_TEST_CASE(max_result_size_for_query_selection_test) {
scheduling_group_and_expected_max_result_size.emplace_back(sched_groups.gossip_scheduling_group, system_max_result_size);
scheduling_group_and_expected_max_result_size.emplace_back(unknown_scheduling_group, user_max_result_size);
do_with_cql_env_and_compaction_groups([&scheduling_group_and_expected_max_result_size] (cql_test_env& e) {
do_with_cql_env_thread([&scheduling_group_and_expected_max_result_size] (cql_test_env& e) {
auto& db = e.local_db();
database_test_wrapper tdb(db);
for (const auto& [sched_group, expected_max_size] : scheduling_group_and_expected_max_result_size) {
@@ -1433,7 +1380,7 @@ SEASTAR_TEST_CASE(multipage_range_scan_semaphore_mismatch) {
// Test `upgrade_sstables` on all keyspaces (including the system keyspace).
// Refs: #9494 (https://github.com/scylladb/scylla/issues/9494)
SEASTAR_TEST_CASE(upgrade_sstables) {
return do_with_cql_env_and_compaction_groups([] (cql_test_env& e) {
return do_with_cql_env_thread([] (cql_test_env& e) {
e.db().invoke_on_all([] (replica::database& db) -> future<> {
auto& cm = db.get_compaction_manager();
for (auto& [ks_name, ks] : db.get_keyspaces()) {
@@ -1645,7 +1592,7 @@ SEASTAR_TEST_CASE(snapshot_with_quarantine_works) {
}
SEASTAR_TEST_CASE(database_drop_column_family_clears_querier_cache) {
return do_with_cql_env_and_compaction_groups([] (cql_test_env& e) {
return do_with_cql_env_thread([] (cql_test_env& e) {
e.execute_cql("create table ks.cf (k text, v int, primary key (k));").get();
auto& db = e.local_db();
auto& tbl = db.find_column_family("ks", "cf");

View File

@@ -113,9 +113,6 @@ static future<> test_provider(const test_provider_args& args) {
auto cfg = seastar::make_shared<db::config>(ext);
cfg->data_file_directories({args.tmp.path().string()});
// Currently the test fails with consistent_cluster_management = true. See #2995.
cfg->consistent_cluster_management(false);
{
boost::program_options::options_description desc;
boost::program_options::options_description_easy_init init(&desc);
@@ -199,9 +196,6 @@ static auto make_commitlog_config(const test_provider_args& args, const std::uno
cfg->data_file_directories({args.tmp.path().string()});
cfg->commitlog_sync("batch"); // just to make sure files are written
// Currently the test fails with consistent_cluster_management = true. See #2995.
cfg->consistent_cluster_management(false);
boost::program_options::options_description desc;
boost::program_options::options_description_easy_init init(&desc);
configurable::append_all(*cfg, init);

View File

@@ -7,6 +7,7 @@
*/
#include <seastar/core/thread.hh>
#include <seastar/util/memory-data-source.hh>
#include "test/lib/scylla_test_case.hh"
#include "utils/assert.hh"
@@ -342,29 +343,6 @@ SEASTAR_THREAD_TEST_CASE(test_read_bytes_view) {
}
}
namespace {
class memory_data_source final : public data_source_impl {
private:
using vector_type = std::vector<temporary_buffer<char>>;
vector_type _buffers;
vector_type::iterator _position;
public:
explicit memory_data_source(std::vector<temporary_buffer<char>> buffers)
: _buffers(std::move(buffers))
, _position(_buffers.begin())
{ }
virtual future<temporary_buffer<char>> get() override {
if (_position == _buffers.end()) {
return make_ready_future<temporary_buffer<char>>();
}
return make_ready_future<temporary_buffer<char>>(std::move(*_position++));
}
};
}
SEASTAR_THREAD_TEST_CASE(test_read_fragmented_buffer) {
using tuple_type = std::tuple<std::vector<temporary_buffer<char>>,
bytes,
@@ -412,7 +390,7 @@ SEASTAR_THREAD_TEST_CASE(test_read_fragmented_buffer) {
auto size = expected_data.size();
auto suffix_size = expected_suffix.size();
auto in = input_stream<char>(data_source(std::make_unique<memory_data_source>(std::move(buffers))));
auto in = seastar::util::as_input_stream(std::move(buffers));
auto prefix = in.read_exactly(prefix_size).get();
BOOST_CHECK_EQUAL(prefix.size(), prefix_size);
@@ -491,8 +469,7 @@ static void do_test_read_exactly_eof(size_t input_size) {
if (input_size) {
data.push_back(temporary_buffer<char>(input_size));
}
auto ds = data_source(std::make_unique<memory_data_source>(std::move(data)));
auto is = input_stream<char>(std::move(ds));
auto is = seastar::util::as_input_stream(std::move(data));
auto reader = fragmented_temporary_buffer::reader();
auto result = reader.read_exactly(is, input_size + 1).get();
BOOST_CHECK_EQUAL(result.size_bytes(), size_t(0));

View File

@@ -7,10 +7,10 @@
*/
#include "utils/assert.hh"
#include "utils/limiting_data_source.hh"
#include <boost/test/unit_test.hpp>
#include "test/lib/scylla_test_case.hh"
#include "test/lib/limiting_data_source.hh"
#include <seastar/testing/thread_test_case.hh>
#include <seastar/core/iostream.hh>
#include <seastar/core/temporary_buffer.hh>
@@ -53,7 +53,7 @@ data_source create_test_data_source() {
void test_get(unsigned limit) {
auto src = create_test_data_source();
auto tested = make_limiting_data_source(std::move(src), [limit] { return limit; });
auto tested = make_limiting_data_source(std::move(src), limit);
char expected = 0;
auto test_get = [&] {
auto buf = tested.get().get();
@@ -69,7 +69,7 @@ void test_get(unsigned limit) {
data_source prepare_test_skip() {
auto src = create_test_data_source();
auto tested = make_limiting_data_source(std::move(src), [] { return 1; });
auto tested = make_limiting_data_source(std::move(src), 1);
auto buf = tested.get().get();
BOOST_REQUIRE_EQUAL(1, buf.size());
BOOST_REQUIRE_EQUAL(0, buf[0]);

View File

@@ -120,7 +120,7 @@ SEASTAR_THREAD_TEST_CASE(test_decorated_key_is_compatible_with_origin) {
auto dk = partitioner.decorate_key(*s, key);
// Expected value was taken from Origin
BOOST_REQUIRE_EQUAL(dk._token, token_from_long(4958784316840156970));
BOOST_REQUIRE_EQUAL(dk.token(), token_from_long(4958784316840156970));
BOOST_REQUIRE(dk._key.equal(*s, key));
}

View File

@@ -767,7 +767,6 @@ void test_chunked_download_data_source(const client_maker_function& client_maker
#endif
cln->delete_object(object_name).get();
cln->close().get();
}
SEASTAR_THREAD_TEST_CASE(test_chunked_download_data_source_with_delays_minio) {

View File

@@ -653,8 +653,8 @@ static bool key_range_overlaps(table_for_tests& cf, const dht::decorated_key& a,
}
static bool sstable_overlaps(const lw_shared_ptr<replica::column_family>& cf, sstables::shared_sstable candidate1, sstables::shared_sstable candidate2) {
auto range1 = wrapping_interval<dht::token>::make(candidate1->get_first_decorated_key()._token, candidate1->get_last_decorated_key()._token);
auto range2 = wrapping_interval<dht::token>::make(candidate2->get_first_decorated_key()._token, candidate2->get_last_decorated_key()._token);
auto range1 = wrapping_interval<dht::token>::make(candidate1->get_first_decorated_key().token(), candidate1->get_last_decorated_key().token());
auto range2 = wrapping_interval<dht::token>::make(candidate2->get_first_decorated_key().token(), candidate2->get_last_decorated_key().token());
return range1.overlaps(range2, dht::token_comparator());
}

View File

@@ -2986,8 +2986,8 @@ SEASTAR_TEST_CASE(test_index_fast_forwarding_after_eof) {
sst->load(sst->get_schema()->get_sharder()).get();
}
const auto t1 = muts.front().decorated_key()._token;
const auto t2 = muts.back().decorated_key()._token;
const auto t1 = muts.front().decorated_key().token();
const auto t2 = muts.back().decorated_key().token();
dht::partition_range_vector prs;
prs.emplace_back(dht::ring_position::starting_at(dht::token{t1.raw() - 200}), dht::ring_position::ending_at(dht::token{t1.raw() - 100}));

View File

@@ -9,7 +9,6 @@ import logging
import time
from cassandra.cluster import NoHostAvailable
from test.cluster.conftest import skip_mode
from test.pylib.manager_client import ManagerClient, ServerUpState
from test.pylib.util import wait_for
from test.cluster.auth_cluster import extra_scylla_config_options as auth_config

View File

@@ -14,7 +14,6 @@ from test.cluster.util import trigger_snapshot, wait_until_topology_upgrade_fini
delete_raft_topology_state, delete_raft_data_and_upgrade_state, wait_until_upgrade_finishes, \
wait_for_token_ring_and_group0_consistency, wait_until_driver_service_level_created, get_topology_coordinator, \
find_server_by_host_id
from test.cluster.conftest import skip_mode
from test.cqlpy.test_service_levels import MAX_USER_SERVICE_LEVELS
from cassandra import ConsistencyLevel
from cassandra.query import SimpleStatement

View File

@@ -268,7 +268,7 @@ async def manager(request: pytest.FixtureRequest,
# Save scylladb logs for failed tests in a separate directory and copy XML report to the same directory to have
# all related logs in one dir.
# Then add property to the XML report with the path to the directory, so it can be visible in Jenkins
failed_test_dir_path = testpy_test.suite.log_dir / "failed_test" / test_case_name
failed_test_dir_path = testpy_test.suite.log_dir / "failed_test" / test_case_name.translate(str.maketrans('[]', '()'))
failed_test_dir_path.mkdir(parents=True, exist_ok=True)
await manager_client.gather_related_logs(
failed_test_dir_path,
@@ -356,27 +356,6 @@ async def random_tables(request, manager):
if not failed and not await manager.is_dirty():
tables.drop_all()
skipped_funcs = {}
# Can be used to mark a test to be skipped for a specific mode=[release, dev, debug]
# The reason to skip a test should be specified, used as a comment only.
# Additionally, platform_key can be specified to limit the scope of the attribute
# to the specified platform. Example platform_key-s: [aarch64, x86_64]
@warnings.deprecated('Please use pytest.mark.skip_mode instead')
def skip_mode(mode: str, reason: str, platform_key: str | None = None):
"""DEPRECATED. Please use pytest.mark.skip_mode instead"""
def wrap(func):
skipped_funcs.setdefault((func, mode), []).append((reason, platform_key))
return func
return wrap
@pytest.fixture(scope="function", autouse=True)
@warnings.deprecated('Please use pytest.mark.skip_mode instead')
def skip_mode_fixture(request, build_mode):
for reason, platform_key in skipped_funcs.get((request.function, build_mode), []):
if platform_key is None or platform_key in platform.platform():
pytest.skip(f'{request.node.name} skipped, reason: {reason}')
@pytest.fixture(scope="function", autouse=True)
async def prepare_3_nodes_cluster(request, manager):
if request.node.get_closest_marker("prepare_3_nodes_cluster"):

View File

@@ -152,7 +152,7 @@ class TestHelper(Tester):
table_path = self.get_table_path(cf)
with tempfile.TemporaryDirectory() as tmp_dir:
node1.run_scylla_sstable("scrub", additional_args=["--scrub-mode", "abort", "--output-dir", tmp_dir, "--logger-log-level", "scylla-sstable=debug", "--unsafe-accept-nonempty-output-dir"], keyspace=ks, column_families=[cf])
node1.run_scylla_sstable("scrub", additional_args=["--scrub-mode", "abort", "--output-dir", tmp_dir, "--logger-log-level", "scylla-sstable=debug"], keyspace=ks, column_families=[cf])
# Replace the table's sstables with the scrubbed ones, just like online scrub would do.
shutil.rmtree(table_path)
shutil.copytree(tmp_dir, table_path)

View File

@@ -9,7 +9,6 @@ import logging
import random
import pytest
from test.cluster.conftest import skip_mode
from test.cluster.lwt.lwt_common import (
BaseLWTTester,
get_token_for_pk,

View File

@@ -9,7 +9,6 @@ import logging
import random
import pytest
from test.cluster.conftest import skip_mode
from test.cluster.lwt.lwt_common import (
BaseLWTTester,
wait_for_tablet_count,

View File

@@ -9,7 +9,6 @@ import time
from typing import Dict
import pytest
from test.cluster.conftest import skip_mode
from test.cluster.lwt.lwt_common import (
BaseLWTTester,
DEFAULT_WORKERS,

View File

@@ -11,7 +11,6 @@ from test.pylib.rest_client import read_barrier
from test.pylib.tablets import get_tablet_replicas, get_tablet_count
from test.pylib.util import wait_for_cql_and_get_hosts
from test.pylib.internal_types import ServerInfo
from test.cluster.conftest import skip_mode
from test.cluster.util import new_test_keyspace
from test.cluster.test_alternator import get_alternator, alternator_config, full_query

View File

@@ -13,7 +13,6 @@ from cassandra.cluster import ConnectionException, NoHostAvailable # type: igno
from test.pylib.scylla_cluster import ReplaceConfig
from test.pylib.manager_client import ManagerClient
from test.cluster.conftest import skip_mode
from test.cluster.util import new_test_keyspace

View File

@@ -16,7 +16,6 @@ import asyncio
import logging
import time
from test.cluster.conftest import skip_mode
from test.cluster.util import get_topology_coordinator, find_server_by_host_id
from test.cluster.mv.tablets.test_mv_tablets import get_tablet_replicas
from test.cluster.util import new_test_keyspace, wait_for

View File

@@ -10,7 +10,6 @@ import pytest
import time
import logging
from test.cluster.conftest import skip_mode
from test.pylib.util import wait_for_view
from test.cluster.mv.tablets.test_mv_tablets import pin_the_only_tablet, get_tablet_replicas
from test.cluster.util import new_test_keyspace

View File

@@ -9,7 +9,6 @@ import logging
import time
import asyncio
import pytest
from test.cluster.conftest import skip_mode
from test.pylib.util import wait_for_view, wait_for
from test.cluster.mv.tablets.test_mv_tablets import pin_the_only_tablet
from test.pylib.tablets import get_tablet_replica

View File

@@ -11,7 +11,6 @@ import time
from test.pylib.manager_client import ManagerClient, wait_for_cql_and_get_hosts
from test.pylib.tablets import get_tablet_replica
from test.pylib.util import wait_for, wait_for_view
from test.cluster.conftest import skip_mode
from test.cluster.util import get_topology_coordinator, new_test_keyspace, reconnect_driver
logger = logging.getLogger(__name__)

View File

@@ -9,7 +9,6 @@ import asyncio
import pytest
import time
import logging
from test.cluster.conftest import skip_mode
from test.pylib.util import wait_for_view
from test.cluster.util import new_test_keyspace, new_test_table, new_materialized_view
from cassandra.cqltypes import Int32Type

View File

@@ -5,7 +5,6 @@
#
import asyncio
import pytest
from test.cluster.conftest import skip_mode
from test.pylib.manager_client import ManagerClient
from test.pylib.util import wait_for_view
from test.cluster.util import new_test_keyspace, reconnect_driver

View File

@@ -9,7 +9,6 @@ import asyncio
import pytest
import logging
from test.cluster.conftest import skip_mode
from test.pylib.util import wait_for_view
from test.cluster.util import new_test_keyspace
from cassandra import ReadTimeout, WriteTimeout

View File

@@ -7,7 +7,6 @@
from cassandra.query import SimpleStatement, ConsistencyLevel
from test.pylib.manager_client import ManagerClient
from test.cluster.conftest import skip_mode
from test.pylib.util import wait_for_view
from test.pylib.internal_types import ServerInfo
from test.cluster.util import new_test_keyspace, wait_for_cql_and_get_hosts
@@ -46,7 +45,7 @@ async def delete_table_sstables(manager: ManagerClient, server: ServerInfo, ks:
break # break unconditionally here to remove only files in `table_dir`
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
@pytest.mark.skip_mode(mode='release', reason='error injections are not supported in release mode')
async def test_staging_backlog_processed_after_restart(manager: ManagerClient):
"""
Verifies that staging sstables are processed after node restart.

View File

@@ -13,7 +13,6 @@ import re
from cassandra.cluster import ConnectionException, NoHostAvailable # type: ignore
from cassandra.query import SimpleStatement, ConsistencyLevel
from test.cluster.conftest import skip_mode
from test.cluster.util import new_test_keyspace
from test.pylib.manager_client import ManagerClient

View File

@@ -11,7 +11,6 @@ import pytest
from cassandra.cluster import Session as CassandraSession
from cassandra.protocol import InvalidRequest
from test.cluster.conftest import skip_mode
from test.pylib.async_cql import _wrap_future
from test.pylib.manager_client import ManagerClient
@@ -19,7 +18,7 @@ logger = logging.getLogger(__name__)
@pytest.mark.asyncio
@pytest.mark.parametrize("rf_kind", ["numeric", "rack_list"])
@skip_mode('release', 'error injections are not supported in release mode')
@pytest.mark.skip_mode(mode='release', reason='error injections are not supported in release mode')
async def test_create_mv_and_index_restrictions_in_tablet_keyspaces(manager: ManagerClient, rf_kind: str):
"""
Verify that creating a materialized view or a secondary index in a tablet-based keyspace
@@ -99,7 +98,7 @@ async def test_create_mv_and_index_restrictions_in_tablet_keyspaces(manager: Man
@pytest.mark.asyncio
@pytest.mark.parametrize("rf_kind", ["numeric", "rack_list"])
@skip_mode('release', 'error injections are not supported in release mode')
@pytest.mark.skip_mode(mode='release', reason='error injections are not supported in release mode')
async def test_alter_keyspace_rf_rack_restriction_with_mv_and_index(manager: ManagerClient, rf_kind: str):
"""
Verify that ALTER KEYSPACE fails if it changes RF so that RF != number of racks
@@ -178,7 +177,7 @@ async def test_alter_keyspace_rf_rack_restriction_with_mv_and_index(manager: Man
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
@pytest.mark.skip_mode(mode='release', reason='error injections are not supported in release mode')
async def test_add_node_in_new_rack_restriction_with_mv(manager: ManagerClient):
"""
Test adding a node to a new rack is rejected when there is a keyspace with RF=Racks and a materialized view
@@ -210,7 +209,7 @@ async def test_add_node_in_new_rack_restriction_with_mv(manager: ManagerClient):
@pytest.mark.parametrize("op", ["remove", "decommission"])
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
@pytest.mark.skip_mode(mode='release', reason='error injections are not supported in release mode')
async def test_remove_node_violating_rf_rack(manager: ManagerClient, op: str):
"""
Test removing a node is rejected when there is a keyspace with RF=Racks and a materialized view

View File

@@ -13,7 +13,6 @@ import random
from test.cqlpy.util import local_process_id
from test.pylib.manager_client import ManagerClient
from test.cluster.object_store.conftest import format_tuples
from test.cluster.conftest import skip_mode
from test.cluster.util import wait_for_cql_and_get_hosts, get_replication, new_test_keyspace
from test.pylib.rest_client import read_barrier
from test.pylib.util import unique_name, wait_all
@@ -316,23 +315,22 @@ async def do_test_simple_backup_and_restore(manager: ManagerClient, object_stora
#
# in this test, we:
# 1. upload:
# prefix: {prefix}/{suffix}
# prefix: {some}/{objects}/{path}
# sstables:
# - 1-TOC.txt
# - 2-TOC.txt
# - ...
# 2. download:
# prefix = {prefix}
# prefix = {some}/{objects}/{path}
# sstables:
# - {suffix}/1-TOC.txt
# - {suffix}/2-TOC.txt
# - 1-TOC.txt
# - 2-TOC.txt
# - ...
suffix = 'suffix'
old_files = list_sstables();
toc_names = [f'{suffix}/{entry.name}' for entry in old_files if entry.name.endswith('TOC.txt')]
toc_names = [f'{entry.name}' for entry in old_files if entry.name.endswith('TOC.txt')]
prefix = f'{cf}/{snap_name}'
tid = await manager.api.backup(server.ip_addr, ks, cf, snap_name, object_storage.address, object_storage.bucket_name, f'{prefix}/{suffix}')
tid = await manager.api.backup(server.ip_addr, ks, cf, snap_name, object_storage.address, object_storage.bucket_name, f'{prefix}')
status = await manager.api.wait_task(server.ip_addr, tid)
assert (status is not None) and (status['state'] == 'done')
@@ -514,7 +512,7 @@ async def do_abort_restore(manager: ManagerClient, object_storage):
assert failed, "Expected at least one restore task to fail after aborting"
@pytest.mark.asyncio
@skip_mode('release', 'error injections are not supported in release mode')
@pytest.mark.skip_mode(mode='release', reason='error injections are not supported in release mode')
async def test_abort_restore_with_rpc_error(manager: ManagerClient, object_storage):
await do_abort_restore(manager, object_storage)
@@ -631,6 +629,7 @@ async def check_data_is_back(manager, logger, cql, ks, cf, keys, servers, topolo
for s in servers:
streamed_to = defaultdict(int)
log, mark = log_marks[s.server_id]
direct_downloads = await log.grep('sstables_loader - Adding downloaded SSTables to the table', from_mark=mark)
res = await log.grep(r'sstables_loader - load_and_stream:.*target_node=(?P<target_host_id>[0-9a-f-]+)', from_mark=mark)
for r in res:
target_host_id = r[1].group('target_host_id')
@@ -663,14 +662,17 @@ async def check_data_is_back(manager, logger, cql, ks, cf, keys, servers, topolo
# asses balance
streamed_to_counts = streamed_to.values()
assert len(streamed_to_counts) > 0
mean_count = statistics.mean(streamed_to_counts)
max_deviation = max(abs(count - mean_count) for count in streamed_to_counts)
if not primary_replica_only:
assert max_deviation == 0, f'if primary_replica_only is False, streaming should be perfectly balanced: {streamed_to}'
continue
if scope == 'node' and len(streamed_to_counts) == 0:
assert len(direct_downloads) > 0
else:
assert len(streamed_to_counts) > 0
mean_count = statistics.mean(streamed_to_counts)
max_deviation = max(abs(count - mean_count) for count in streamed_to_counts)
if not primary_replica_only:
assert max_deviation == 0, f'if primary_replica_only is False, streaming should be perfectly balanced: {streamed_to}'
continue
assert max_deviation < 0.1 * mean_count, f'node {s.ip_addr} streaming to primary replicas was unbalanced: {streamed_to}'
assert max_deviation < 0.1 * mean_count, f'node {s.ip_addr} streaming to primary replicas was unbalanced: {streamed_to}'
async def do_load_sstables(ks, cf, servers, topology, sstables, scope, manager, logger, prefix = None, object_storage = None, primary_replica_only = False, load_fn=do_restore_server):
logger.info(f'Loading {servers=} with {sstables=} scope={scope}')
@@ -938,6 +940,10 @@ async def test_backup_broken_streaming(manager: ManagerClient, s3_storage):
res = cql.execute(f"SELECT COUNT(*) FROM {keyspace}.{table} BYPASS CACHE USING TIMEOUT 600s;")
assert res[0].count == expected_rows, f"number of rows after restore is incorrect: {res[0].count}"
log = await manager.server_open_log(server.server_id)
await log.wait_for("fully contained SSTables to local node from object storage", timeout=10)
# just make sure we had partially contained sstables as well
await log.wait_for("partially contained SSTables", timeout=10)
@pytest.mark.asyncio
async def test_restore_primary_replica_same_rack_scope_rack(manager: ManagerClient, object_storage):

View File

@@ -21,7 +21,6 @@ from cassandra.cluster import NoHostAvailable
from test.pylib.util import wait_for, wait_for_cql_and_get_hosts
from test.cluster.util import wait_for_token_ring_and_group0_consistency, get_coordinator_host
from test.cluster.conftest import skip_mode
from test.pylib.internal_types import ServerUpState
from test.cluster.random_failures.cluster_events import CLUSTER_EVENTS, TOPOLOGY_TIMEOUT, feed_rack_seed, get_random_rack
from test.cluster.random_failures.error_injections import ERROR_INJECTIONS, ERROR_INJECTIONS_NODE_MAY_HANG

View File

@@ -13,7 +13,6 @@ from test.pylib.internal_types import ServerInfo
from test.pylib.manager_client import ManagerClient
from test.pylib.repair import create_table_insert_data_for_repair, get_tablet_task_id
from test.pylib.tablets import get_all_tablet_replicas
from test.cluster.conftest import skip_mode
from test.cluster.util import create_new_test_keyspace, new_test_keyspace
from test.cluster.test_tablets2 import inject_error_on
from test.cluster.tasks.task_manager_client import TaskManagerClient

View File

@@ -14,7 +14,6 @@ from cassandra.cluster import NoHostAvailable # type: ignore
from test.pylib.manager_client import ManagerClient
from test.pylib.rest_client import inject_error
from test.pylib.util import wait_for, wait_for_cql_and_get_hosts
from test.cluster.conftest import skip_mode
from test.cluster.util import new_test_keyspace, new_test_table
logger = logging.getLogger(__name__)

View File

@@ -31,7 +31,6 @@ from test.cluster.util import get_replication
from test.pylib.manager_client import ManagerClient
from test.pylib.util import wait_for
from test.pylib.tablets import get_all_tablet_replicas
from test.cluster.conftest import skip_mode
from test.pylib.tablets import get_tablet_replica
logger = logging.getLogger(__name__)

View File

@@ -25,7 +25,6 @@ import pytest
from test.pylib.manager_client import ManagerClient
from test.pylib.rest_client import inject_error
from test.cluster.conftest import skip_mode
logger = logging.getLogger(__name__)
@@ -325,7 +324,7 @@ async def test_alternator_no_proxy_header_to_proxy_port_fails(alternator_proxy_s
@pytest.mark.asyncio
@pytest.mark.parametrize("use_ssl", [False, True], ids=["http", "https"])
@skip_mode('release', 'error injections are not supported in release mode')
@pytest.mark.skip_mode(mode='release', reason='error injections are not supported in release mode')
async def test_alternator_proxy_protocol_address_in_system_clients(alternator_proxy_server, use_ssl):
"""Test that the source address from the proxy protocol header is correctly
reported in system.clients.

View File

@@ -4,7 +4,6 @@
# SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#
from test.pylib.manager_client import ManagerClient
from test.cluster.conftest import skip_mode
from test.cluster.util import new_test_keyspace
from cassandra import WriteFailure
import pytest

Some files were not shown because too many files have changed in this diff Show More