Commit Graph

42744 Commits

Author SHA1 Message Date
Botond Dénes
e1c4e6c151 Merge 'sstables_manager: use maintenance scheduling group to run components reload fiber' from Lakshmi Narayanan Sreethar
PR https://github.com/scylladb/scylladb/pull/18186 introduced a fiber that reloads reclaimed bloom filters when memory becomes available. Use maintenance scheduling group to run that fiber instead of running it in the main scheduling group.

Fixes #18675

Closes scylladb/scylladb#18721

* github.com:scylladb/scylladb:
  sstables_manager: use maintenance scheduling group to run components reload fiber
  sstables_manager: add member to store maintenance scheduling group
2024-05-20 16:38:42 +03:00
Takuya ASADA
33af97ca5a dist/docker: revert dropping systemd package
On 7ce6962141 we dropped openssh-server,
it also dropped systemd package and caused an error on Scylla Operator
(#17787).

This reverts dropping systemd package and fix the issue.

Fix #17787

Closes scylladb/scylladb#18643
2024-05-20 16:38:15 +03:00
Andrei Chekun
bce53efd36 Enrich test results produced by test.py
This PR resolves issue with double count of the test result for topology tests. It will not appear in the consolidated report anymore.
Another fix is to provide a better view which test failed by modifying the test case name in the report enriching it with mode and run id, so making them unique across the run.

The scope of this change is:
1. Modify the test name to have run id in name
2. Add handlers to get logs of test.py and pytest in one file that are related to test, rather than to the full suite
3. Remove topology tests from aggregating them on a suite level in Junit results
4. Add a link to the logs related to the failed tests in Junit results, so it will be easier to navigate to all logs related to test
5. Gather logs related to the failed test to one directory for better logs investigation

Ref: scylladb/scylladb#17851

Closes scylladb/scylladb#18277
2024-05-20 15:33:57 +02:00
Avi Kivity
52fe351c31 Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec
This is needed to avoid severe imbalance between shards which can
happen when some table grows and is split. The inter-node balance can
be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N
then there is not even a possibility of moving tablets around to fix the imbalance.
The only way to bring the system to balance is to move tablets within the nodes.

The system is not prepared for intra-node migration currently. Request coordination
is host-based, while for intra-node migration it should be (also) shard-based.
The solution employed here is to keep the coordination between nodes as-is,
and for intra-node migration storage_proxy-level coordinator is not aware of
the migration (no pending host). The replica-side request handler will be a
second-level coordinator which routes requests to shards, similar to how
the first-level coordinator routes them to hosts.

Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.

The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.

The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.

perf-simple-query test results show no signs of regression:

Command: perf-simple-query -c1 -m1G --write --tablets --duration=10

Before:

> 83294.81 tps ( 59.5 allocs/op,  14.3 tasks/op,   53725 insns/op,        0 errors)
> 87756.72 tps ( 59.5 allocs/op,  14.3 tasks/op,   54049 insns/op,        0 errors)
> 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> 86211.38 tps ( 59.7 allocs/op,  14.3 tasks/op,   54219 insns/op,        0 errors)
> 86559.89 tps ( 59.6 allocs/op,  14.3 tasks/op,   54188 insns/op,        0 errors)
> 86609.39 tps ( 59.6 allocs/op,  14.3 tasks/op,   54117 insns/op,        0 errors)
> 87464.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   54039 insns/op,        0 errors)
> 86185.43 tps ( 59.6 allocs/op,  14.3 tasks/op,   54169 insns/op,        0 errors)
> 86254.71 tps ( 59.6 allocs/op,  14.3 tasks/op,   54139 insns/op,        0 errors)
> 83395.35 tps ( 60.2 allocs/op,  14.4 tasks/op,   54693 insns/op,        0 errors)
>
> median 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> median absolute deviation: 243.04
> maximum: 87756.72
> minimum: 83294.81
>

After:

> 85523.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   53872 insns/op,        0 errors)
> 89362.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54226 insns/op,        0 errors)
> 88167.55 tps ( 59.7 allocs/op,  14.3 tasks/op,   54400 insns/op,        0 errors)
> 87044.40 tps ( 59.7 allocs/op,  14.3 tasks/op,   54310 insns/op,        0 errors)
> 88344.50 tps ( 59.6 allocs/op,  14.3 tasks/op,   54289 insns/op,        0 errors)
> 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> 88725.46 tps ( 59.6 allocs/op,  14.3 tasks/op,   54230 insns/op,        0 errors)
> 88640.08 tps ( 59.6 allocs/op,  14.3 tasks/op,   54210 insns/op,        0 errors)
> 90306.31 tps ( 59.4 allocs/op,  14.3 tasks/op,   54043 insns/op,        0 errors)
> 87343.62 tps ( 59.8 allocs/op,  14.3 tasks/op,   54496 insns/op,        0 errors)
>
> median 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> median absolute deviation: 1007.41
> maximum: 90306.31
> minimum: 85523.06

Command (reads): perf-simple-query -c1 -m1G  --tablets --duration=10

Before:

> 95860.18 tps ( 63.1 allocs/op,  14.1 tasks/op,   42476 insns/op,        0 errors)
> 97537.69 tps ( 63.1 allocs/op,  14.1 tasks/op,   42454 insns/op,        0 errors)
> 97549.23 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97511.29 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97227.32 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 94031.94 tps ( 63.1 allocs/op,  14.1 tasks/op,   42441 insns/op,        0 errors)
> 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> 96401.70 tps ( 63.1 allocs/op,  14.1 tasks/op,   42473 insns/op,        0 errors)
> 96573.77 tps ( 63.1 allocs/op,  14.1 tasks/op,   42440 insns/op,        0 errors)
> 96340.54 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
>
> median 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> median absolute deviation: 571.20
> maximum: 97549.23
> minimum: 94031.94
>

After:

> 99794.67 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 101244.99 tps ( 63.1 allocs/op,  14.1 tasks/op,   42472 insns/op,        0 errors)
> 101128.37 tps ( 63.1 allocs/op,  14.1 tasks/op,   42485 insns/op,        0 errors)
> 101065.27 tps ( 63.1 allocs/op,  14.1 tasks/op,   42465 insns/op,        0 errors)
> 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> 101413.31 tps ( 63.1 allocs/op,  14.1 tasks/op,   42463 insns/op,        0 errors)
> 101464.92 tps ( 63.1 allocs/op,  14.1 tasks/op,   42466 insns/op,        0 errors)
> 101086.74 tps ( 63.1 allocs/op,  14.1 tasks/op,   42488 insns/op,        0 errors)
> 101559.09 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
> 100742.58 tps ( 63.1 allocs/op,  14.1 tasks/op,   42491 insns/op,        0 errors)
>
> median 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> median absolute deviation: 200.33
> maximum: 101559.09
> minimum: 99794.67
>

Fixes #16594

Closes scylladb/scylladb#18026

* github.com:scylladb/scylladb:
  Implement fast streaming for intra-node migration
  test: tablets_test: Test sharding during intra-node migration
  test: tablets_test: Check sharding also on the pending host
  test: py: tablets: Test writes concurrent with migration
  test: py: tablets: Test crash during intra-node migration
  api, storage_service: Introduce API to wait for topology to quiesce
  dht, replica: Remove deprecated sharder APIs
  test: Avoid using deprecated sharded API
  db: do_apply_many() avoid deprecated sharded API
  replica: mutation_dump: Avoid deprecated sharder API
  repair: Avoid deprecated sharder API
  table: Remove optimization which returns empty reader when key is not owned by the shard
  dht: is_single_shard: Avoid deprecated sharder API
  dht: split_range_to_single_shard: Work with static_sharder only
  dht: ring_position_range_sharder: Avoid deprecated sharder APIs
  dht: token: Avoid use of deprecated sharder API by switching to static_sharder
  selective_token_sharder: Avoid use of deprecated sharder API
  docs: Document tablet sharding vs tablet replica placement
  readers/multishard.cc: use shard_for_reads() instead of shard_of()
  multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()
  storage_proxy: Extract common code to apply mutations on many shards according to sharder
  storage_proxy: Prepare per-partition rate-limiting for intra-node migration
  storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()
  storage_proxy: Prepare mutate_hint() for intra-node tablet migration
  commitlog_replayer: Avoid deprecated sharder::shard_of()
  lwt: Avoid deprecated sharder::shard_of()
  compaction: Avoid deprecated sharder::shard_of()
  dht: Extract dht::static_sharder
  replica: Deprecate table::shard_of()
  locator: Deprecate effective_replication_map::shard_of()
  dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
  tests: tablets: py: Add intra-node migration test
  tests: tablets: Test that drained nodes are not balanced internally
  tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load
  tests: tablets: Verify that disabling balancing results in no intra-node migrations
  tests: tablets: Check that nodes are internally balanced
  tests: tablets: Improve debuggability by showing which rows are missing
  tablets, storage_service: Support intra-node migration in move_tablet() API
  tablet_allocator: Generate intra-node migration plan
  tablet_allocator: Extract make_internode_plan()
  tablet_allocator: Maintain candidate list and shard tablet count for target nodes
  tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions
  tablets, streaming: Implement tablet streaming for intra-node migration
  dht, auto_refreshing_sharder: Allow overriding write selector
  multishard_writer: Handle intra-node migration
  storage_proxy: Handle intra-node tablet migration for writes
  tablets: Get rid of tablet_map::get_shard()
  tablets: Avoid tablet_map::get_shard in cleanup
  tablets: test: Use sharder instead of tablet_map::get_shard()
  tablets: tablet_sharder: Allow working with non-local host
  sharding: Prepare for intra-node-migration
  docs: Document sharder use for tablets
  tablets: Introduce tablet transition kind for intra-node migration
  tests: tablets: Fix use-after-move of skiplist in rebalance_tablets()
  sstables, gdb: Track readers in a linked list
  raft topology: Fix global token metadata barrier to not fence ahead of what is drained
2024-05-20 16:13:01 +03:00
Kefu Chai
a517fcf970 service/storage_proxy: capture tr_state by copy in handle_paxos_accept()
this change is inspired by following warning from clang-tidy

```
Warning: /home/runner/work/scylladb/scylladb/service/storage_proxy.cc:884:13: warning: 'tr_state' used after it was moved [bugprone-use-after-move]
  884 |         if (tr_state) {
      |             ^
/home/runner/work/scylladb/scylladb/service/storage_proxy.cc:872:139: note: move occurred here
  872 |         auto f = get_schema_for_read(proposal.update.schema_version(), src_addr, *timeout).then([&sp = _sp, &sys_ks = _sys_ks, tr_state = std::move(tr_state),
      |                                                                                                                                           ^
```

this is not a false positive. as `tr_state` is a captured by move for
constructing a variable in the captured list of a lambda which is in
turn passed to the expression evaluated to `f`.

even the expression itself is not evaluated yet when we reference
`tr_state` to check if it is empty after preparing the expression,
`tr_state` is already moved away into the captured variable. so
at that moment, the statement of `f = f.finally(...)` is never
evaluated, because `tr_state` is always empty by then.

so before this change, the trace message is never recorded.

in this change, we address this issue by capturing `tr_state` by
copying it. as `tr_state` is backed by a `lw_shared_ptr`, the overhead is
neglectable.

after this change, the tracing message is recorded.

the change introduced this issue was 548767f91e.

please note, we could coroutinize this function to improve its
readability, but since this is a fix and should be backported,
let's start with a minimal fix, and worry about the readability
in a follow-up change.

Refs 548767f91e
Fixes #18725
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18702
2024-05-20 12:58:49 +03:00
Kefu Chai
40ce52c3cc test: use generic boost_test_print_type()
in this change, we trade the `boost_test_print_type()` overloads
for the generic template of `boost_test_print_type()`, except for
those in the very small tests, which presumably want to keep
themselves relative self-contained.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18727
2024-05-20 12:56:20 +03:00
Botond Dénes
0e23cd45ad Merge 'feature: grandfather some old cluster features' from Avi Kivity
This series grandfathers the following features:

  MD_SSTABLE_FORMAT
  ME_SSTABLE feature
  VIEW_VIRTUAL_COLUMNS
  DIGEST_INSENSITIVE_TO_EXPIRY
  CDC
  NONFROZEN_UDTS
  PER_TABLE_PARTITIONERS
  PER_TABLE_CACHING
  DIGEST_FOR_NULL_VALUES
  CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX

Note that for the last (CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX) some code remains to support indexes created before the new feature was adopted.

Each patch names the version where the feature was introduced.

Closes scylladb/scylladb#18428

* github.com:scylladb/scylladb:
  feature, index: grandfather CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX
  feature: grandfather DIGEST_FOR_NULL_VALUES
  storage_proxy: drop use of MD5 as a digest algorithm
  feature: grandfather PER_TABLE_CACHING
  feature: grandfather LWT
  feature: grandfather HINTED_HANDOFF_SEPARATE_CONNECTION
  feature: grandfather PER_TABLE_PARTITIONERS
  test: schema_change_test: regenerate digest for PER_TABLE_PARTITIONERS
  test: test_schema_change_digest: drop unneeded reference digests
  feature: grandfather NONFROZEN_UDTS
  feature: grandfather CDC
  feature: grandfather DIGEST_INSENSITIVE_TO_EXPIRY
  feature: grandfather VIEW_VIRTUAL_COLUMNS
  feature: grandfather ME_SSTABLE feature
  feature: grandfather MD_SSTABLE_FORMAT
2024-05-20 11:48:07 +03:00
Lakshmi Narayanan Sreethar
6f58768c46 sstables_manager: use maintenance scheduling group to run components reload fiber
Fixes #18675

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-19 15:23:45 +05:30
Lakshmi Narayanan Sreethar
79f6746298 sstables_manager: add member to store maintenance scheduling group
Store that maintenance scheduling group inside the sstables_manager. The
next patch will use this to run the components reloader fiber.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-05-19 15:23:45 +05:30
Avi Kivity
54a82fed6b feature, index: grandfather CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX
This feature corrected how we store the token in secondary indexes. It
was introduced in 7ff72b0ba5 (2020; 4.4) and can now be assumed present
everywhere. Note that we still support indexes created with the old format.
2024-05-18 00:24:11 +03:00
Avi Kivity
2fbd78c769 feature: grandfather DIGEST_FOR_NULL_VALUES
The DIGEST_FOR_NULL_VALUES feature was added in 21a77612b3 (2020; 4.4)
and can now be assumed to be always present. The hasher which it invoked
is removed.
2024-05-18 00:24:00 +03:00
Avi Kivity
879583c489 storage_proxy: drop use of MD5 as a digest algorithm
The XXHASH feature was introduced in 0bab3e59c2 (2017; 2.2) and made
mandatory in defe6f49df (2020; 4.4), but some vestiges remain.
Remove them now. Note that md5_hasher itself is still in use by
other components, so it cannot be removed.
2024-05-18 00:23:47 +03:00
Avi Kivity
7c264e8a71 feature: grandfather PER_TABLE_CACHING
The PER_TABLE_CACHING feature was added in 0475dab359 (2020; 4.2)
and can now be assumed to be always present.
2024-05-18 00:23:30 +03:00
Avi Kivity
d52c424a5f feature: grandfather LWT
LWT was make non-experimental in 9948f548a5 (2020; 4.1) and can now be
assumed to be always present.
2024-05-18 00:20:53 +03:00
Avi Kivity
93088d0921 feature: grandfather HINTED_HANDOFF_SEPARATE_CONNECTION
The HINTED_HANDOFF_SEPARATE_CONNECTION feature was introduced in 3a46b1bb2b (2019; 3.3)
and can be assumed always present.
2024-05-18 00:18:27 +03:00
Avi Kivity
3bead8cea0 feature: grandfather PER_TABLE_PARTITIONERS
The PER_TABLE_PARTITIONERS feature was added in 90df9a44ce (2020; 4.0)
and can now be assumed to be always present. We also remove the associated
schema_feature.
2024-05-18 00:15:07 +03:00
Avi Kivity
6b532fd40b test: schema_change_test: regenerate digest for PER_TABLE_PARTITIONERS
The first digest tested was generated without the PER_TABLE_PARTITIONERS
schema feature. We're about to make that feature mandatory, so we won't
be able (and won't need) to generate a digest without it.

Update the digest to include the feature. Note it wasn't untested before,
we have a test with schema_features::full().
2024-05-18 00:14:43 +03:00
Avi Kivity
c4d8b17f4c test: test_schema_change_digest: drop unneeded reference digests
digests[0] was used by the VIEW_VIRTUAL_COLUMNS feature, which
no longer exists.

digests[1] is the same as digests[2], so drop it.
2024-05-17 20:41:20 +03:00
Avi Kivity
93113da01b feature: grandfather NONFROZEN_UDTS
The NONFROZEN_UDTS feature was added in e74b5deb5d (2019; 3.2)
and can now be assumed to be always present.
2024-05-17 20:41:20 +03:00
Avi Kivity
c7d7ca2c23 feature: grandfather CDC
The CDC feature was made non-experimental in e9072542c1 (2020; 4.4)
and can now be assumed to be always present. We also remove the corresponding
schema_feature.
2024-05-17 20:41:20 +03:00
Avi Kivity
82ad2913ca feature: grandfather DIGEST_INSENSITIVE_TO_EXPIRY
The DIGEST_INSENSITIVE_TO_EXPIRY feature was added in 9de071d214 (2019; 3.2)
and can now be assumed to be always present. We enable the corresponding
schema_feature unconditionally.

We do not remove the corresponding schema feature, because it can be disabled
when the related TABLE_DIGEST_INSENSITIVE_TO_EXPIRY is present.
2024-05-17 20:41:19 +03:00
Avi Kivity
b5f6021a6b feature: grandfather VIEW_VIRTUAL_COLUMNS
The VIEW_VIRTUAL_COLUMNS feature was added in a108df09f9 (2019; 3.1)
and can now be assumed to be always present.

The corresponding schema_feature is removed. Note schema_features are not sent
over the wire. A digest calculation without VIEW_VIRTUAL_COLUMNS is no longer tested.
2024-05-17 20:41:19 +03:00
Avi Kivity
7952200c8c feature: grandfather ME_SSTABLE feature
"me" format sstables were introduced in d370558279 (Jan 2022; 5.1)
and so can be assumed always present. The listener that checks when
the cluster understands ME_SSTABLE was removed and in its place
we default to sstable_version_types::me (and call on_enabled()
immediately).
2024-05-17 20:41:19 +03:00
Avi Kivity
6d0c0b542c feature: grandfather MD_SSTABLE_FORMAT
"md" sstable support was introduced in e8d7744040 (2020; 4.4)
and so can be assumed to be present on all versions we upgrade from.
Nothing appears to depend on it.
2024-05-17 20:41:19 +03:00
Anna Stuchlik
c93a7d2664 doc: replace 5.5 with 6.0 in SStable docs (me)
This commit replaces the version number 5.5 with 6.0,
because 5.5 has never been released.

This is a follow-up to https://github.com/scylladb/scylladb/pull/16716.

Refs https://github.com/scylladb/scylladb/issues/16551
Refs https://github.com/scylladb/scylladb/issues/18580

Closes scylladb/scylladb#18730
2024-05-17 16:34:18 +03:00
Botond Dénes
db70e8dd5f test/cql-pytest: test_tombstone_limit.py: enable xfailing tests
These tests were marked as xfail because they use to fail with tablets.
They don't anymore, so remove the xfail.

Fixes: #16486

Closes scylladb/scylladb#18671
2024-05-16 20:14:47 +03:00
Nadav Har'El
c7aa47354a Merge 'mutation_fragment_stream_validating_filter: respect validating_level::none' from Botond Dénes
Even when configured to not do any validation at all, the validator still did some. This small series fixes this, and adds a test to check that validation levels in general are respected, and the validator doesn't validate more than it is asked to.

Fixes: #18662

Closes scylladb/scylladb#18667

* github.com:scylladb/scylladb:
  test/boost/mutation_fragment_test.cc: add test for validator validation levels
  mutation: mutation_fragment_stream_validating_filter: fix validation_level::none
  mutation: mutation_fragment_stream_validating_filter: add raises_error ctor parameter
2024-05-16 19:57:49 +03:00
Kamil Braun
734c5de314 Merge 'fix test teardown race with ongoing test operation' from Artsiom Mishuta
This commit brings several new features in scylla_cluster.py to fix runaway asyncio task problems in topology tests

- Start-Stop Lock and Stop Event in ScyllaServer
- Tasks History, Wait for tasks from Tasks History and Manager broken state in ScyllaClusterManager
- make ManagerClient object function scope
- test_finished_event in ManagerClient

Fixes: scylladb/scylladb#16472
Fixes: scylladb/scylladb#16651

Closes scylladb/scylladb#18236

* github.com:scylladb/scylladb:
  test/pylib: Introduce ManagerClient.test_finished_event
  test/topology: make ManagerClient object function scope
  test/pylib: Introduce Manager broken state:
  test/pylib: Wait for tasks from Tasks History:
  test/pylib: Introduce Tasks History:
  test/pylib: Introduce Stop Event
  test/pylib: Introduce Start-Stop Lock:
2024-05-16 17:42:00 +02:00
Kefu Chai
759156b56d test: perf: alternator: mark format string as constexpr
before this change, we use `update_item_suffix` as a format string
fed to `format(...)`, which is resolved to `seastar::format()`.
but with a patch which migrates the `seastar::format()` to the backend
with compile-time format check, the caller sites using `format()` would
fail to build, because `update_item_suffix` is not a `constexpr`:
```
/home/kefu/.local/bin/clang++ -DFMT_SHARED -DSCYLLA_BUILD_MODE=release -DSEASTAR_API_LEVEL=7 -DSEASTAR_LOGGER_COMPILE_TIME_FMT -DSEASTAR_LOGGER_TYPE_STDOUT -DSEASTAR_SCHEDULING_GROUPS_COUNT=16 -DSEASTAR_SSTRING -DXXH_PRIVATE_API -DCMAKE_INTDIR=\"RelWithDebInfo\" -I/home/kefu/dev/scylladb -I/home/kefu/dev/scylladb/build/gen -I/home/kefu/dev/scylladb/seastar/include -I/home/kefu/dev/scylladb/build/seastar/gen/include -I/home/kefu/dev/scylladb/build/seastar/gen/src -isystem /home/kefu/dev/scylladb/abseil -isystem /home/kefu/dev/scylladb/build/rust -ffunction-sections -fdata-sections -O3 -g -gz -std=gnu++20 -fvisibility=hidden -Wall -Werror -Wextra -Wno-error=deprecated-declarations -Wimplicit-fallthrough -Wno-c++11-narrowing -Wno-deprecated-copy -Wno-mismatched-tags -Wno-missing-field-initializers -Wno-overloaded-virtual -Wno-unsupported-friend -Wno-enum-constexpr-conversion -Wno-unused-parameter -ffile-prefix-map=/home/kefu/dev/scylladb=. -march=westmere -mllvm -inline-threshold=2500 -fno-slp-vectorize -U_FORTIFY_SOURCE -Werror=unused-result -MD -MT test/perf/CMakeFiles/test-perf.dir/RelWithDebInfo/perf_alternator.cc.o -MF test/perf/CMakeFiles/test-perf.dir/RelWithDebInfo/perf_alternator.cc.o.d -o test/perf/CMakeFiles/test-perf.dir/RelWithDebInfo/perf_alternator.cc.o -c /home/kefu/dev/scylladb/test/perf/perf_alternator.cc
/home/kefu/dev/scylladb/test/perf/perf_alternator.cc:249:69: error: call to consteval function 'fmt::basic_format_string<char, const char (&)[1]>::basic_format_string<const char *, 0>' is not a constant expression
  249 |     return make_request(cli, "UpdateItem", prefix + seastar::format(update_item_suffix, ""));
      |                                                                     ^
/usr/include/fmt/core.h:2776:67: note: read of non-constexpr variable 'update_item_suffix' is not allowed in a constant expression
 2776 |   FMT_CONSTEVAL FMT_INLINE basic_format_string(const S& s) : str_(s) {
      |                                                                   ^
/home/kefu/dev/scylladb/test/perf/perf_alternator.cc:249:69: note: in call to 'basic_format_string<const char *, 0>(update_item_suffix)'
  249 |     return make_request(cli, "UpdateItem", prefix + seastar::format(update_item_suffix, ""));
      |                                                                     ^~~~~~~~~~~~~~~~~~
/home/kefu/dev/scylladb/test/perf/perf_alternator.cc:198:6: note: declared here
  198 | auto update_item_suffix = R"(
      |      ^
```

so, to prepare the change switching to compile-time format checking,
let's mark this variable `static constexpr`. this is also more correct,
as this variable is

* a compile time constant, and
* is not shared across different compilation units.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18685
2024-05-16 15:18:42 +03:00
Avi Kivity
6982de6dde Merge 'Fix stalls in forward_service::dispatch() with large tablet count' from Raphael "Raph" Carvalho
With a large tablet count, e.g. 128k, forward_service::dispatch() can potentially stall when grouping ranges per endpoint.

`    Reactor stalled for 4 ms on shard 1. Backtrace: 0x5eb15ea 0x5eb09f5 0x5eb1daf 0x3dbaf 0x2d01e57 0x33f7d1e 0x348255f 0x2d005d4 0x2d3d017 0x2d3d58c 0x2d3d225 0x5e59622 0x5ec328f 0x5ec4577 0x5ee84e0 0x5e8394a 0x8c946 0x11296f
`

Also there are inefficient copies that are being removed. partition_range_vector for a single endpoint can grow beyond 1M.

Closes scylladb/scylladb#18695

* github.com:scylladb/scylladb:
  service: fix indentation in dispatch()
  service: fix reactor stall with large tablet count
  service: avoid potential expensive copies in forward_service::dispatch()
  service: coroutinize forward_service::dispatch()
2024-05-16 15:17:43 +03:00
Kefu Chai
617e532859 db: config: drop operator<<() for error_injection_at_startup
it is not used anymore, so let's drop it.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#18701
2024-05-16 15:10:57 +03:00
Pavel Emelyanov
dffd985401 data_dictionary: Resurrect formatter for keyspace_metadata
It was commented out by the a439ebcfce (treewide: include fmt/ranges.h
and/or fmt/std.h) , probably by mistake

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18665
2024-05-16 15:09:45 +03:00
Nadav Har'El
27ab560abd cql: fix hang during certain SELECT statements
The function intersection(r1,r2) in statement_restrictions.cc is used
when several WHERE restrictions were applied to the same column.
For example, for "WHERE b<1 AND b<2" the intersection of the two ranges
is calculated to be b<1.

As noted in issue #18690, Scylla is inconsistent in where it allows or
doesn't allow these intersecting restrictions. But where they are
allowed they must be implemented correctly. And it turns out the
function intersection() had a bug that caused it to sometimes enter
an infinite loop - when the intent was only to call itself once with
swapped parameters.

This patch includes a test reproducing this bug, and a fix for the
bug. The test hangs before the fix, and passes after the fix.

While at it, I carefully reviewed the entire code used to implement
the intersection() function to try to make sure that the bug we found
was the only one. I also added a few more comments where I thought they
were needed to understand complicated logic of the code.

The bug, the fix and the test were originally discovered by
Michał Chojnowski.

Fixes #18688
Refs #18690

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#18694
2024-05-16 11:25:44 +03:00
Piotr Dulikowski
68eca3778c Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros
This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed.

See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions.

This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh.

The existing mechanism works in the following way:

* Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes
* Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking
* We keep track of the percent of consumed units on each node, this is called `view update backlog`.
* Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level.

This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates.

To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`.

The new algorithm of view update generation looks something like this:
```c++
for(;;) {
    auto updates = generate_updates_batch_with_max_100_rows();
    co_await seastar::sleep(calculate_sleep_time_from_backlogs());
    spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379

Closes scylladb/scylladb#16819

* github.com:scylladb/scylladb:
  test: add test for bad_allocs during large mv queries
  mv: throttle view update generation for large queries
  exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception
  db/view: extract view throttling delay calculation to a global function
  view_update_generator: add get_storage_proxy()
  storage_proxy: make view backlog getters public
2024-05-16 08:22:54 +02:00
Botond Dénes
af9e173c99 Merge 'repair: Don't get topology via database' from Pavel Emelyanov
Database has token-metadata onboard and other services use it to get topology from. Repair code has simpler and cleaner ways to get access to topology.

Closes scylladb/scylladb#18677

* github.com:scylladb/scylladb:
  repair: Get topology via replication map
  repair: Use repair_service::my_address() in handlers
  repair: Remove repair_meta::_myip
  repair: Use repair_meta::myip() everywhere
  repair: Add repair_service::my_address() method
2024-05-16 08:28:14 +03:00
Raphael S. Carvalho
715ae689c0 Implement fast streaming for intra-node migration
With intra-node migration, all the movement is local, so we can make
streaming faster by just cloning the sstable set of leaving replica
and loading it into the pending one.

This cloning is underlying storage specific, but s3 doesn't support
snapshot() yet (th sstables::storage procedure which clone is built
upon). It's only supported by file system, with help of hard links.
A new generation is picked for new cloned sstable, and it will
live in the same directory as the original.

A challenge I bumped into was to understand why table refused to
load the sstable at pending replica, as it considered them foreign.
Later I realized that sharder (for reads) at this stage of migration
will point only to leaving replica. It didn't fail with mutation
based streaming, because the sstable writer considers the shard --
that the sstable was written into -- as its owner, regardless of what
sharder says. That was fixed by mimicking this behavior during
loading at pending.

test:
./test.py --mode=dev intranode --repeat=100 passes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
a179f37780 test: tablets_test: Test sharding during intra-node migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
5f32d2ddb6 test: tablets_test: Check sharding also on the pending host 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
6d809c75fb test: py: tablets: Test writes concurrent with migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
ad02d85c16 test: py: tablets: Test crash during intra-node migration 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7956a2991e api, storage_service: Introduce API to wait for topology to quiesce 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
679baff25a dht, replica: Remove deprecated sharder APIs 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
32a191384a test: Avoid using deprecated sharded API
There is not tablet migration in unit tests, so shard_of() can be
safely replaced with shard_for_reads(). Even if it's used for writes.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
539460dd71 db: do_apply_many() avoid deprecated sharded API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
0f50504c39 replica: mutation_dump: Avoid deprecated sharder API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7bf5733fa5 repair: Avoid deprecated sharder API 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7c03646f99 table: Remove optimization which returns empty reader when key is not owned by the shard
This check would lead to correctness issues with intra-node migration
because the shard may switch during read, from "read old" to "read
new". If the coordinator used "read old" for shard routing, but table
on the old shard is already using "read new" erm, such a read would
observe empty result, which is wrong.

Drop the optimization. In the scenario above, read will observe all
past writes because:

  1) writes are still using "write both"

  2) writes are switched to "write new" only after all requests which
  might be using "read old" are done

Replica-side coordinators should already route single-key requests to
the correct shard, so it's not important as an optimization.

This issue shows how assumptions about static sharding are embedded in
the current code base and how intra-node migration, by violating those
assumptions, can lead to correctness issues.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
26f2e6aa8e dht: is_single_shard: Avoid deprecated sharder API
All current uses are used in the read path.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c9e6b4dca7 dht: split_range_to_single_shard: Work with static_sharder only
In preparation for intra-node tablet migration, to avoid
using deprecated sharder APIs.

This function is used for generating sstable sharding metadata.
For tablets, it is not invoked, so we can safely work with the
static sharder. The call site already passes static_sharder only.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
c380aecf64 dht: ring_position_range_sharder: Avoid deprecated sharder APIs
In preparation for tablet intra-node migration.

Existing uses are for reads, so it's safe to use shard_for_reads():
  - in multishard reader
  - in forward_service

The ring_position_range_vector_sharder is used when computing sstable
shards, which for intra-node migration should use the view for
reads. If we haven't completed streaming, sstables should be attached
to the old shard (used by reads). When in write-both-read-new stage,
streaming is complete, reads are using the new shard, and we should
attach sstables to the new shard.

When not in intra-node migration, the view for reads on the pending
node will return the pending shard even if read selector is "read old".
So if pending node restarts during streaming, we will attach to sstables
to the shard which is used by writes even though we're using the selector
for reads.
2024-05-16 00:28:47 +02:00