Commit Graph

38069 Commits

Author SHA1 Message Date
Tomasz Grabiec
e3a8bb7ec9 tablets: Introduce global_tablet_id
Identifies tablet in the scope of the whole cluster. Not to be
confused with tablet replicas, which all share global_tablet_id.

Will be needed by load balancer and tablet migration algorithm to
identify tablets globally.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
f88220aeee stream_transfer_task, multishard_writer: Work with table sharder
So that we can use it on tablet-based tables.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
8cf92d4c86 tablets: Turn tablet_id into a struct
The IDL compiler cannot deal with enum classes like this.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
c2b18ae483 db: Do not create per-keyspace erm for tablet-based tables
This erm is not updated when replicating token metadata in
storage_service::replicate_to_all_cores() so will pin token metadata
version and prevent token metadata barrier from finishing.

It is not necessary to have per-keyspace erm for tablet-based tables,
so just don't create it.
2023-07-25 21:08:51 +02:00
Tomasz Grabiec
91dee5c872 tablets: effective_replication_map: Take transition stage into account when computing replicas 2023-07-25 21:08:51 +02:00
Tomasz Grabiec
dc2ec3f81c tablets: Store "stage" in transition info
It's needed to implement tablet migration. It stores the current step
of tablet migration state machine. The state machine will be advanced
by the topology change coordinator.

See the "Tablet migration" section of topology-over-raft.md
2023-07-25 21:08:02 +02:00
Tomasz Grabiec
05519bd5e5 doc: Document tablet migration state machine and load balancer 2023-07-25 21:08:02 +02:00
Tomasz Grabiec
7851694eaa locator: erm: Make get_endpoints_for_reading() always return read replicas
Just a simplification.

Drop the test case from token_metadata which creates pending endpoints
without normal tokens. It fails after this change with exception:
"sorted_tokens is empty in first_token_index!" thrown from
token_metadata::first_token_index(), which is used when calculating
normal endpoints. This test case is not valid, first node inserts
its tokens as normal without going through bootstrap procedure.
2023-07-25 21:08:01 +02:00
Tomasz Grabiec
b642e69eb3 storage_service: topology_coordinator: Sleep on failure between retries
Avoid failing in a tight loop. Can happen if some node is down, for example.
2023-07-25 21:08:01 +02:00
Tomasz Grabiec
f0e9dbf911 storage_service: topology_coordinator: Simplify coordinator loop
This refactoring removes a boolean and branching which makes it easier
to reason about the flow, and easier to extend it with more steps.
2023-07-25 21:08:01 +02:00
Tomasz Grabiec
b294932cf1 main: Require experimental raft to enable tablets
Tablets depend on the topology changes on raft feature.

Drop "tablets" from suite.yaml of the topology/ suite, which doesn't
use tablets anymore.
2023-07-25 21:08:01 +02:00
Pavel Emelyanov
c46c57d535 messaging_service: Clear list of clients on shutdown
When messaging_service shuts down it first sets _shutting_down to true
and proceeds with stopping clients and servers. Stopping clients, in
turn, is calling client.stop() on each.

Setting _shutting_down is used in two places.

First, when a client is stopped it may happen that it's in the middle of
some operation, which may result in call to remove_error_rpc_client()
and not to call .stop() for the second time it just does nothing if the
shutdown flag is set (see 357c91a076).

Second, get_rpc_client() asserts that this flag is not set, so once
shutdown started it can make sure that it will call .stop() on _all_
clients and no new ones would appear in parallel.

However, after shutdown() is complete the _clients vector of maps
remains intact even though all clients from it are stopped. This is not
very debugging-friendly, the clients are better be removed on shutdown.

fixes: #14624

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14632
2023-07-25 13:08:20 +03:00
Botond Dénes
ed025890e5 scripts/coverage.py: --run: swallow KeyboardInterrupt
It is quite common to stop a tested scylla process with ^C, which will
raise KeyboardInterrupt from subprocess.run(). Catch and swallow this
exception, allowing the post-processing to continue.
The interrupted process has to handle the interrupt correctly too --
flush the coverage data even on premature exit -- but this is for
another patch.

Closes #14815
2023-07-25 12:29:22 +03:00
Kefu Chai
2943d3c1b0 tools/scylla-sstable: s/foo.find(bar) != foo.end()/foo.count(bar) != 0/
just for better readability.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14816
2023-07-25 11:38:44 +03:00
Raphael S. Carvalho
0ac43ea877 Fix stack-use-after-return in mutation source excluding staging
The new test detected a stack-use-after-return when using table's
as_mutation_source_excluding_staging() for range reads.

This doesn't really affect view updates that generate single
key reads only. So the problem was only stressed in the recently
added test. Otherwise, we'd have seen it when running dtests
(in debug mode) that stress the view update path from staging.

The problem happens because the closure was feeded into
a noncopyable_function that was taken by reference. For range
reads, we defer before subsequent usage of the predicate.
For single key reads, we only defer after finished using
the predicate.

Fix is about using sstable_predicate type, so there won't
be a need to construct a temporary object on stack.

Fixes #14812.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes #14813
2023-07-25 10:38:20 +03:00
Botond Dénes
3eec990e4e Merge 'test: use different table names in simple_backlog_controller_test ' from Kefu Chai
in this series, we use different table names in simple_backlog_controller_test. this test is a test exercising sstables compaction strategies. and it creates and keeps multiple tables in a single test session. but we are going to add metrics on per-table basis, and will use the table's ks and cf as the counter's labels. as the metrics subsystem does not allow multiple counters to share the same label. the test will fail when the metrics are being added.

to address this problem, in this change

1. a new ctor is added for `simple_schema`, so we can create `simple_schema` with different names
2. use the new ctor in simple_backlog_controller_test

Fixes #14767

Closes #14783

* github.com:scylladb/scylladb:
  test: use different table names in simple_backlog_controller_test
  test/lib/simple_schema: add ctor for customizing ks.cf
  test/lib/simple_schema: do not hardwire ks.cf
2023-07-25 10:26:33 +03:00
Anna Stuchlik
f6732865b9 doc: doc: move unified installer from web to docs
This commit adds the information on how to install ScyllaDB
without root privileges (with "unified installer", but we've
decided to drop that name - see the page title).

The content taken from the website
https://www.scylladb.com/download/?platform=tar&version=scylla-5.2#open-source
is divided into two sections: "Download and Install" and
"Configure and Run ScyllaDB".
In addition, the "Next Steps" section is also copied from
the website, and adjusted to be in sync with other installation
pages in the docs.

Refs https://github.com/scylladb/scylla-docs/issues/4091

Closes #14781
2023-07-25 10:23:02 +03:00
Benny Halevy
a07440173f storage_service: node_ops_ctl: send_to_all: fix "Node is down for" log message args order
The node and op_desc args are reversed.

Fixes #14807

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #14808
2023-07-24 21:13:06 +03:00
Petr Gusev
5fb8da4181 hints: add fencing
In this commit we just pass a fencing_token
through hint_mutation RPC verb.

The hints manager uses either
storage_proxy::send_hint_to_all_replicas or
storage_proxy::send_hint_to_endpoint to send a hint.
Both methods capture the current erm and use the
corresponding fencing token from it in the
mutation or hint_mutation RPC verb. If these
verbs are fenced out, the server stale_topology_exception
is translated to a mutation_write_failure_exception
on the client with an appropriate error message.
The hint manager will attempt to resend the failed
hint from the commitlog segment after a delay.
However, if delivery is unsuccessful, the hint will
be discarded after gc_grace_seconds.

Closes #14580
2023-07-24 18:12:48 +02:00
Tomasz Grabiec
5b30931406 Merge 'raft topology: restore gossiper eps' from Gusev Petr
We don't load gossiper endpoint states in `storage_service::join_cluster` if `_raft_topology_change_enabled`, but gossiper is still needed even in case of `_raft_topology_change_enabled` mode, since it still contains part of the cluster state. To work correctly, the gossiper needs to know the current endpoints. We cannot rely on seeds alone, since it is not guaranteed that seeds will be up to date and reachable at the time of restart.

The problem was demonstrated by the test `test_joining_old_node_fails`, it fails occasionally with `experimental_features: [consistent-topology-changes]` on the line where it waits for `TEST_ONLY_FEATURE` to become enabled on all nodes. This doesn't happen since `SUPPORTED_FEATURES` gossiper state is not disseminated, and feature_service still relies on gossiper to disseminate information around the cluster.

The series also contains a fix for a problem in `gossiper::do_send_ack2_msg`, see commit message for details.

Fixes #14675

Closes #14775

* github.com:scylladb/scylladb:
  storage_service: restore gossiper endpoints on topology_state_load fix
  gossiper: do_send_ack2_msg fix
2023-07-24 13:55:50 +02:00
Botond Dénes
a8feb7428d Merge 'semaphore mismatch: don't throw an error if both semaphores belong to user' from Michał Jadwiszczak
If semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning, log a `querier_cache_scheduling_group_mismatches` stat and drop cached reader instead of throwing an error.

Until now, semaphore mismatch was only checked in multi-partition queries.  The PR pushes the check to `querier_cache` and perform it on all `lookup_*_querier` methods.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.

This patch doesn't solve a problem with mismatched semaphores because of changes in service levels/scheduling groups but only mitigate it.

Refers: https://github.com/scylladb/scylla-enterprise/issues/3182
Refers: https://github.com/scylladb/scylla-enterprise/issues/3050
Closes: #14770

Closes #14736

* github.com:scylladb/scylladb:
  querier_cache: add stats of scheduling group mismatches
  querier_cache: check semaphore mismatch during querier lookup
  querier_cache: add reference to `replica::database::is_user_semaphore()`
  replica:database: add method to determine if semaphore is user one
2023-07-24 14:13:09 +03:00
Petr Gusev
75694aa080 storage_service: restore gossiper endpoints on topology_state_load fix
We don't load gossiper endpoint states in
storage_service::join_cluster if
_raft_topology_change_enabled, but gossiper
is still needed even in case of
_raft_topology_change_enabled mode, since it
still contains part of the cluster state.
To work correctly, the gossiper needs to know
the current endpoints. We cannot rely on seeds alone,
since it is not guaranteed that seeds will be
up to date and reachable at the time of restart.

The specific scenario of the problem: cluster with
three nodes, the second has the first in seeds,
the third has the first and second. We restart all
the nodes simultaneously, the third node uses its
seeds as _endpoints_to_talk_with in the first gossiper round
and sends SYN to the first and sedond. The first node
hasn't started its gossiper yet, so handle_syn_msg
returns immediately after if (!this->is_enabled());
The third node receives ack from the second node and
no communication from the first node, so it fills
its _live_endpoints collection with the second node
and will never communicate with the first node again.

The problem was demonstrated by the test
test_joining_old_node_fails, it fails occasionally with
experimental_features: [consistent-topology-changes]
on the line where it waits for TEST_ONLY_FEATURE
to become enabled on all nodes. This doesn't happen
since SUPPORTED_FEATURES gossiper state is not
disseminated because of the problem described above.

The first commit is needed since add_saved_endpoint
adds the endpoint with some default app states with locally
incrementing versions and without that fix gossiper
refuses to fill the real app states for this endpoint later.

Fixes: #14675
2023-07-24 12:36:39 +04:00
Kamil Braun
e6099c4685 Merge 'config: set schema_commitlog_segment_size_in_mb to 128 ' from Patryk Jędrzejczak
Fixes #14668

In #14668, we have decided to introduce a new `scylla.yaml` variable for the schema commitlog segment size and set it to 128MB. The reason is that segment size puts a limit on the mutation size that can be written at once, and some schema mutation writes are much larger than average, as shown in #13864. This `schema_commitlog_segment_size_in_mb variable` variable is now added to `scylla.yaml` and `db/config`.

Additionally,  we do not derive the commitlog sync period for schema commitlog anymore because schema commitlog runs in batch mode, so it doesn't need this parameter. It has also been discussed in #14668.

Closes #14704

* github.com:scylladb/scylladb:
  replica: do not derive the commitlog sync period for schema commitlog
  config: set schema_commitlog_segment_size_in_mb to 128
  config: add schema_commitlog_segment_size_in_mb variable
2023-07-24 10:23:34 +02:00
Petr Gusev
87cd7e8741 gossiper: do_send_ack2_msg fix
This commit is a first part of the fix for #14675.
The issue is about the test test_joining_old_node_fails
faling occasionally with
experimental_features: [consistent-topology-changes].
The next commit contains a fix for it, here we
solve the pre-existing gossiper problem
which we stumble upon after the fix.

Local generation for addr may have been
increased since the current node sent
an initial SYN. Comparing versions across different
generations in get_state_for_version_bigger_than
could result in loosing some app states with
smaller versions.

More specifically, consider a cluster with nodes
.1, .2, .3, .3 has .1 and .2 as seeds, .2 has .1
as a seed. Suppose .2 receives a SYN from .3 before
its gossiper starts, and it has a
version 0.24 for .1 in endpoint_states.

The digest from .3 contains 0.25 as a version for .1,
so examine_gossiper produces .1->0.24 as a digest
and this digest is send to .3 as part of the ack.
Before processing this ack, .3 processed an ack from
.1 (scylla sends SYN to many nodes) and updates
its endpoint_states according to it, so now it
has .1->100500.32 for .1. Then
we get to do_send_ack2_msg and call
get_state_for_version_bigger_than(.1, 24).
This returns properties which has version > 24,
ignoring a lot of them with smaller versions
which has been received from .1. Also,
get_state_for_version_bigger_than updates
generation (it copies get_heart_beat_state from
.3), so when we apply the ack in handle_ack2_msg
at .2 we update the generation and now the
skipped app states will only be updated on .2
if somebody change them and increment their version.

Cassandra behaviour is the same in this case
(see https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/gms/GossipDigestAckVerbHandler.java#L86). This is probably less
of a problem for them since most of the time  they
send only one SYN in one gossiper round
(save for unreachable nodes), so there is less
room for conflicts.
2023-07-24 11:52:56 +04:00
Kefu Chai
3ad844a4bb build: cmake: set scylla version strings as CACHED strings
before this change, add_version_library() is a single function
which accomplishes two tasks:

1. build scylla-version target using
2. add an object library

but this has two problems:

1. we should run `SCYLLA-VERSION-GEN` at configure time, instead
   of at build time. otherwise the targets which read from the
   SCYLLA-{VERSION, RELEASE, PRODUCT}-FILE cannot access them,
   unless they are able to read them in their build rules. but
   they always use `file(STRINGS ..)` to read them, and thsee
   `file()` command is executed at configure time. so, this
   is a dead end.
2. we repeat the `file(STRING ..)` multiple places. this is
   not ideal if we want to minimize the repeatings.

so, to address this problem, in this change:

1. use `execute_process()` instead of `add_custom_command()`
   for generating these *-FILE files. so they are always ready
   at build time. this partially reverts bb7d99ad37.
2. extract `generate_scylla_version()` out of `add_version_library()`.
   so we can call the former much earlier than the latter.
   this would allow us to reference the variables defined by
   the `generate_scylla_version()` much earlier.
3. define cached strings in the extracted function, so that
   they can consumed by other places.
4. reference the cached variables in `build_submodule.cmake`.

also, take this opportunity to fix the version string
used in build_submodule.cmake: we should have used
`scylla_version_tilde`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14769
2023-07-24 08:57:19 +03:00
Michał Jadwiszczak
246728cbbb querier_cache: add stats of scheduling group mismatches
Add stats to count dropped queriers because of scheduling group
mismatch.
2023-07-21 19:05:55 +02:00
Michał Jadwiszczak
a5fc53aa11 querier_cache: check semaphore mismatch during querier lookup
Previously semaphore mismatch was checked only in multi-partition
queries and if happened, an internal error was thrown.

This commit pushed the check down to `querier_cache`, so each
`lookup_*_querier` method will check for the mismatch.

What's more, if semaphore mismatch occurs, check whether both semaphores belong
to user. If so, log a warning and drop cached reader instead of
throwing an error.

The mismatch can happen if user's scheduling group changed during
a query. We don't want to throw an error then, but drop and reset
cached reader.
2023-07-21 19:05:50 +02:00
Michał Jadwiszczak
e5c965b280 querier_cache: add reference to replica::database::is_user_semaphore() 2023-07-21 18:58:57 +02:00
Jan Ciolek
decbc841b7 cql3/prepare_expr: fix partially preparing function arguments
Before choosing a function, we prepare the arguments that can be
prepared without a receiver. Preparing an argument makes
its type known, which allows to choose the best overload
among many possible functions.

The function that prepared the argument passes the unprepared
argument by mistake. Let's fix it so that it actually uses
the prepared argument.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #14786
2023-07-21 18:59:56 +03:00
Jan Ciolek
cbc97b41d4 cql.g: make the parser reject INSERT JSON without a JSON value
We allow inserting column values using a JSON value, eg:
```cql
INSERT INTO mytable JSON '{ "\"myKey\"": 0, "value": 0}';
```

When no JSON value is specified, the query should be rejected.

Scylla used to crash in such cases. A recent change fixed the crash
(https://github.com/scylladb/scylladb/pull/14706), it now fails
on unwrapping an uninitialized value, but really it should
be rejected at the parsing stage, so let's fix the grammar so that
it doesn't allow JSON queries without JSON values.

A unit test is added to prevent regressions.

Refs: https://github.com/scylladb/scylladb/pull/14707
Fixes: https://github.com/scylladb/scylladb/issues/14709

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>

Closes #14785
2023-07-21 18:52:47 +03:00
Kefu Chai
d78c6d5f50 test: use different table names in simple_backlog_controller_test
in `simple_backlog_controller_test`, we need to have multiple tables
at the same time. but the default constructor of `simple_schema` always
creates schema with the table name of "ks.cf". we are going to have
a per-table metrics. and the new metric group will use the table name
as its counter labels, so we need to either disable this per-table
metrics or use a different table name for each table.

as in real world, we don't have multiple tables at the same time. it
would be better to stop reusing the same table name in a single test
session. so, in this change, we use a random cf_name for each of
the created table.

Fixes #14767
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-21 19:08:29 +08:00
Kefu Chai
1f596e4669 test/lib/simple_schema: add ctor for customizing ks.cf
some low level tests, like the ones exercising sstables, creates
multiple tables. and we are going to add per-table metrics and
the new metrics uses the ks.cf as part of its unique id. so,
once the per-table metrics is enabled, the sstable tests would fail.
as the metrics subsystem does not allow registering multiple
metric groups with the same name.

so, in this change, we add a new constructor for `simple_schema`,
so that we can customize the the schema's ks and cf when creating
the `simple_schema`. in the next commit, we will use this new
constructor in a sstable test which creates multiple tables.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-21 19:07:45 +08:00
Kefu Chai
306439d3aa test/lib/simple_schema: do not hardwire ks.cf
instead, query the name of ks and cf from the scheme. this change
prepare us for the a simple_schema whose ks and cf can be customized
by its contructor.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-07-21 19:07:45 +08:00
Mikołaj Grzebieluch
37ceef23a6 test: raft: skip test_old_ip_notification_repro in debug mode
Closes #14777
2023-07-21 12:41:03 +02:00
Kefu Chai
a87b0d68cd s3/test: remove the tempdir if test succeeds
in 46616712, we tried to keep the tmpdir only if the test failed,
and keep up to 1 of them using the recently introduced
option of `tmp_path_retention_count`. but it turns out this option
is not supported by the pytest used by our jenkins nodes, where we
have pytest 6.2.5. this is the one shipped along with fedora 36.

so, in this change, the tempdir is removed if the test completes
without failures. as the tempdir contains huge number of files,
and jenkins is quite slow scanning them. after nuking the tempdir,
jenkins will be much faster when scanning for the artifacts.

Fixes #14690
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14772
2023-07-21 12:21:51 +03:00
Nadav Har'El
5860820934 Merge 'mutation/mutation_compactor: validate the input stream' from Botond Dénes
The mutation compactor has a validator which it uses to validate the stream of mutation fragments that passes through it. This validator is supposed to validate the stream as it enters the compactor, as opposed to its compacted form (output). This was true for most fragment kinds except range tombstones, as purged range tombstones were not visible to
the validator for the most part.

This mistake was introduced by https://github.com/scylladb/scylladb/commit e2c9cdb576, which itself was a flawed attempt at fixing an error seen because purged tombstones were not terminated by the compactor.

This patch corrects this mistake by fixing the above problem properly: on page-cut, if the validator has an active tombstone, a closing tombstone is generated for it, to avoid the false-positive error. With this, range tombstones can be validated again as they come in.

The existing unit test checking the validation in the compactor is greatly expanded to check all (I hope) different validation scenarios.

Closes #13817

* github.com:scylladb/scylladb:
  test/mutation_test: test_compactor_validator_sanity_test
  mutation/mutation_compactor: fix indentation
  mutation/mutation_compactor: validate the input stream
  mutation: mutation_fragment_stream_validating_filter: add accessor to underlying validator
  readers: reader-from-fragment: don't modify stream when created without range
2023-07-21 00:26:46 +03:00
Avi Kivity
e00811caac cql3: grammar: reject intValue with no contents
The grammar mistakenly allows nothing to be parsed as an
intValue (itself accepted in LIMIT and similar clauses).

Easily fixed by removing the empty alternative. A unit test is
added.

Fixes #14705.

Closes #14707
2023-07-21 00:24:51 +03:00
Pavel Emelyanov
98609e2115 Merge 's3/test: close using deferred_close() or deferred()' from Kefu Chai
let's use RAII to tear down the client and the input file, so we can
always perform the cleanups even if the test throws.

Closes #14765

* github.com:scylladb/scylladb:
  s3/test: use seastar::deferred() to perform cleanup
  s3/test: close using deferred_close()
2023-07-20 20:05:34 +03:00
Botond Dénes
bf6186ed7e Update tools/java submodule
* tools/java 9f63a96f...585b30fd (1):
  > cassandra-stress: add support for using RackAwareRoundRobinPolicy
2023-07-20 18:13:32 +03:00
Botond Dénes
819b45d107 Merge 'Remove dead replacing_nodes_pending_ranges_updater manipulations' from Pavel Emelyanov
The set in question is read-and-delete-only and thus always empty. Originally it was removed by commit c9993f020d (storage_service: get rid of handle_state_replacing), but some dangling ends were left. Consequentially, the on_alive() callback can get rid of few dead if-else branches

Closes #14762

* github.com:scylladb/scylladb:
  storage_service: Relax on_alive()
  storage_service: Remove _replacing_nodes_pending_ranges_updater
2023-07-20 16:55:44 +03:00
Pavel Emelyanov
9df750fd4c storage_service: Remove dead get_rpc_address()
Unused. Locator calls gossiper directly

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #14761
2023-07-20 16:54:24 +03:00
Botond Dénes
53da97416a Merge 'Remove qctx from system.paxos table access methods' from Pavel Emelyanov
The "fix" is straightforward -- callers of system_keyspace::*paxos* methods need to get system keyspace from somewhere. This time the only caller is storage_proxy::remote that can have system keyspace via direct dependency reference.

Closes #14758

* github.com:scylladb/scylladb:
  db/system_keyspace: Move and use qctx::execute_cql_with_timeout()
  db/system_keyspace: Make paxos methods non-static
  service/paxos: Add db::system_keyspace& argument to some methods
  test: Optionally initialize proxy remote for cql_test_env
  proxy/remote: Keep sharded<db::system_keyspace>& dependency
2023-07-20 16:53:25 +03:00
Botond Dénes
e62325babc Merge 'Compaction reshard task' from Aleksandra Martyniuk
Task manager tasks covering reshard compaction.

Reattempt on https://github.com/scylladb/scylladb/pull/14044. Bugfix for https://github.com/scylladb/scylladb/issues/14618 is squashed with 95191f4.
Regression test added.

Closes #14739

* github.com:scylladb/scylladb:
  test: add test for resharding with non-empty owned_ranges_ptr
  test: extend test_compaction_task.py to test resharding compaction
  compaction: add shard_reshard_sstables_compaction_task_impl
  compaction: invoke resharding on sharded database
  compaction: move run_resharding_jobs into reshard_sstables_compaction_task_impl::run()
  compaction: add reshard_sstables_compaction_task_impl
  compaction: create resharding_compaction_task_impl
2023-07-20 16:43:22 +03:00
Botond Dénes
a35f4f6985 test/mutation_test: test_compactor_validator_sanity_test
Greatly expand this test to check that the compactor validates the input
stream properly.
The test is renamed (the _sanity_test suffix is removed) to reflect the
expanded scope.
2023-07-20 08:48:50 -04:00
Botond Dénes
18ed94e60b mutation/mutation_compactor: fix indentation
Left broken by the previous patch.
2023-07-20 08:48:50 -04:00
Botond Dénes
3d5b70e0d7 mutation/mutation_compactor: validate the input stream
The mutation compactor has a validator which it uses to validate the
stream of mutation fragments that passes through it. This validator is
supposed to validate the stream as it enters the compactor, as opposed
to its compacted form (output). This was true for most fragment kinds
except range tombstones, as purged range tombstones were not visible to
the validator for the most part.
This mistake was introduced by e2c9cdb576, which itself was a flawed
attempt at fixing an error seen because purged tombstones were not
terminated by the compactor.
This patch corrects this mistake by fixing the above problem properly:
on page-cut, if the validator has an active tombstone, a closing
tombstone is generated for it, to avoid the false-positive error. With
this, range tombstones can be validated again as they come in.
2023-07-20 08:48:50 -04:00
Botond Dénes
dbb2a6f03a mutation: mutation_fragment_stream_validating_filter: add accessor to underlying validator 2023-07-20 08:48:50 -04:00
Botond Dénes
93dd16fccc readers: reader-from-fragment: don't modify stream when created without range
The fragment reader currently unconditionally forwards its buffer to the
passed-in partition range. Even if this range is
`query::full_partition_range`, this will involve dropping any fragments
up to the first partitions tart. This causes problems for test users who
intentionally create invalid fragment streams, that don't start with a
partition-start.
Refactor the reader to not do any modifications on the stream, when
neither slice, nor partition-range was passed by the user.
2023-07-20 08:48:50 -04:00
Kefu Chai
fdf61d2f7c compaction_manager: prevent gc-only sstables from being compacted
before this change, there are chances that the temporary sstables
created for collecting the GC-able data create by a certain
compaction can be picked up by another compaction job. this
wastes the CPU cycles, adds write amplification, and causes
inefficiency.

in general, these GC-only SSTables are created with the same run id
as those non-GC SSTables, but when a new sstable exhausts input
sstable(s), we proactively replace the old main set with a new one
so that we can free up the space as soon as possible. so the
GC-only SSTables are added to the new main set along with
the non-GC SSTables, but since the former have good chance to
overlap the latter. these GC-only SSTables are assigned with
different run ids. but we fail to register them to the
`compaction_manager` when replacing the main sstable set.
that's why future compactions pick them up when performing compaction,
when the compaction which created them is not yet completed.

so, in this change,

* to prevent sstables in the transient stage from being picked
  up by regular compactions, a new interface class is introduced
  so that the sstable is always added to registration before
  it is added to sstable set, and removed from registration after
  it is removed from sstable set. the struct helps to consolidate
  the regitration related logic in a single place, and helps to
  make it more obvious that the timespan of an sstable in
  the registration should cover that in the sstable set.
* use a different run_id for the gc sstable run, as it can
  overlap with the output sstable run. the run_id for the
  gc sstable run is created only when the gc sstable writer
  is created. because the gc sstables is not always created
  for all compactions.

please note, all (indirect) callers of
`compaction_task_executor::compact_sstables()` passes a non-empty
`std::function` to this function, so there is no need to check for
empty before calling it. so in this change, the check is dropped.

Fixes #14560
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #14725
2023-07-20 15:47:48 +03:00
Asias He
865891cf02 doc: Repair system_auth with nodetool repair -pr option
Since repair is performed on all nodes, each node can just repair the
primary ranges instead of all owned ranges. This avoids repair ranges
more than once.

Closes #14766
2023-07-20 15:12:20 +03:00