2245 Commits

Author SHA1 Message Date
Pavel Solodovnikov
c0854a0f62 raft: create system tables only when raft experimental feature is set
Also introduce a tiny function to return raft-enabled db config
for cql testing.

Tests: unit(dev)

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210826091432.279532-1-pa.solodovnikov@scylladb.com>
2021-08-26 12:21:12 +03:00
Benny Halevy
4476800493 flat_mutation_reader: get rid of timeout parameter
Now that the timeout is taken from the reader_permit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
fe479aca1d reader_permit: add timeout member
To replace the timeout parameter passed
to flat_mutation_reader methods.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Pavel Solodovnikov
22794efc22 db: add experimental option for raft
Introduce `raft` experimental option.
Adjust the tests to accommodate the new option.

It's not enabled by default when providing
`--experimental=true` config option and should be
requested explicitly via `--experimental-options=raft`
config option.

Hide the code related to `raft_group_registry` behind
the switch. The service object is still constructed
but no initialization is performed (`init()` is not
called) if the flag is not set.

Later, other raft-related things, such as raft schema
changes, will also use this flag.

Also, don't introduce a corresponding gossiper feature
just yet, because again, it should be done after the
raft schema changes API contract is stabilized.

This will be done in a separate series, probably related to
implementing the feature itself.

Tests: unit(dev)

Ref #9239.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20210823121956.167682-1-pa.solodovnikov@scylladb.com>
2021-08-23 17:45:58 +03:00
Benny Halevy
e9aff2426e everywhere: make deferred actions noexcept
Prepare for updating seastar submodule to a change
that requires deferred actions to be noexcept
(and return void).
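
To illustrate the requirement, here is a minimal sketch of a scope guard that rejects actions which may throw or which return a value. The names `deferred_action_sketch` and `defer_sketch` are hypothetical and this is not Seastar's implementation; it only shows how such a constraint can be enforced at compile time.

```cpp
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a deferred action; not Seastar's actual class.
template <typename Func>
class deferred_action_sketch {
    // The action runs from a destructor, so it must not throw.
    static_assert(std::is_nothrow_invocable_v<Func&>,
                  "deferred action must be noexcept");
    // The result would be discarded anyway, so require void.
    static_assert(std::is_void_v<std::invoke_result_t<Func&>>,
                  "deferred action must return void");

    Func _func;
    bool _armed = true;

public:
    explicit deferred_action_sketch(Func f) : _func(std::move(f)) {}
    deferred_action_sketch(const deferred_action_sketch&) = delete;
    deferred_action_sketch& operator=(const deferred_action_sketch&) = delete;

    // Runs the action on scope exit unless it was cancelled.
    ~deferred_action_sketch() {
        if (_armed) {
            _func();
        }
    }

    // Disarms the guard so the action is never run.
    void cancel() noexcept { _armed = false; }
};

// Helper that deduces the callable type (relies on C++17 copy elision).
template <typename Func>
deferred_action_sketch<Func> defer_sketch(Func f) {
    return deferred_action_sketch<Func>(std::move(f));
}
```

With this sketch, `defer_sketch([&]() noexcept { cleanup(); })` compiles, while the same lambda without `noexcept` fails the first `static_assert`.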

Test: unit(dev, debug)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-22 21:11:52 +03:00
Benny Halevy
ef8ec54970 commitlog: segment, segment_manager: mark methods noexcept
Prepare for marking deferred_actions noexcept.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-22 21:11:40 +03:00
Benny Halevy
4439e5c132 everywhere: cleanup defer.hh includes
Get rid of unused includes of seastar/util/{defer,closeable}.hh
and add a few that are missing from source files.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-22 21:11:39 +03:00
Calle Wilund
3633c077be commitlog/config: Make hard size enforcement false by default + add config opt
Refs #9053

Flips the default for commitlog disk footprint hard limit enforcement to off
due to observed latency stalls with stress runs. Instead, adds an optional flag
"commitlog_use_hard_size_limit", which can be turned on to enforce the limit.

A tape-and-string fix of sorts until we can properly tweak the balance between
cl & sstable flush rate.

Closes #9195
2021-08-15 15:10:27 +03:00
Asias He
97bb2e47ff storage_service: Enable Repair Based Node Operations (RBNO) by default for replace
We decided to enable repair based node operations by default for replace
node operations.

To do that, a new option --allowed-repair-based-node-ops is added. It
lists the node operations that are allowed to enable repair based node
operations.

The operations can be bootstrap, replace, removenode, decommission and rebuild.

By default, --allowed-repair-based-node-ops is set to contain "replace".

Note, the existing option --enable-repair-based-node-ops is still in
play. It is the global switch to enable or disable the feature.

Examples:

- To enable bootstrap and replace node ops:

```
scylla --enable-repair-based-node-ops true --allowed-repair-based-node-ops replace,bootstrap
```

- To disable any repair based node ops:

```
scylla --enable-repair-based-node-ops false
```

Closes #9197
2021-08-15 13:30:46 +03:00
Piotr Sarna
84876a165b db,schema_tables: add handling user-defined aggregates
Aggregates are propagated, created and dropped very similarly
to user-defined functions - a set of helper functions
for aggregates are added based on the UDF implementation.
2021-08-13 11:14:11 +02:00
Piotr Sarna
58196e8ea6 db,view: avoid ignoring failed future in background view updates
The code for handling background view updates used to propagate
exceptions unconditionally, which leads to "exceptional future
ignored" warnings if the update was put to background.
From now on, the exception is only propagated if its future
is actually waited on.

Fixes #6187

Tested manually, the warning was not observed after the patch

Closes #9179
2021-08-12 17:32:35 +03:00
Nadav Har'El
49ca1f86b2 Merge 'hints: error injection for pausing hint replay' from Piotr Dulikowski
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay logic behaves as if it is run, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.

This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.

The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).

Refs: #6649

Closes #8801

* github.com:scylladb/scylla:
  hints: fix indentation after previous patch
  hints: error injection for pausing hint replay
  hints: coroutinize lambda inside send_one_file
2021-08-11 11:42:29 +03:00
Piotr Dulikowski
f2e1339f38 hints: use an abort_source with sleep_abortable in flush+send loop
Each hint sender runs an asynchronous loop which tries to flush and then
send hints. Between each attempt, it sleeps at most 10 seconds using
sleep_abortable. However, an overload of sleep_abortable is used which
does not take an abort_source - it should abort the sleep in case
Seastar handles a SIGINT or SIGTERM signal. However, in order for that
to work, the application must not prevent default handling of those
signals in Seastar - but Scylla explicitly does it by disabling the
`auto_handle_sigint_sigterm` option in reactor config. As a result,
those sleeps are never aborted, and - because we wait for the async
loops to stop - they can delay shutdown by at most 10 seconds.

To fix that, an abort_source is added to the hints sender, and the
abort_source is triggered when the corresponding sender is requested to
stop.

Fixes: #9176

Closes #9177
2021-08-11 10:32:53 +02:00
Piotr Dulikowski
68cac2eab7 hints: fix indentation after previous patch 2021-08-09 16:16:14 +02:00
Piotr Dulikowski
20cbe7fa2f hints: error injection for pausing hint replay
Adds a `hinted_handoff_pause_hint_replay` error injection point. When
enabled, hint replay logic behaves as if it is run, but it gets stuck in
a loop and no hints are actually sent until the point is disabled again.

This injection point will be useful in dtests - it will simulate
infinitely slow hint replay and will make it possible to test how some
operations behave while hint replay logic is running.

The first intended use case of this injection point is testing the HTTP
API for waiting for hints (#8728).

Refs: #6649
2021-08-09 16:16:14 +02:00
Piotr Dulikowski
29993f7745 hints: coroutinize lambda inside send_one_file
Converts the lambda invoked for every commitlog entry in a hints file
into a coroutine.
2021-08-09 16:16:14 +02:00
Piotr Dulikowski
d41d39bbcd hints: add functions for creating and waiting for sync points
Adds functions for creating per-shard sync points and waiting for
them.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
e18b29765a hints: add hint sync point structure
Adds a sync_point structure. A sync point is a (possibly incomplete)
mapping from hint queues to a replay position in each queue. Users will
be able to create sync points consisting of the last written positions
of some hint queues, and then wait until hint replay in all of those
queues reaches that point.

The sync point supports serialization - first it is serialized with the
help of IDL to a binary form, and then converted to a hexadecimal
string. Deserialization is also possible.
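
The second step of that round trip (binary form to hexadecimal string and back) can be sketched as follows. This is only an illustration: the function names are hypothetical, the IDL step is not shown, and the actual byte layout of a sync point is not assumed here.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Encodes raw bytes as a lowercase hexadecimal string (two digits per byte).
inline std::string bytes_to_hex_sketch(const std::vector<std::uint8_t>& bytes) {
    static constexpr char digits[] = "0123456789abcdef";
    std::string out;
    out.reserve(bytes.size() * 2);
    for (std::uint8_t b : bytes) {
        out.push_back(digits[b >> 4]);
        out.push_back(digits[b & 0x0f]);
    }
    return out;
}

// Decodes a hexadecimal string back to bytes.
// Returns std::nullopt for odd-length input or non-hex characters.
inline std::optional<std::vector<std::uint8_t>>
hex_to_bytes_sketch(const std::string& s) {
    auto nibble = [](char c) -> int {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    };
    if (s.size() % 2 != 0) {
        return std::nullopt;
    }
    std::vector<std::uint8_t> out;
    out.reserve(s.size() / 2);
    for (std::size_t i = 0; i < s.size(); i += 2) {
        const int hi = nibble(s[i]);
        const int lo = nibble(s[i + 1]);
        if (hi < 0 || lo < 0) {
            return std::nullopt;
        }
        out.push_back(static_cast<std::uint8_t>((hi << 4) | lo));
    }
    return out;
}
```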
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
70df9973f3 hints: make it possible to wait until hints are replayed
Adds necessary infrastructure which allows, for a given endpoint
manager, to wait until hints are replayed up to a specified position. An
abort source must be specified which, if triggered, cancels waiting for
hint replay.

If the endpoint manager is stopped, current waiters are dismissed with
an exception.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
93f244426d hints: track the RP of the last replayed position
Keeps track of a position which serves as an upper bound for positions
of already replayed hints - i.e. all hints with replay positions
strictly lower than it are considered replayed.

In order to accurately track this bound during hint replay, a std::map
is introduced which contains positions of hints which are currently
being sent.
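
One way to picture this bookkeeping is the sketch below. It makes several simplifying assumptions that are not taken from the patch: replay positions are plain integers, every started send eventually finishes, and when nothing is in flight the bound is one past the highest position started so far. The class and method names are hypothetical.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>

// Hypothetical simplified replay position; the real type is richer.
using replay_pos = std::uint64_t;

class replay_bound_tracker_sketch {
    // Positions of hints currently being sent, with a count per position.
    std::map<replay_pos, unsigned> _in_flight;
    // One past the highest position for which a send was started.
    replay_pos _next_unstarted = 0;

public:
    // Records that sending the hint at `rp` has begun.
    void on_send_start(replay_pos rp) {
        ++_in_flight[rp];
        _next_unstarted = std::max(_next_unstarted, rp + 1);
    }

    // Records that sending the hint at `rp` has completed.
    void on_send_finish(replay_pos rp) {
        auto it = _in_flight.find(rp);
        if (it != _in_flight.end() && --it->second == 0) {
            _in_flight.erase(it);
        }
    }

    // All hints with positions strictly lower than the returned value
    // are considered replayed. std::map keeps keys ordered, so the
    // smallest in-flight position is at begin().
    replay_pos replayed_bound() const {
        return _in_flight.empty() ? _next_unstarted : _in_flight.begin()->first;
    }
};
```

The ordered map makes the lowest in-flight position cheap to find, which is what limits how far the bound may advance while sends complete out of order.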
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
03e2e671cd hints: track the RP of the last written hint
The position of the last written hint is now tracked by the endpoint
hints manager.

When the manager is constructed and no hints have been replayed yet, the last
written hint position is initialized to the beginning of a fake segment
with ID corresponding to the current number of milliseconds since the
epoch. This choice makes sure that, in case a new hint sync point is
created before any hints are written, the position recorded for that
hint queue will be larger than all replay positions in segments
currently stored on disk.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
27d0d598fd hints: change last_attempted_rp to last_succeeded_rp
Instead of tracking the last position for which hint sending is
attempted, the last successfully replayed position is tracked.

The previous variable was used to calculate the position from which hint
replay should restart in case of an error, in the following way:

    _last_not_complete_rp = ctx_ptr->first_failed_rp.value_or(
        ctx_ptr->last_attempted_rp.value_or(_last_not_complete_rp));

Now, this formula uses the last_succeeded_rp in place of
last_attempted_rp. This change does not have an effect on the choice of
the starting position of the next retry:

- If the hint at `last_attempted_rp` has succeeded, in the new algorithm
  the same position will be recorded in `last_succeeded_rp`, and the
  formula will yield the same result.
- If the hint at `last_attempted_rp` has failed, it will be accounted
  into `first_failed_rp`, so the formula will yield the same result.
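
The fallback chain in the formula can be sketched with `std::optional` as below. The struct, the integer `replay_pos` alias, and the function name are hypothetical simplifications for illustration; only the `value_or` nesting mirrors the formula quoted above.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical simplified replay position.
using replay_pos = std::uint64_t;

// Hypothetical per-attempt context holding the two tracked positions.
struct send_context_sketch {
    std::optional<replay_pos> first_failed_rp;
    std::optional<replay_pos> last_succeeded_rp;
};

// Picks the position from which the next replay attempt starts:
// the first failed position if any hint failed, otherwise the last
// succeeded position, otherwise the previous value unchanged.
inline replay_pos next_restart_position(const send_context_sketch& ctx,
                                        replay_pos last_not_complete_rp) {
    return ctx.first_failed_rp.value_or(
        ctx.last_succeeded_rp.value_or(last_not_complete_rp));
}
```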

The motivation for this change is that in the next commits of this PR we
will start tracking the position of the last replayed hint per hint
queue, and the meaning of the new variable makes it more useful - when
there are no failed hints in the hint sending attempt, last_succeeded_rp
gives us information that hints _up to this position_ were replayed; the
last_attempted_rp variable can only tell us that hints _before that
position_ were replayed successfully.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
08a7d79ffc hints: rearrange error handling logic for hint sending
Instead of calling the `on_hint_send_failure` method inside the hint
sending task in places where an error occurs, we now let the exceptions
be returned and handle them inside a single `then_wrapped` attached to
the hint sending task.

Apart from the `then_wrapped`, there is one more place which calls
`on_hint_send_failure` - in the exception handler for the future which
spawns the asynchronous hint sending task. It needs to be kept separate
because it is a part of a separate task.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
45b04c94e0 hints: sort segments by ID, divide into foreign and local
Endpoint hints manager keeps a commitlog instance which is used to write
hints into new segments. This instance is re-created every 10 seconds,
which causes the previous instance to leave its segments on disk.

On the other hand, hints sender keeps a list of segments to replay which
is updated only after it becomes empty. The list is repopulated with
segments returned by the commitlog::get_segments_to_replay() method
which does not specify the order of the segments returned.

As a preparation for the upcoming hint sync points feature, this commit
changes the order in which segments are replayed:

- First, segments written by other shards are replayed. Such segments
  may appear in the queue because of segment rebalancing which is done
  at startup.
  The purpose of replaying "foreign" segments first is that they are
  problematic for hint sync points. For each hint queue, a hint sync
  point encodes a replay position of the last written hint on the local
  shard. Accounting foreign segments precisely would make the
  implementation more complicated. To make things simpler, waiting for
  sync points will always make sure that all foreign segments are
  replayed. This might sometimes cause more hints to be waited on than
  necessary if a restart occurs in the meantime.
- Segments written by the local shard are replayed later, in order of
  their IDs. This makes sure that local hints are replayed in the order
  they were written to segments, and will make it possible to use replay
  positions to track progress of hint replay.
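
The replay order described above can be sketched as follows. The `segment_sketch` struct and its fields are hypothetical, and keeping foreign segments in their original relative order is an assumption of this sketch, since the message does not state how foreign segments are ordered among themselves.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical segment descriptor: which shard wrote it, and its ID.
struct segment_sketch {
    unsigned shard;
    std::uint64_t id;
};

// Returns segments in replay order: foreign segments first, then
// segments written by `local_shard` sorted by ascending ID.
inline std::vector<segment_sketch>
order_for_replay_sketch(std::vector<segment_sketch> segments, unsigned local_shard) {
    // Move foreign segments to the front, preserving relative order.
    auto first_local = std::stable_partition(
        segments.begin(), segments.end(),
        [local_shard](const segment_sketch& s) { return s.shard != local_shard; });
    // Sort only the local segments by ID, i.e. in the order they were written.
    std::sort(first_local, segments.end(),
              [](const segment_sketch& a, const segment_sketch& b) {
                  return a.id < b.id;
              });
    return segments;
}
```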
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
f83699bb7c Revert "db/hints: allow to forcefully update segment list on flush"
This reverts commit e48739a6da.

This commit removes the functionality from endpoint hints manager which
made it possible to flush hints immediately and forcefully update the
list of segments to replay.

The new implementation of waiting for hints will be based on replay
positions returned by the commitlog API and it won't be necessary to
forcefully update the segment list when creating a sync point.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
9c1d4e7e6c Revert "db/hints: add a metric for counting processed files"
This reverts commit 5a49fe74bb.

This commit removes a metric which tracks how many segments were
replayed during current runtime. It was necessary for current "wait for
hints" mechanism which is being replaced with a different one -
therefore we can remove the metric.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
3b851a5ebd Revert "db/hints: make it possible to wait until current hints are sent"
This reverts commit 427bbf6d86.

This commit removes the infrastructure which allows waiting until
current hints are replayed in a given hint queue.

It will be replaced with a different mechanism in later commits.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
4a35d138f6 Revert "storage_proxy: add functions for syncing with hints queue"
This reverts commit 244738b0d5.

This commit removes create_hint_queue_sync_point and
check_hint_queue_sync_point functions from storage_proxy, which were
used to wait until local hints are sent out to particular nodes.

Similar methods will be reintroduced later in this PR, with a completely
different implementation.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
0d74dee683 Revert "messaging_service: add verbs for hint sync points"
This reverts commit 82c419870a.

This commit removes the HINT_SYNC_POINT_CREATE and HINT_SYNC_POINT_CHECK
rpc verbs.

The upcoming HTTP API for waiting for hint replay will be restricted
to waiting for hints on the node handling the request, so there is no
need for new verbs.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
ff453d80ff Revert "config: add wait_for_hint_replay_before_repair option"
This reverts commit 86d831b319.

This commit removes the wait_for_hints_before_repair option. Because a
previous commit in this series removes the logic from repair which
caused it to wait for hints to be replayed, this option is now useless.

We can safely remove this option because it is not present in any
release yet.
2021-08-09 09:24:36 +02:00
Piotr Dulikowski
e3c32c897a Revert "hints: dismiss segment waiters when hint queue can't send"
This reverts commit 9d68824327.

First, we are reverting existing infrastructure for waiting for hints in
order to replace it with a different one, therefore this commit needs to
be reverted as well.

Second, errors during hint replay can occur naturally and don't
necessarily indicate that no progress can be made - for example, the
target node is heavily loaded and some hints time out. The "waiting for
hints" operation becomes a user-issued command, so it's not as vital to
ensure liveness.
2021-08-09 09:06:23 +02:00
Avi Kivity
3b5e312800 db: schema_tables: clean up read_schema_partition_for_keyspace() coroutine captures
read_schema_partition_for_keyspace() copies some parameters to capture them
in a coroutine, but the same can be achieved more cleanly by changing the
reference parameters to value parameters, so do that.

Test: unit (dev)

Closes #9154
2021-08-08 12:55:10 +03:00
Asias He
6350a19f73 compaction: Move compaction_strategy.hh to compaction dir
The top dir is a mess. Move compaction_strategy.hh and
compaction_strategy_type.hh to the new home.
2021-08-07 08:06:37 +08:00
Avi Kivity
885ca2158e db: schema_tables: reindent
Following conversion to coroutines in fc91e90c59, remove extra
indents and braces left to make the change clearer.

One variable had to be renamed since without the braces it
duplicated another variable in the same block.

Test: unit (dev)

Closes #9125
2021-08-02 22:36:57 +02:00
Nadav Har'El
fc91e90c59 Merge 'db: schema_tables: coroutinize' from Avi Kivity
schema_tables is quite hairy, but can be easily simplified with coroutines.

In addition to switching future-returning functions to coroutines, we also
switch Seastar threads to coroutines. This is less of a clear-cut win; the
motivation is to reduce the chances of someone calling a function that
expects to run in a thread from a non-thread context. This sometimes works
by accident, but when it doesn't, it's pretty bad. So a uniform calling convention
has some benefit.

I left the extra indents in, since the indent-fixing patch is hard to rebase in case
a rebase is needed. I will follow up with an indent fix post merge.

Test: unit (dev, debug, release)

Closes #9118

* github.com:scylladb/scylla:
  db: schema_tables: drop now redundant #includes
  db: schema_tables: coroutinize drop_column_mapping()
  db: schema_tables: coroutinize column_mapping_exists()
  db: schema_tables: coroutinize get_column_mapping()
  db: schema_tables: coroutinize read_table_mutations()
  db: schema_tables: coroutinize create_views_from_schema_partition()
  db: schema_tables: coroutinize create_views_from_table_row()
  db: schema_tables: unpeel lw_shared_ptr in create_Tables_from_tables_partition()
  db: schema_tables: coroutinize create_tables_from_tables_partition()
  db: schema_tables: coroutinize create_table_from_name()
  db: schema_tables: coroutinize read_table_mutations()
  db: schema_tables: coroutinize merge_keyspaces()
  db: schema_tables: coroutinize do_merge_schema()
  db: schema_tables: futurize and coroutinize merge_functions()
  db: schema_tables: futurize and coroutinize user_types_to_drop::drop
  db: schema_tables: futurize and coroutinize merge_types()
  db: schema_tables: futurize and coroutinize merge_tables_and_views()
  db: schema_tables: coroutinize store_column_mapping()
  db: schema_tables: futurize and coroutinize read_tables_for_keyspaces()
  db: schema_tables: coroutinize read_table_names_of_keyspace()
  db: schema_tables: coroutinize recalculate_schema_version()
  db: schema_tables: coroutinize merge_schema()
  db: schema_tables: introduce and use with_merge_lock()
  db: schema_tables: coroutinize update_schema_version_and_announce()
  db: schema_tables: coroutinize read_keyspace_mutation()
  db: schema_tables: coroutinize read_schema_partition_for_table()
  db: schema_tables: coroutinize read_schema_partition_for_keyspace()
  db: schema_tables: coroutinize query_partition_mutation()
  db: schema_tables: coroutinize read_schema_for_keyspaces()
  db: schema_tables: coroutinize convert_schema_to_mutations()
  db: schema_tables: coroutinize calculate_schema_digest()
  db: schema_tables: coroutinize save_system_schema()
2021-08-02 13:43:53 +03:00
Tomasz Grabiec
c3ada1a145 Merge "count row (sstables/row cache/memtables) and range (memtables) tombstone reads" from Michael
Fixes #7749.
2021-08-01 23:13:18 +02:00
Avi Kivity
ca59754e68 db: schema_tables: drop now redundant #includes 2021-08-01 20:13:15 +03:00
Avi Kivity
40fdbf9558 db: schema_tables: coroutinize drop_column_mapping() 2021-08-01 20:13:15 +03:00
Avi Kivity
7d46300af2 db: schema_tables: coroutinize column_mapping_exists() 2021-08-01 20:13:15 +03:00
Avi Kivity
74b2200f4d db: schema_tables: coroutinize get_column_mapping() 2021-08-01 20:13:15 +03:00
Avi Kivity
f19ca7aaaa db: schema_tables: coroutinize read_table_mutations() 2021-08-01 20:13:15 +03:00
Avi Kivity
81a2be17b6 db: schema_tables: coroutinize create_views_from_schema_partition() 2021-08-01 20:13:15 +03:00
Avi Kivity
15f2fd2a23 db: schema_tables: coroutinize create_views_from_table_row() 2021-08-01 20:13:15 +03:00
Avi Kivity
0843d441ff db: schema_tables: unpeel lw_shared_ptr in create_Tables_from_tables_partition()
The tables local is a lw_shared_ptr which is created and then dereferenced
before returning. It can be unpeeled to the pointed-to type, resulting in
one less allocation.
2021-08-01 20:13:15 +03:00
Avi Kivity
66054d24c4 db: schema_tables: coroutinize create_tables_from_tables_partition() 2021-08-01 20:13:15 +03:00
Avi Kivity
82ba3c5f4a db: schema_tables: coroutinize create_table_from_name() 2021-08-01 20:13:15 +03:00
Avi Kivity
862f491605 db: schema_tables: coroutinize read_table_mutations() 2021-08-01 20:13:15 +03:00
Avi Kivity
91c1a29808 db: schema_tables: coroutinize merge_keyspaces() 2021-08-01 20:13:15 +03:00
Avi Kivity
78fc05922b db: schema_tables: coroutinize do_merge_schema()
It is now using an internal thread, so unpeel it and replace
future::get() with co_await.
2021-08-01 20:13:15 +03:00
Avi Kivity
9680d9e76c db: schema_tables: futurize and coroutinize merge_functions()
Right now, merge_functions() expects to be called in a thread.
Remove that requirement by converting it into a coroutine and returning
a future.

De-threading helps reduce errors where something expects to be called
in a thread, but isn't.
2021-08-01 20:13:15 +03:00