One of the following cases is true:
1. RAFT local feature is disabled. Then we don't do anything related to
group 0.
2. RAFT local feature is enabled and when we bootstrapped, we joined
group 0. Then `raft_group0::_group0` variant holds the
`raft::group_id` alternative.
3. RAFT local feature is enabled and when we bootstrapped we didn't join
group 0. This means the RAFT local feature was disabled when we
bootstrapped and we're in the (not yet implemented) upgrade scenario.
`raft_group0::_group0` variant holds the `std::monostate` alternative.
The problem with the previous implementation was that it checked for the
conditions of the third case above - that the RAFT local feature is
enabled but `_group0` does not hold `raft::group_id` - and if those
conditions held, it executed logic that didn't really make sense: it ran
the discovery algorithm and called the `send_group0_modify_config` RPC.
In this rewrite I state some assumptions that `leave_group0` makes:
- we've finished the startup procedure;
- we're run during decommission, after the node entered LEFT status.
In the new implementation, if `_group0` does not hold `raft::group_id`
(checked by the internal `joined_group0()` helper), we simply return.
This is the yet-unimplemented upgrade case left for a follow-up PR.
Otherwise we fetch our Raft server ID (at this point it must be present
- otherwise it's a fatal error) and simply call `modify_config` from the
`raft::server` API.
Remove unnecessary call to `_shutdown_gate.hold()` (this is not a
background task).
`leave_group0` was responsible for both removing a different node from
group 0 and removing ourselves from (i.e. leaving) group 0. The two
scenarios are a bit different and the handling will be rewritten in
following commits.
Split `leave_group0` into two functions. Remove the incorrect comment
about idempotency: saying that the procedure is idempotent is an
oversimplification, and one could argue it's incorrect, since a second
call simply hangs, at least in the case of leaving group 0; following
commits will describe what happens more precisely.
Add some additional logging and assertions where the two functions are
called in `storage_service`.
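A rough sketch of the resulting split (the second function's name is an
assumption based on the description):
```cpp
// Leaving group 0 ourselves (run during decommission, after LEFT status):
future<> raft_group0::leave_group0();
// Removing a *different* node from group 0 (name assumed for illustration):
future<> raft_group0::remove_from_group0(raft::server_id node);
```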
`join_group0()` contains all the logic for deciding whether to join group 0.
Prepare for the case where we don't want to join group 0 immediately on
startup - the upgrade scenario (will be implemented in a follow-up).
Move the group 0 setup step earlier in `storage_service::join_cluster`.
`join_group0()` is now a private member of `raft_group0`. Some more
comments were written.
Compared to `load_or_create_my_addr` this function assumes that
the address is already present on disk; if not, it's a fatal error.
Use it in places where it would indeed be a fatal error
if the address was missing.
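A hedged sketch of the stricter loader (all names here are hypothetical;
only the fatal-error-if-missing behavior comes from the description):
```cpp
// Unlike load_or_create_my_addr, this never creates the address.
future<gms::inet_address> load_my_addr() {
    auto addr = co_await load_my_addr_from_disk(); // assumed loader returning an optional
    if (!addr) {
        on_internal_error(logger, "my address missing from disk"); // fatal
    }
    co_return *addr;
}
```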
There are some calls to `modify_config` which should react to aborts
(e.g. when we shut down Scylla).
There are also calls to `send_group0_modify_config` which should
probably also react to aborts, but the functions don't take
an abort_source parameter. This is fixable but I left TODOs for now.
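For illustration, the shape the TODO points at might look like this (the
parameter list is hypothetical; the point is only that an `abort_source`
needs to be plumbed through):
```cpp
// TODO (from the description): accept an abort_source so the RPC can be
// cancelled, e.g. when Scylla shuts down.
future<> send_group0_modify_config(raft::group_id gid,
                                   std::vector<raft::config_member> add,
                                   std::vector<raft::server_id> del,
                                   seastar::abort_source& as); // new parameter
```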
The function no longer accesses the `_group0` variant directly, instead
it is made a member of `service::persistent_discovery`; the caller
guarantees that `persistent_discovery` is not destroyed before the
function finishes.
The function is now named `run`. A short comment was written at the
declaration site.
Make some members of `persistent_discovery` private, as they are only
used by `run`.
Simplify `struct tracker`, store the discovery output separately
(`struct tracker` is now responsible for a single thing).
Enclose the `parallel_for_each` over requests in a common coroutine
which keeps alive all the necessary things for the loop body and
performs the last step which was previously inside a `then`.
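The pattern described above, as a generic sketch (names are illustrative,
not the actual Scylla code):
```cpp
// One coroutine frame owns everything the loop body needs, so nothing is
// destroyed while requests are still in flight.
future<> persistent_discovery::run_round(std::vector<request> requests) {
    co_await seastar::parallel_for_each(requests, [this] (request& req) -> future<> {
        auto resp = co_await send_request(req); // assumed RPC helper
        handle_response(std::move(resp));       // assumed response handler
    });
    co_await finish_round(); // the step previously hidden inside a `.then(...)`
}
```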
The set of seeds passed to the discovery algorithm may contain `self`.
The implementation will filter the `self` out (it calls `step(seeds)`;
`step` iterates over the given list of peers and ignores `_self`).
Specify this at the `discovery` constructor declaration site.
Simplify the code constructing `persistent_discovery` in
`raft_group0::discover_group0` using this assumption.
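The filtering behavior, sketched (the real logic lives in the discovery
implementation; this only illustrates the contract stated above):
```cpp
void discovery::step(const std::vector<raft::server_address>& peers) {
    for (const auto& peer : peers) {
        if (peer == _self) {
            continue; // `self` appearing in the seed list is simply ignored
        }
        consider_peer(peer); // assumed helper
    }
}
```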
query_result was the wrong place to put last position into. It is only
included in data-responses, but not on digest-responses. If we want to
support empty pages from replicas, both data and digest responses have
to include the last position. So hoist up the last position to the
parent structure: query::result. This is a breaking change inter-node
ABI wise, but it is fine: the current code wasn't released yet.
Closes #11072
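A rough sketch of the hoisting (field and type names are assumed; only
the data/digest split and the hoisted last position come from the
description):
```cpp
namespace query {
struct result {
    // Hoisted up: present for BOTH data and digest responses, so replicas
    // can return empty pages in either case.
    std::optional<position> last_position;
    std::optional<result_data> data;     // data responses only
    std::optional<result_digest> digest; // digest responses only
};
}
```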
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many contexts where `can_vote` is
irrelevant.
Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.
Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. Replace the constructor with helper
functions which specify in comments that they are supposed to be used in
tests or in contexts where `info` doesn't matter (e.g. when checking
presence in an `unordered_set`, where the equality operator and hash
operate only on the `id`).
Closes #11047
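The split, roughly (simplified to mirror the description):
```cpp
namespace raft {
struct server_address {
    server_id id;
    server_info info; // as before `can_vote` was introduced
};
struct config_member {
    server_address addr;
    bool can_vote; // configuration-level property, not part of the address
};
}
```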
* github.com:scylladb/scylla:
raft: fsm: fix `entry_size` calculation for config entries
raft: split `can_vote` field from `server_address` to separate struct
serializer_impl: generalize (de)serialization of `unordered_set`
to_string: generalize `operator<<` for `unordered_set`
The node operations using node_ops_cmd have the following procedure:
1) Send node_ops_cmd::replace_prepare to all nodes
2) Send node_ops_cmd::replace_heartbeat to all nodes
In a large cluster 1) might take a long time to finish; as a result, when
the node starts to perform 2), the 30s heartbeat timer on the peer nodes
might have already timed out. This fails the whole node operation.
We have patches to make 1) more efficient and faster.
https://github.com/scylladb/scylla/pull/10850
https://github.com/scylladb/scylla/pull/10822
In addition to that, this patch increases the heartbeat timeout to
reduce false-positive timeouts.
Refs #10337
Refs #11078
Closes #11081
Currently, all the mutations this test generates are applied on shard 0.
In rare cases, this may lead to the following crash, when the flushed
sstable doesn't contain any key that belongs to the current shard,
as seen in https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1390/artifact/testlog/x86_64/dev/database_test.test_truncate_without_snapshot_during_writes.114.log
```
WARN 2022-07-17 17:41:36,630 [shard 0] sstable - create_sharding_metadata: range=[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}] has no intersection with shard=0 first_key={key: pk{00046b657930}, token:-468459073612751032} last_key={key: pk{00046b657930}, token:-468459073612751032} ranges_single_shard=[] ranges_all_shards={{1, {[{-468459073612751032, pk{00046b657930}}, {-468459073612751032, pk{00046b657930}}]}}}
ERROR 2022-07-17 17:41:36,630 [shard 0] table - failed to write sstable /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db)
ERROR 2022-07-17 17:41:36,631 [shard 0] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /jenkins/workspace/releng/Scylla-CI/scylla/testlog/x86_64/dev/scylla-e2b694c7-db4f-4f9d-9940-9c6c21850888/ks/cf-8f74aba005de11ed92fa8661a0ed7890/me-2-big-Data.db). Aborting, at 0x329e28e 0x329e780 0x329ea88 0xf5bc69 0xf956b1 0x3196dc4 0x3198037 0x319742a 0x32be2e4 0x32bd8e1 0x32ba01c 0x317f97d /lib64/libpthread.so.0+0x92a4 /lib64/libc.so.6+0x100322
```
Instead, generate random keys and apply them on their
owning shard, and truncate all database shards.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes #11066
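A sketch of the fix under stated assumptions (`apply_on_owning_shard` is
an illustrative helper, not the test's actual code; the sharder and apply
calls approximate the real APIs):
```cpp
future<> apply_on_owning_shard(seastar::sharded<database>& db, schema_ptr s, mutation m) {
    // Route the mutation to the shard that owns its token instead of shard 0.
    auto shard = s->get_sharder().shard_of(m.token());
    co_await db.invoke_on(shard, [s, fm = freeze(m)] (database& local_db) {
        return local_db.apply(s, fm); // assumed per-shard apply
    });
}
```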
* github.com:scylladb/scylla:
database_test: test_truncate_without_snapshot_during_writes: apply mutation on the correct shard
table: try_flush_memtable_to_sstable: consume: close reader on error
This PR extends #9209. It consists of 2 main points:
- To enable parallelization of user-defined aggregates, a reduction
function was added to the UDA definition. The reduction function is
optional; it must be a scalar function that takes two arguments of the
UDA's state type and returns the UDA's state type.
- All currently implemented native aggregates got reducible counterparts,
which return their state as the final result, so it can be reduced with
other results. Hence all native aggregates can now be distributed.
Tested on a local 3-node cluster made with current master, with `node1`
updated to this branch, accessing nodes with `ccm <node-name> cqlsh`.
I've tested the following from both the old and new nodes:
- creating UDA with reduce function - not allowed
- selecting count(*) - distributed
- selecting other aggregate function - not distributed
Fixes: #10224
Closes #10295
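Conceptually, the reduction works like the fold below (a self-contained
C++ sketch, not Scylla code): each node computes a partial aggregate
state, and the coordinator combines the states pairwise with the
reduction function before the UDA's final function runs.
```cpp
#include <functional>
#include <vector>

template <typename State>
State reduce_partial_states(const std::vector<State>& per_node_states,
                            std::function<State(State, State)> reducefunc,
                            State init) {
    State acc = std::move(init);
    for (const auto& s : per_node_states) {
        acc = reducefunc(std::move(acc), s); // reduction: (state, state) -> state
    }
    return acc; // the UDA's final function is applied to this combined state
}
```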
* github.com:scylladb/scylla:
test: add tests for parallelized aggregates
test: cql3: Add UDA REDUCEFUNC test
forward_service: enable multiple selection
forward_service: support UDA and native aggregate parallelization
cql3:functions: Add cql3::functions::functions::mock_get()
cql3: selection: detect parallelize reduction type
db,cql3: Move part of cql3's function into db
selection: detect if selectors factory contains only simple selectors
cql3: reducible aggregates
DB: Add `scylla_aggregates` system table
db,gms: Add SCYLLA_AGGREGATES schema features
CQL3: Add reduce function to UDA
gms: add UDA_NATIVE_PARALLELIZED_AGGREGATION feature
"
Same thing was done for the compaction class some time ago; now it's
time for streaming, to keep repair-generated IO in bounds.
This set mostly resembles the one for the compaction IO class, with the
exception that boot-time reshard/reshape currently runs in the streaming
class. That's not great if the class is throttled, so the set also moves
boot-time IO into the default IO class.
"
* 'br-streaming-class-throttling-2' of https://github.com/xemul/scylla:
distributed_loader: Populate keyspaces in default class
streaming: Maintain class bandwidth
streaming: Pass db::config& to manager constructor
config: Add stream_io_throughput_mb_per_sec option
sstables: Keep priority class on sstable_directory
Currently, all the mutations this test generates are applied on shard 0.
In rare cases, this may lead to the crash shown in the log excerpt above
(from https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1390/artifact/testlog/x86_64/dev/database_test.test_truncate_without_snapshot_during_writes.114.log),
when the flushed sstable doesn't contain any key that belongs to the
current shard.
Instead, generate random keys and apply them on their
owning shard, and truncate all database shards.
Fixes #11076
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
If an exception is thrown in `consume` before
write_memtable_to_sstable is called, or if the latter fails,
we must close the reader passed to it.
Fixes #11075
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch makes memtable_flush_static_shares live-updatable
to avoid having to restart the cluster after updating
this config option.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch makes compaction_static_shares live-updatable
to avoid having to restart the cluster after updating
this config option.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
This patch adds the _static_shares variable to the backlog_controller so
that, instead of having to use a separate constructor when the controller
is disabled, we can use a single constructor and periodically check in
the adjust method whether we should use the static shares or the
controller. This will be useful in the next patches to make
compaction_static_shares and memtable_flush_static_shares live-updatable.
Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>
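A sketch of the single-constructor pattern described above (member and
helper names assumed):
```cpp
void backlog_controller::adjust() {
    if (_static_shares) {
        // Controller disabled: apply the (now live-updatable) static value.
        update_controller(*_static_shares);
        return;
    }
    update_controller(shares_for_backlog(backlog())); // normal controller path
}
```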
Currently, we use the following naming convention for the relocatable
package filename:
${package_name}-${arch}-package-${version}.${release}.tar.gz
But this is very different from standard Linux packaging systems such as
.rpm and .deb.
Let's align the convention with the .rpm style, so the new convention is:
${package_name}-${version}-${release}.${arch}.tar.gz
Closes #9799
Closes #10891
* tools/java de8289690e...d0143b447c (1):
> build_reloc.sh: rename relocatable packages
* tools/jmx fe351e8...06f2735 (1):
> build_reloc.sh: rename relocatable packages
* tools/python3 e48dcc2...bf6e892 (1):
> reloc/build_reloc.sh: rename relocatable packages
The streaming class throughput can be limited with the respective option.
Boot-time reshard/reshape doesn't need to obey it: the node is not yet
up, and should instead get there as soon as possible.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The stream_manager will bookkeep the streaming bandwidth option; to
subscribe to its changes it needs the config reference. It would be
better if it was a stream_manager::config, but currently subscribing to
db::config::<stuff> updates is not very shard-friendly, so we need to
carry the config reference itself around.
There is similar trouble with compaction_manager. There the option is
passed through its own config, but that config is created on each shard
by database code, while the stream_manager config would be created once
by main code on shard 0.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
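A sketch of the subscription (assuming the usual observe-style API on
config values; member and helper names are illustrative):
```cpp
stream_manager::stream_manager(db::config& cfg)
        : _config(cfg)
        , _throughput_observer(cfg.stream_io_throughput_mb_per_sec.observe(
              [this] (const uint32_t& mbs) {
                  update_streaming_bandwidth(mbs); // assumed helper applying the new limit
              })) {
}
```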
It's going to control the bandwidth for the streaming priority class.
For now the option is just added; it doesn't do anything for real yet.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Current code accepts the priority class as an argument to various functions
that need it, and all its callers use the streaming class. Next patches will
sometimes need to use the default class, but that would require heavy
patching of the distributed loader. Things get simpler if the priority
class is kept on sstable_directory from the start.
This change also simplifies the ongoing effort on unification of sched
and IO classes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
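Sketched, the directory simply captures the class at construction (member
name assumed):
```cpp
class sstable_directory {
    const io_priority_class& _io_prio; // streaming class normally; default class at boot
public:
    sstable_directory(/* ... */ const io_priority_class& prio)
        : _io_prio(prio) {}
    // reshard/reshape jobs later read _io_prio instead of taking it as an argument
};
```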
When collecting a histogram of smp-queue population, empty queues also
count, but that makes the output very long and not very informative.
Skipping empty queues increases the signal-to-noise ratio.
v2:
- print the number of omitted empty queues
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20220718180912.2931-1-xemul@scylladb.com>
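The reporting change, sketched with an assumed `queue_stats` record:
```cpp
#include <cstdint>
#include <vector>
#include <fmt/core.h>

struct queue_stats { unsigned id; uint64_t population; };

void print_smp_queue_histogram(const std::vector<queue_stats>& queues) {
    size_t omitted = 0;
    for (const auto& q : queues) {
        if (q.population == 0) {
            ++omitted; // skip empty queues, but remember how many (v2)
            continue;
        }
        fmt::print("queue {}: {}\n", q.id, q.population);
    }
    fmt::print("({} empty queues omitted)\n", omitted);
}
```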
We want to print the backwards fiber in reverse, starting with the
furthest-away task in the chain. For this, the task list returned by
`_walk()` has to be reversed.
Closes #11062
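The essence of the fix (illustrative; `_walk()` and the printing belong
to the surrounding debugging code):
```cpp
auto tasks = _walk();                     // ordered nearest -> furthest
std::reverse(tasks.begin(), tasks.end()); // needs <algorithm>; print furthest-away task first
for (const auto& t : tasks) {
    print_task(t);
}
```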
We forgot about `can_vote`.
Stumbled on this while splitting `can_vote` into a separate struct.
Note that `entry_size` is still inaccurate (#11068) but the patch is an
improvement.
Refs: #11068
Whether a server can vote in a Raft configuration is not part of the
address. `server_address` was used in many contexts where `can_vote` is
irrelevant.
Split the struct: `server_address` now contains only `id` and
`server_info` as it did before `can_vote` was introduced. Instead we
have a `config_member` struct that contains a `server_address` and the
`can_vote` field.
Also remove an "unsafe" constructor from `server_address` where `id` was
provided but `server_info` was not. The constructor was used for tests
where `server_info` is irrelevant, but it's important not to forget
about the info in production code. The constructor was used for two
purposes:
- Invoking set operations such as `contains`. To solve this we use C++20
transparent hash and comparator functions, which allow invoking
`contains` and similar functions by providing a different key type (in
this case `raft::server_id` in a set of addresses, for example).
- Constructing addresses without `info`s in tests. For this we provide
helper functions in the test helpers module and use them.
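A self-contained illustration of the C++20 heterogeneous-lookup technique
mentioned above, with simplified stand-ins for the raft types:
```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_set>

struct server_id { uint64_t v; };
struct server_address { server_id id; std::string info; };

struct addr_hash {
    using is_transparent = void; // opt in to heterogeneous lookup
    size_t operator()(const server_address& a) const { return std::hash<uint64_t>{}(a.id.v); }
    size_t operator()(const server_id& i) const { return std::hash<uint64_t>{}(i.v); }
};
struct addr_eq {
    using is_transparent = void;
    bool operator()(const server_address& a, const server_address& b) const { return a.id.v == b.id.v; }
    bool operator()(const server_address& a, const server_id& b) const { return a.id.v == b.v; }
    bool operator()(const server_id& a, const server_address& b) const { return a.v == b.id.v; }
};

int main() {
    std::unordered_set<server_address, addr_hash, addr_eq> addrs;
    addrs.insert(server_address{{1}, "host info"});
    // No dummy `info` needed to check membership:
    return addrs.contains(server_id{1}) ? 0 : 1;
}
```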
Enables parallelization of queries like `SELECT MIN(x), MAX(x)`.
Compatibility is ensured under the same cluster feature as UDA and
native aggregate parallelization (UDA_NATIVE_PARALLELIZED_AGGREGATION).
Enables parallelization of UDA and native aggregates. The way the
query is parallelized is the same as in #9209. A separate reduction
type for `COUNT(*)` is left for compatibility reasons.
`mock_get` was created only for forward_service use, thus it only checks
for aggregate functions if no declared function was found.
The reason for this function is that there is no serialization of
`cql3::selection::selection`, so the functions lying underneath these
selections have to be re-found.
Most of this code is copied from `functions::get()`; however,
`functions::get()` is not used because it would require mocking or
serializing expressions, and `functions::find()` is not enough because
it does not search for dynamic aggregate functions.
Moving `function`, `function_name` and `aggregate_function` into the
db namespace to avoid including the cql3 namespace in query-request.
For now, only a minimal subset of cql3 functions was moved to db.
Because `selection` is not serializable and it has to be sent over the
network to parallelize a query, we have to mock the selection. To
simplify the mocking, for now only single selectors for the aggregate's
arguments are allowed (no casting or other functions as arguments).