The lifecycle of the reader lifecycle policy and all the resources the
reads use is now enclosed in that of the multishard reader thanks to its
close() method. We can now remove all the workarounds we had in place to
keep different resources alive until background reader cleanup finishes.
So that when this method returns the semaphore is safe to destroy. This
in turn will enable us to get rid of all the machinery we have in place
to deal with the semaphore having to out-live the lifecycle policy
without a clear time as to when it can be safe to destroy.
The test reader lifecycle policy has a mode in which it wants to ensure
all inactive readers are evicted, so tests can stress reader recreation
logic. For this it currently employs a trick of creating a waiter on the
semaphore. I don't even know how this even works (or if it even does)
but it sure complicates the lifecycle policy code a lot.
So switch to the much more reliable and simple method of creating the
semaphore with a single count and no memory. This ensures that all
inactive reads are immediately evicted, while still allowing a single read
to be admitted at all times.
Currently shard_reader::close() (its caller) goes to the remote shard,
copies back all fragments left there to the local shard, then calls
`destroy_reader()`, which in the case of the multishard mutation query
copies it all back to the native shard. This was required before because
`shard_reader::stop()` (`close()`'s predecessor) couldn't wait on
`smp::submit_to()`. But close can, so we can get rid of all this
back-and-forth and just call `destroy_reader()` on the shard the reader
lives on, just like we do with `create_reader()`.
The shard reader is now able to wait on the stopped reader and pass the
already stopped reader to `destroy_reader()`, so we can de-futurize the
reader parameter of said method. The shard reader was already patched to
pass a ready future so adjusting the call-site is trivial.
The most prominent implementation, the multishard mutation query, can
now also drop its `_dismantling_gate` which was put in place so it can
wait on the background stopping of readers.
A consequence of this move is that errors that might happen during the
stopping of the reader are now handled in the shard reader, not in every
lifecycle policy implementation.
We now have a future<> returning close() method so we don't need to
do the cleanup of the remote reader in the background, detaching it
from the shard-reader under destruction. We can now wait for the
cleanup properly before the shard reader is destroyed and just pass the
stopped reader to reader_lifecycle_policy::destroy_reader(). This patch
does the first part -- moving the cleanup to the foreground, the API
change of said method will come in the next patch.
Currently the foreign fields of the reader meta are destroyed in the
background via the foreign pointer's destructor (with one exception).
This makes the already complicated life-cycle of these parts and their
dependencies even harder to reason about, especially in tests, where
even things like semaphores live only within the test.
This patch makes sure to destroy all these remote fields in the
foreground in either `save_reader()` or `stop()`, ensuring that once
`stop()` returns, everything is cleaned up.
The reason we got away without closing _reader so far is that it is an
`std::unique_ptr<evictable_reader>` which is a
`flat_mutation_reader::impl` instance, without the
`flat_mutation_reader` wrapper, which contains the validations for
close.
Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.
However, to closely simulate a production environment, we want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.
We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).
To support these use cases we first generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.
We use this new functionality to tick Raft servers less often than the
network in basic_test.
This patchset effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. This approach is unfortunately incompatible with the
approach taken there. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.
The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.
The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.
* kbr/tick-network-often-v4:
raft: randomized_nemesis_test: generalize `ticker` to take a set of functions
raft: randomized_nemesis_test: split `environment::tick` into two functions
raft: randomized_nemesis_test: fix potential use-after-free in basic_test
... with associated calling periods and use the new API in `basic_test`.
Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.
However, to closely simulate a production environment, we may want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.
We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).
To support these use cases we generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.
We also modify `basic_test` to use this new approach: we tick Raft
servers once per 10 network ticks (in particular, once per 10 reactor
yields).
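The period semantics described above can be sketched as follows. This is a simplified, synchronous stand-in and not the actual `ticker` from the test: `periodic_ticker`, `tick_counts` and `run_ticks` are hypothetical names, and the real ticker yields to the Seastar reactor between ticks.

```cpp
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical synchronous stand-in for the test's `ticker`.
class periodic_ticker {
public:
    // {n, f}: `f` is called on every `n`th tick.
    using entry = std::pair<std::size_t, std::function<void()>>;

    explicit periodic_ticker(std::vector<entry> fns) : _fns(std::move(fns)) {}

    // Advance by one tick and call every `f` whose period divides the tick number.
    void tick() {
        ++_tick;
        for (auto& e : _fns) {
            if (e.first != 0 && _tick % e.first == 0) {
                e.second();
            }
        }
    }

private:
    std::vector<entry> _fns;
    std::size_t _tick = 0;
};

struct tick_counts {
    std::size_t network = 0;
    std::size_t raft = 0;
};

// Tick the network on every tick and the Raft servers on every 10th tick.
inline tick_counts run_ticks(std::size_t ticks) {
    tick_counts counts;
    std::vector<periodic_ticker::entry> fns;
    fns.emplace_back(1, [&counts] { ++counts.network; });
    fns.emplace_back(10, [&counts] { ++counts.raft; });
    periodic_ticker t(std::move(fns));
    for (std::size_t i = 0; i < ticks; ++i) {
        t.tick();
    }
    return counts;
}
```

With these periods, 100 ticks produce 100 network ticks and 10 Raft server ticks.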
This commit effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. This approach is unfortunately incompatible with the
approach taken there. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.
The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.
The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.
With this change we must also wait a bit longer for the first node to
elect itself as a leader at the beginning of the test.
The test starts by waiting a certain number of ticks for the first node
to elect itself as a leader.
If this wait times out - i.e. the number of ticks passes before the node
manages to elect itself - the future associated with the task which checks
for the leader condition becomes discarded (it is passed to
`with_timeout`) and the task may keep using the `environment` (which it
has a reference to) even after the `environment` is destroyed.
Furthermore, the aforementioned task is a coroutine which uses lambda
captures in its body. Leaving `with_timeout` destroys the lambda object,
causing the coroutine to refer to no-longer-existing captures.
We fix the problems by:
- making `environment` `weakly_referencable` and checking if it's alive
before it's used inside the task,
- not capturing anything in the lambda but passing whatever's needed as
function arguments (so these things get allocated inside the coroutine
frame).
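The two fixes can be illustrated with a sketch that uses only the standard library. It assumes `std::weak_ptr` as a stand-in for Seastar's weak references, and `environment::leader_id` and `poll_leader` are hypothetical; the real task is a coroutine, which this plain function only imitates.

```cpp
#include <memory>
#include <optional>

// Hypothetical stand-in for the test's `environment`.
struct environment {
    int leader_id = 1;
};

// The task takes everything it needs as arguments (nothing is captured by a
// lambda) and checks that the environment is still alive before using it.
std::optional<int> poll_leader(std::weak_ptr<environment> env) {
    if (auto alive = env.lock()) {
        return alive->leader_id;
    }
    return std::nullopt; // the environment is gone, so stop instead of touching it
}
```

Because the weak reference is a function argument, it would live in the coroutine frame rather than in a lambda object that may be destroyed when `with_timeout` returns.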
Merged patch series by Pavel Emelyanov:
Alternator start and stop code is sitting inside the main()
and it's a big piece of code out there. Having it all in main
complicates rework of start-stop sequences, it's much more
handy to have it in alternator/.
This set puts the mentioned code into transport- and thrift-
like controller model. While doing it one more call for global
storage service goes away.
* 'br-alternator-clientize' of https://github.com/xemul/scylla:
alternator: Move start-stop code into controller
alternator: Move the whole starting code into a sched group
alternator: Dont capture db, use cfg
alternator: Controller skeleton
alternator: Controller basement
alternator: Drop storage service from executor
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.
Fixes #8830
Tests: unit(release),
dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
Closes #8834
* github.com:scylladb/scylla:
view: fix use-after-move when handling view update failures
db,view: explicitly move the mutation to its helper function
db,view: pass base token by value to mutate_MV
This patch also fixes rare hangs in debug mode for drops_04 without
prevote.
Branch URL: https://github.com/alecco/scylla/tree/raft-fixes-05-v2-dueling
Tests: unit ({dev}), unit ({debug}), unit ({release})
Changes in v2:
- Fixed commit message @kostja
Without prevote, a node disconnected for long enough becomes candidate.
While disconnected (A) it keeps increasing its term.
When it rejoins it disrupts the current leader (C) which steps down due
to the higher term in (A)'s append_entries_reply and (C) also increases
its term.
Meanwhile followers (B) and (D) don't know (C) stepped down but see it
alive according to the current failure detector implementation, and
also (A) has shorter log than them.
So they reject (A)'s vote requests (Raft 4.2.3 Disruptive servers).
Then (C) rejects voting for (A) because it has shorter log.
And (C) becomes candidate but even though (A) votes for (C), the
previous followers (B) and (D) ignore a vote request while leader (C) is
still alive and election timeout has not passed.
(A) and (C) alone can't reach quorum 2/4. So elections never succeed.
This patch addresses this problem by making followers not ignore vote
requests from who they think is the current leader even though
election timeout was not reached.
As @kostja noted, if failure detector would consider a leader alive only
as long as it sends heartbeats (append requests) this patch is no longer
needed.
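The rule can be sketched as a small predicate. This is an illustration under assumptions and not the actual Raft FSM code: `follower_view`, its fields, `no_leader` and `should_process_vote_request` are hypothetical names.

```cpp
#include <cstdint>

using server_id = std::uint64_t;
constexpr server_id no_leader = 0; // hypothetical "no known leader" marker

// Hypothetical snapshot of what a follower knows when a vote request arrives.
struct follower_view {
    server_id current_leader; // who the follower believes is the leader
    int election_elapsed;     // ticks since the election timer was last reset
    int election_timeout;     // ticks after which the leader is considered gone
};

// Returns true if the vote request should be processed rather than ignored.
bool should_process_vote_request(const follower_view& f, server_id candidate) {
    const bool leader_believed_alive =
        f.current_leader != no_leader && f.election_elapsed < f.election_timeout;
    if (!leader_believed_alive) {
        return true; // no live leader known: vote requests are always considered
    }
    // A vote request coming from the believed leader itself means it has
    // stepped down, so it is not ignored even before the election timeout.
    return candidate == f.current_leader;
}
```

In the scenario above, (B) and (D) would keep ignoring (A) but would now consider a request from (C).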
Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210611172734.254757-1-alejo.sanchez@scylladb.com>
This patch fixes the following compilation warning:
database.cc:430:33: warning: 'update_shares_for_class' is deprecated:
Use io_priority_class.update_shares [-Wdeprecated-declarations]
_inflight_update = engine().update_shares_for_class(_io_priority,
uint32_t(shares));
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Closes #8751
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.
Refs #8830
Tests: unit(release),
dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.
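A minimal sketch of the idea, using stand-in types: `mutation` here is a toy struct, the signature of `apply_to_remote_endpoints` is simplified and not the real one, and `send_both` is a hypothetical caller.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for the real mutation type.
struct mutation {
    std::string data;
};

// Taking `mutation&&` forces callers to write std::move(), so it is visible
// at the call site that the argument is consumed.
void apply_to_remote_endpoints(std::vector<mutation>& sent, mutation&& mut) {
    sent.push_back(std::move(mut));
}

// Hypothetical caller that sends both a local and a remote update.
std::size_t send_both(mutation mut, std::vector<mutation>& local, std::vector<mutation>& remote) {
    local.push_back(mut);                              // copy: `mut` is still needed below
    apply_to_remote_endpoints(remote, std::move(mut)); // explicit move on the last use
    return local.size() + remote.size();
}
```

Passing an lvalue without `std::move()` no longer compiles, which is what makes an accidental use-after-move harder to write.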
The base token is passed cross-continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.
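The lifetime issue can be shown with a deferred callback standing in for a continuation. `token`, `make_continuation` and `run_later` are hypothetical, and `std::function` is only an approximation of a Seastar continuation.

```cpp
#include <cstdint>
#include <functional>

// Toy stand-in for the real token type.
struct token {
    std::int64_t value;
};

// The token is taken and captured by value, so the deferred step owns its own
// copy and does not depend on the caller's object staying alive.
std::function<std::int64_t()> make_continuation(token base_token) {
    return [base_token] { return base_token.value; };
}

// Runs the deferred step after the original token has gone out of scope.
std::int64_t run_later() {
    std::function<std::int64_t()> k;
    {
        token t{42};
        k = make_continuation(t);
    } // `t` is destroyed here; a reference to it would now dangle
    return k();
}
```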
* seastar 4506b878...813eee3e (12):
> reactor: fix race with boost::barrier destructor during smp initialialization
> Merge "Merge io-group and io-queue configs" from Pavel E
> tests: add test for skipping data from a socket
> tests: transform socket_test into a test suite
> .gitignore: Add tags
> tls: retain handshake error and return original problem on repeated failures
> iostream: fix skipping from closed sockets
> gitignore .cooking_memory
> Merge 'metrics: Fix dtest->ulong conversion error' from Benny Halevy
> io_priority_class: Make update_shares const
> Remove <seastar/core/apply.hh>
> smp: allow having multiple instances of the smp class
The fix to make io_priority::update_shares() const will allow getting
rid of one of the compilation warnings.
This patch fixes the following warnings:
service/priority_manager.cc:30:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
: _commitlog_priority(engine().register_one_priority_class("commitlog", 1000))
service/priority_manager.cc:31:35: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
, _mt_flush_priority(engine().register_one_priority_class("memtable_flush", 1000))
service/priority_manager.cc:32:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
, _streaming_priority(engine().register_one_priority_class("streaming", 200))
service/priority_manager.cc:33:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
, _sstable_query_read(engine().register_one_priority_class("query", 1000))
service/priority_manager.cc:34:37: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
, _compaction_priority(engine().register_one_priority_class("compaction", 1000))
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
This patch fixes the following warning:
main.cc:307:53: warning: 'capacity' is deprecated: modern I/O queues
should use a property file [-Wdeprecated-declarations]
auto capacity = engine().get_io_queue().capacity();
It's fine to just check --max-io-requests directly because seastar
sets io_queue::capacity to the value of this parameter anyway.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Incremental selection may not work properly for LCS and ICS due to a
use-after-free bug in partitioned set which came into existence after
compound set was introduced.
The use-after-free happens because partitioned set wasn't taking into
account that the next position can become the current position in the
next iteration, which will be used by all selectors managed by
compound set. So if next position was freed while it was being used
as current position, subsequent selectors would find the current
position freed, making them produce incorrect results.
Fix this by moving ownership of next pos from incremental_selector_impl
to incremental_selector, which makes it more robust as the latter knows
better when the selection is done with the next pos. incremental_selector
will still return ring_position_view to avoid copies.
Fixes #8802.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>
To handle a read request from a client, the coordinator node must send
data and digest requests to replicas, reconcile the obtained results
(by merging the obtained mutations and comparing digests), and possibly
send more requests to replicas if the digests turned out to be different
in order to perform read repair and preserve consistency of observed reads.
In contrast to writes, where coordinators send their mutation write requests
to all replicas in the replica set, for reads the coordinators send
their requests only to as many replicas as is required to achieve
the desired CL.
For example consider RF=3 and a CL=QUORUM read. Then the coordinator sends
its request to a subset of 2 nodes out of the 3 possible replicas. The
choice of the 2-node subset is random; the distribution used for the
random roll is affected by certain things such as the "cache hitrate"
metric. The details are not that relevant for this discussion.
If not all of the initially chosen replicas
answer within a certain time period, the coordinator may send an
additional request to one more replica, hoping that this replica helps
achieve the desired CL so the entire client request succeeds. This
mechanism is called "speculative retry" and is enabled by default.
This time period - call it `T` - is chosen based on keyspace
configuration. The default value is "99.0PERCENTILE", which means that
`T` is roughly equal to the 99th percentile of the latency distribution
of previous requests (or at least the most recent requests; the
algorithm uses an exponential decay strategy to make old requests less
relevant for the metric). The latencies used are the durations of whole
coordinator read requests: each such duration measurement starts before
the first replica request is sent and ends after the last replica
request is answered, among the replica requests whose results were used
for the reconciled result returned to the client (there may be more
requests sent later "in the background" - they don't affect the client
result and are not taken into account for the latency measurement).
This strategy, however, gives an undesired effect which appears
when a significant part of all requests require a speculative retry to
succeed. To explain this effect it's best to consider a scenario which
takes this to the extreme - where *all* requests require a speculative retry.
Consider RF=3 and CL=QUORUM so each read request initially uses 2
replicas. Let {A, B, C} be the set of replicas. We run a uniformly
distributed read workload.
Initially the cluster operates normally. Roughly 1/3 of all requests go
to replicas {A, B}, 1/3 go to {A, C}, and 1/3 go to {B, C}. The 99th
percentile of read request latencies is 50ms. Suppose that the average
round-trip latency between a coordinator and any replica is 10ms.
Suddenly replica C is hard-killed: non-graceful shutdown, e.g. power
outage. This means that other nodes are initially not aware that C is down,
they must wait for the failure detector to convict C as unavailable
which happens after a configurable amount of time. The current default
is 20s, meaning that by default coordinators will still attempt to send
requests to C for 20s after it is hard-killed.
During this period the following happens:
- About 2/3 of all requests - the ones which were routed to {A, C} and
{B, C} - do not finish within 50ms because C does not answer. For
these requests to finish, the coordinator performs a speculative retry
to the third replica which finishes after ~10ms (the average round-trip
latency). Thus the entire request, from the coordinator's POV, takes ~60ms.
- Eventually (very quickly in fact - assuming there are many concurrent
requests) the P99 latency rises to 60ms.
- Furthermore, the requests which initially use {A, C} and {B, C} start
taking more than 2/3 of all requests because they are stuck in the foreground
longer than the {A, B} requests (since their latencies are higher).
- These requests do not finish within 60ms. Thus coordinators perform
speculative retries. Thus they finish after ~70ms.
- Eventually the P99 latency rises to 70ms.
- These bad requests take an even longer portion of all requests.
- These requests do not finish within 70ms. They finish after ~80ms.
- Eventually the P99 latency rises to 80ms.
- And so on.
In metrics, we observe the following:
- Latencies rise roughly linearly. They rise until they hit a certain limit;
this limit comes from the fact that `T` is upper-bounded by the
read request timeout parameter divided by 2. Thus if the read request
timeout is `5s` and P99 latencies are `3s`, `T` will be `2.5s`, not `3s`.
Thus eventually all requests will take about `2.5s + 10ms` to finish
(`2.5s` until speculative retry happens, `10ms` for the last round-trip),
unless the node is marked as DOWN before we reach that limit.
- Throughput decreases roughly proportionally to the y = 1/x function, as
expected from Little's law.
Everything goes back to normal when nodes mark C as DOWN, which happens
after ~20s by default as explained above. Then coordinators start
routing all requests to {A, B} only.
This does not happen for graceful shutdowns, where C announces to the
cluster that it's shutting down before shutting down, causing other
nodes to mark it as DOWN almost immediately.
The root cause of the issue is a feedback loop in the metric used to
calculate `T`: we perform a speculative retry after `T` -> P99 request
latencies rise above `T + 10ms` -> `T` rises above `T + 10ms` -> etc.
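The loop can be reproduced with a toy model. It assumes, as in the scenario above, that P99 ends up at `T` plus one round-trip in every round and that `T` is capped at half the read timeout; `simulate_threshold` and its parameters are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Returns the value of `T` (in ms) used in each successive round.
std::vector<int> simulate_threshold(int initial_p99_ms, int rtt_ms,
                                    int read_timeout_ms, int rounds) {
    std::vector<int> history;
    int p99 = initial_p99_ms;
    for (int i = 0; i < rounds; ++i) {
        // `T` follows P99 but is upper-bounded by half the read timeout.
        const int t = std::min(p99, read_timeout_ms / 2);
        history.push_back(t);
        // Requests that need a retry finish one round-trip after `T`,
        // and that latency becomes the next P99.
        p99 = t + rtt_ms;
    }
    return history;
}
```

Starting from 50ms with a 10ms round-trip, the model yields 50, 60, 70, 80 and so on until the cap is reached.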
We fix the problem by changing the measurements used for calculating
`T`. Instead of measuring the entire coordinator read latency, we
measure each replica request separately and take the maximum over these
measurements. We only take into account the measurements for requests
that actually contributed to the request's result.
The previous statistic would also measure the latencies of failed requests. Now we
measure only latencies of successful replica requests. Indeed this makes
sense for the speculative retry use case; the idea behind speculative retry
is that we assume that requests usually succeed within a certain time
period, and we should perform the retry if they take longer than that.
To measure this time period, taking failed requests into account doesn't
make much sense.
In the scenario above, for a request that initially goes to {A, C}, the
following would happen after applying the fix:
- We send the requests to A and C.
- After ~10ms A responds. We record the ~10ms measurement.
- After ~50ms we perform speculative retry, sending a request to B.
- After ~10ms B responds. We record the ~10ms measurement.
The maximum over recorded measurements is ~10ms, not ~60ms.
The feedback loop is removed.
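The difference between the two measurements can be checked on the numbers above. This is a sketch with made-up types: `replica_request`, `whole_read_latency`, `max_replica_latency` and `scenario_requests` are hypothetical and do not mirror the coordinator code.

```cpp
#include <algorithm>
#include <vector>

// One replica request as seen by the coordinator (times in ms).
struct replica_request {
    int sent_ms;
    int answered_ms; // ignored when `used` is false
    bool used;       // did the answer contribute to the client result?
};

// Old statistic: from the first send to the last answer that was used.
int whole_read_latency(const std::vector<replica_request>& reqs) {
    if (reqs.empty()) {
        return 0;
    }
    int first_sent = reqs.front().sent_ms;
    int last_answer = reqs.front().sent_ms;
    for (const auto& r : reqs) {
        first_sent = std::min(first_sent, r.sent_ms);
        if (r.used) {
            last_answer = std::max(last_answer, r.answered_ms);
        }
    }
    return std::max(0, last_answer - first_sent);
}

// New statistic: maximum over the individual latencies of used requests.
int max_replica_latency(const std::vector<replica_request>& reqs) {
    int best = 0;
    for (const auto& r : reqs) {
        if (r.used) {
            best = std::max(best, r.answered_ms - r.sent_ms);
        }
    }
    return best;
}

// The {A, C} scenario: A answers after 10ms, C never answers,
// B is contacted at 50ms and answers at 60ms.
std::vector<replica_request> scenario_requests() {
    return {{0, 10, true}, {0, -1, false}, {50, 60, true}};
}
```

For this scenario the old statistic gives 60ms and the new one gives 10ms.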
Experiments show that the solution is effective: in scenarios like
above, after C is killed, latencies only rise slightly by a constant
amount and then maintain their level, as expected. Throughput also drops
by a constant amount and maintains its level instead of continuously
dropping with an asymptote at 0.
Fixes #3746.
Fixes #7342.
Closes #8783
This series adds a new configuration option -
restrict_replication_simplestrategy - which can be used to restrict the
ability to use SimpleStrategy in a CREATE KEYSPACE or ALTER KEYSPACE
statement. This is part of a new effort (dubbed "safe mode") to allow an
installation to restrict operations which are un-recommended or dangerous
(see issue #8586 for why SimpleStrategy is bad).
The new restrict_replication_simplestrategy option has three values:
"true", "false", and "warn":
For the time being, the default is still "false", which means SimpleStrategy is not
restricted, and can still be used freely.
Setting a value of "true" means that SimpleStrategy *is* restricted -
trying to create a keyspace with it will fail:
cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
'SimpleStrategy', 'replication_factor': 1 };
ConfigurationException: SimpleStrategy replication class is not
recommended, and forbidden by the current configuration. Please use
NetworkToplogyStrategy instead. You may also override this restriction
with the restrict_replication_simplestrategy=false configuration
option.
Trying to ALTER an existing keyspace to use SimpleStrategy will
similarly fail.
The value "warn" allows - like "false" - SimpleStrategy to be used, but
produces a warning when used to create a keyspace. This warning appears
in the CREATE/ALTER KEYSPACE statement's response (an interactive cqlsh
user will see this warning), and also in Scylla's logs. For example:
cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
'SimpleStrategy', 'replication_factor': 1 };
Warnings :
SimpleStrategy replication class is not recommended, but was used for
keyspace try1. The restrict_replication_simplestrategy configuration
option can be changed to silence this warning or make it into an error.
Fixes #8586
Closes #8765
* github.com:scylladb/scylla:
cql: create_keyspace_statement: move logger out of header file
cql: allow restricting SimpleStrategy in ALTER KEYSPACE
cql: allow restricting SimpleStrategy in CREATE KEYSPACE
config: add configuration option restrict_replication_simplestrategy
config: add "tri_mode_restriction" type of configurable value
utils/enum_option.hh: add implicit converter to the underlying enum
Move the logger declaration from the header file into the only source
file that uses it.
This is just a small cleanup similar to what the previous patch did in
alter_keyspace_statement.{cc,hh}.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In the previous patch we made CREATE KEYSPACE honor the
"restrict_replication_simplestrategy" option. In this patch we do the
same for ALTER KEYSPACE.
We use the same function check_restricted_replication_strategy()
used in CREATE KEYSPACE for the logic of what to allow depending on the
configuration, and what errors or warnings to generate.
One of the non-self-explanatory changes in this patch is to execute():
Previously, alter_keyspace_statement inherited its execute() from
schema_altering_statement. Now we need to override it to check if the
operation is forbidden before running schema_altering_statement's execute()
or to warn after it is run. In the previous patch we didn't need to add
a new execute() for create_keyspace_statement because we already had one.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch uses the configuration option which we added in the previous
patch, "restrict_replication_simplestrategy", to control whether a user
can use the SimpleStrategy replication strategy in a CREATE KEYSPACE
operation. The next patch will do the same for ALTER KEYSPACE.
As a tri_mode_restriction, the restrict_replication_simplestrategy option
has three values - "true", "false", and "warn":
The value "false", which today is still the default, means that
SimpleStrategy is not restricted, and can still be used freely.
The value "true" means that SimpleStrategy *is* restricted - trying to
create a keyspace with it will fail:
cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
'SimpleStrategy', 'replication_factor': 1 };
ConfigurationException: SimpleStrategy replication class is not
recommended, and forbidden by the current configuration. Please use
NetworkToplogyStrategy instead. You may also override this restriction
with the restrict_replication_simplestrategy=false configuration
option.
The value "warn" allows - like "false" - SimpleStrategy to be used, but
produces a warning when used to create a keyspace. This warning appears
in the CREATE KEYSPACE statement's response (an interactive cqlsh
user will see this warning), and also in Scylla's logs. For example:
cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
'SimpleStrategy', 'replication_factor': 1 };
Warnings :
SimpleStrategy replication class is not recommended, but was used for
keyspace try1. The restrict_replication_simplestrategy configuration
option can be changed to silence this warning or make it into an error.
Because we plan to use the same checks and the same error messages
also for ALTER KEYSPACE (in the next patch), we encapsulate this logic in
a function check_restricted_replication_strategy() which we will use for
ALTER KEYSPACE as well.
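The shape of such a shared check can be sketched as follows. This is not the real check_restricted_replication_strategy(): `restriction_mode` and `check_simple_strategy` are hypothetical, the strategy is compared by its short name only, and `std::invalid_argument` stands in for ConfigurationException.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical three-state setting mirroring "false", "true" and "warn".
enum class restriction_mode { allow, forbid, warn };

// Returns a warning to attach to the response, or nothing if there is none.
// Throws if the strategy is forbidden by the configuration.
std::optional<std::string> check_simple_strategy(const std::string& keyspace,
                                                 const std::string& strategy_class,
                                                 restriction_mode mode) {
    if (strategy_class != "SimpleStrategy" || mode == restriction_mode::allow) {
        return std::nullopt;
    }
    if (mode == restriction_mode::forbid) {
        throw std::invalid_argument(
            "SimpleStrategy replication class is forbidden by the current configuration");
    }
    return "SimpleStrategy replication class is not recommended, but was used for keyspace " + keyspace;
}
```

Both statements could then call one function before executing and attach the returned warning, if any, to their response.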
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a configuration option to choose whether the
SimpleStrategy replication strategy is restricted. It is a
tri_mode_restriction, allowing to restrict this strategy (true), to allow
it (false), or to just warn when it is used (warn).
After this patch, the option exists but doesn't yet do anything.
It will be used in the following two patches to restrict the
CREATE KEYSPACE and ALTER KEYSPACE operations, respectively.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a new type of configurable value for our command-line
and YAML parsers - a "tri_mode_restriction" - which can be set to three
values: "true", "false", or "warn".
We will use this value type for many (but not all) of the restriction
options that we plan to start adding in the following patches.
Restriction options will allow users to ask Scylla to restrict (true),
to not restrict (false) or to warn about (warn) certain dangerous or
undesirable operations.
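A three-valued option of this kind could be parsed roughly as below. The names `tri_mode` and `parse_tri_mode` are hypothetical, and the real type may accept more spellings than the three listed here.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical enum for the three accepted values.
enum class tri_mode { off, on, warn };

// Maps the textual value from the command line or YAML file to the enum.
tri_mode parse_tri_mode(const std::string& s) {
    if (s == "true") {
        return tri_mode::on;
    }
    if (s == "false") {
        return tri_mode::off;
    }
    if (s == "warn") {
        return tri_mode::warn;
    }
    throw std::invalid_argument("expected true, false or warn, got: " + s);
}
```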
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add an implicit converter of the enum_option to the underlying enum
it is holding. This is needed for using switch() on an enum_option.
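Why the converter matters for switch() can be seen in a reduced example. `option_holder` is a hypothetical stand-in for enum_option, and `mode` and `describe` are made up for illustration.

```cpp
#include <string>

// Hypothetical wrapper holding an enum value, in the spirit of enum_option.
template <typename Enum>
class option_holder {
    Enum _value;
public:
    option_holder(Enum v) : _value(v) {}
    // Implicit conversion to the wrapped enum; without it, switch() on the
    // wrapper does not compile.
    operator Enum() const { return _value; }
};

enum class mode { off, on, warn };

std::string describe(option_holder<mode> m) {
    switch (m) { // uses the implicit conversion to `mode`
    case mode::off:
        return "off";
    case mode::on:
        return "on";
    case mode::warn:
        return "warn";
    }
    return "unknown";
}
```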
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We check that the number of open files is sufficient for normal
work (with lots of connections and sstables), but we can improve
it a little. Systemd sets up a low file soft limit by default (so that
select() doesn't break on file descriptors larger than 1023) and
recommends[1] raising the soft limit to the more generous hard limit
if the application doesn't use select(), as ours does not.
Follow the recommendation and bump the limit. Note that this applies
only to scylla started from the command line, as systemd integration
already raises the soft limit.
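Raising the soft limit to the hard limit takes only a pair of POSIX calls. The sketch below uses a hypothetical helper name, `raise_nofile_soft_limit`, and is not the code added by this patch.

```cpp
#include <sys/resource.h>

// Raises the soft RLIMIT_NOFILE to the hard limit.
// Returns true if the soft limit equals the hard limit afterwards.
bool raise_nofile_soft_limit() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        return false;
    }
    if (lim.rlim_cur == lim.rlim_max) {
        return true; // nothing to do
    }
    lim.rlim_cur = lim.rlim_max; // an unprivileged process may raise soft up to hard
    return setrlimit(RLIMIT_NOFILE, &lim) == 0;
}
```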
[1] http://0pointer.net/blog/file-descriptor-limits.html
Closes #8756
Add test coverage inspired by etcd for non-voter servers,
and fix issues discovered when testing.
* scylla-dev/raft-learner-test-v4:
raft: (testing) test non-voter can vote
raft: (testing) test receiving a confchange in a snapshot
raft: (testing) test voter-non-voter config change loop
raft: (testing) test non-voter doesn't start election on election timeout
raft: (testing) test what happens when a learner gets TimeoutNow
raft: (testing) implement a test for a leader becoming non-voter
raft: style fix
raft: step down as a leader if converted to a non-voter
raft: improve configuration consistency checks
raft: (testing) test that non-voter stays in PIPELINE mode
raft: (testing) always return fsm_debug in create_follower()
This is a reproducer for issue #8827, that checks that a client which
tries to connect to Scylla with an unsupported version of SSL or TLS
gets the expected error alert - not some sort of unexpected EOF.
Issue #8827 is still open, so this test is still xfailing. However,
I verified that with a fix for this issue, the test passes.
The test also prints which protocol versions worked - so it also helps
checking issue #8837 (about the ancient SSL protocol being allowed).
Refs #8837
Refs #8827
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210610151714.1746330-1-nyh@scylladb.com>
The recent seastar update moved the variable again, so to have a
proper support for it we'd need to have 2 try-catch attempts and
a default. Or 1 try-catch, but make sure the maintainer commits
this patch AND seastar update in one go, so that the intermediate
variable doesn't creep into an intermediate commit. Or bear the
scylla-gdb test is not bisect-safe a little bit.
Instead of making this complex choise I suggest to just drop the
volatile variable from the script at all. This thing is actually
a constant derived from the latency goal and io-properties.yaml
file, so it can be calculated without gdb help (unlike run-time
bits like group rovers or numbers of queued/executing resources).
To free developers from doing all this math by hand there's an
"ioinfo" tool that (when run with correct options) prints the
results of this math on the screen.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210610120151.1135-1-xemul@scylladb.com>
This move is not "just move", but also includes:
- putting the whole thing into seastar::async()
- switch from locally captured dependencies into controller's
class members
- making smp_service_groups optional because it doesn't have
default constructor and should somehow survive on constructed
controller until its start()
Also copy few bits from main that can be generalized later:
- get_or_default() helper from main
- sharded_parameter lambda for cdc
- net family and preferred thing from main
( this also fixed the indentation broken by previous patch )
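The optional smp_service_group member mentioned above follows a common pattern for members without a default constructor. In this sketch `service_group` is a toy stand-in for the Seastar type and the `controller` shown is hypothetical, not the alternator one.

```cpp
#include <optional>

// Toy stand-in for a type that has no default constructor.
class service_group {
    int _id;
public:
    explicit service_group(int id) : _id(id) {}
    int id() const { return _id; }
};

// The controller can be constructed before the group exists; the member
// stays empty until start() creates it.
class controller {
    std::optional<service_group> _ssg;
public:
    bool started() const { return _ssg.has_value(); }
    void start(int id) { _ssg.emplace(id); }
    void stop() { _ssg.reset(); }
    int group_id() const { return _ssg->id(); } // precondition: started()
};
```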
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The controller won't have the database_config at hand to get
the sched group from. All other client services run the whole
controller start in the needed sched group, so prepare the
alternator controller for that.
To make it compile (and while-at-it) also move up the sharded
server and executor instances and the smp_service_group. All
of these will migrate onto the controller in the next patch.
( the indentation is deliberately left broken )
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When .init()ing the server one needs to provide the
max_concurrent_requests_per_shard value from config.
Instead of carrying the database around for it -- use the
db::config itself which is at hand. All the shards share
its instance anyway.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add the controller class with all the needed dependencies. For
now completely unused (thus a bunch of (void)-s here and there).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add header and source file for transport- (and thrift-) like controller
that'll do all the bookkeeping needed to start and stop this client
service.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>