Commit Graph

26935 Commits

Author SHA1 Message Date
Botond Dénes
d2ddaced4e test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
The lifecycle of the reader lifecycle policy and all the resources the
reads use is now enclosed in that of the multishard reader thanks to its
close() method. We can now remove all the workarounds we had in place to
keep different resources as long as background reader cleanup finishes.
2021-06-16 11:29:36 +03:00
Botond Dénes
5a271e42a5 test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
So that when this method returns the semaphore is safe to destroy. This
in turn will enable us to get rid of all the machinery we have in place
to deal with the semaphore having to out-live the lifecycle policy
without a clear time as to when it can be safe to destroy.
2021-06-16 11:29:36 +03:00
Botond Dénes
c09c62a0fb test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
The test reader lifecycle policy has a mode in which it wants to ensure
all inactive readers are evicted, so tests can stress reader recreation
logic. For this it currently employs a trick of creating a waiter on the
semaphore. I don't know how this works (or whether it works at all),
but it certainly complicates the lifecycle policy code a lot.
So switch to the much more reliable and simple method of creating the
semaphore with a single count and no memory. This ensures that all
inactive reads are immediately evicted, while still allowing a single read
to be admitted at all times.
2021-06-16 11:29:36 +03:00
Botond Dénes
578a092e4a reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
To prevent use-after-free resulting from any permit out-living the
semaphore.
2021-06-16 11:29:36 +03:00
Botond Dénes
a10a6e253e test/lib/reader_lifecycle_policy: fix indentation
Left broken from the previous patch.
2021-06-16 11:29:36 +03:00
Botond Dénes
8c7447effd mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
Currently shard_reader::close() (its caller) goes to the remote shard,
copies back all fragments left there to the local shard, then calls
`destroy_reader()`, which in the case of the multishard mutation query
copies it all back to the native shard. This was required before because
`shard_reader::stop()` (`close()`'s predecessor) couldn't wait on
`smp::submit_to()`. But close can, so we can get rid of all this
back-and-forth and just call `destroy_reader()` on the shard the reader
lives on, just like we do with `create_reader()`.
2021-06-16 11:29:35 +03:00
Botond Dénes
4ecf061c90 reader_lifecycle_policy implementations: fix indentation
Left broken from the previous patch.
2021-06-16 11:21:38 +03:00
Botond Dénes
a7e59d3e2c mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
The shard reader is now able to wait on the stopped reader and pass the
already stopped reader to `destroy_reader()`, so we can de-futurize the
reader parameter of said method. The shard reader was already patched to
pass a ready future so adjusting the call-site is trivial.
The most prominent implementation, the multishard mutation query, can
now also drop its `_dismantling_gate` which was put in place so it can
wait on the background stopping of readers.

A consequence of this move is that errors that might happen during the
stopping of the reader are now handled in the shard reader, not in each
lifecycle policy implementation.
2021-06-16 11:21:38 +03:00
Botond Dénes
13d7806b62 mutation_reader: shard_reader::close(): wait on the remote reader
We now have a future<> returning close() method so we don't need to
do the cleanup of the remote reader in the background, detaching it
from the shard-reader under destruction. We can now wait for the
cleanup properly before the shard reader is destroyed and just pass the
stopped reader to reader_lifecycle_policy::destroy_reader(). This patch
does the first part -- moving the cleanup to the foreground, the API
change of said method will come in the next patch.
2021-06-16 11:21:38 +03:00
Botond Dénes
ab8d2a04a5 multishard_mutation_query: destroy remote parts in the foreground
Currently the foreign fields of the reader meta are destroyed in the
background via the foreign pointer's destructor (with one exception).
This makes the already complicated life-cycle of these parts and their
dependencies even harder to reason about, especially in tests, where
even things like semaphores live only within the test.
This patch makes sure to destroy all these remote fields in the
foreground in either `save_reader()` or `stop()`, ensuring that once
`stop()` returns, everything is cleaned up.
2021-06-16 11:21:38 +03:00
Botond Dénes
7552cc73cf mutation_reader: shard_reader::close(): close _reader
The reason we got away without closing _reader so far is that it is an
`std::unique_ptr<evictable_reader>` which is a
`flat_mutation_reader::impl` instance, without the
`flat_mutation_reader` wrapper, which contains the validations for
close.
2021-06-16 11:21:33 +03:00
Botond Dénes
98e5f0429b mutation_reader: reader_lifecycle_policy::destroy_reader(): remove out-of-date comment
About the multishard reader not being able to wait on returned future.
It can now via the `close()` method.
2021-06-15 15:23:32 +03:00
Tomasz Grabiec
9d49a26e79 Merge "raft: randomized_nemesis_test: tick servers less often than the network in basic_test" from Kamil
Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.

However, to closely simulate a production environment, we want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.

We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).

To support these use cases we first generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.

We use this new functionality to tick Raft servers less often than the
network in basic_test.

This patchset effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. This approach is unfortunately incompatible with the
approach taken there. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.

The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.

The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.

* kbr/tick-network-often-v4:
  raft: randomized_nemesis_test: generalize `ticker` to take a set of functions
  raft: randomized_nemesis_test: split `environment::tick` into two functions
  raft: randomized_nemesis_test: fix potential use-after-free in basic_test
2021-06-15 01:54:57 +02:00
Kamil Braun
8f1caa6a90 raft: randomized_nemesis_test: generalize ticker to take a set of functions
... with associated calling periods and use the new API in `basic_test`.

Previously `ticker` would use a single function, `on_tick`, which it
called in a loop with yields in-between. In `basic_test` we would use
this to tick every object in synchrony.

However, to closely simulate a production environment, we may want the
tick ratios to be different. For example Raft servers should be ticked
rarely compared to the network.

We may also want to give the Seastar reactor more space between the
function calls (e.g. if they cause a bunch of work to be created for the
reactor that needs more than one tick to complete).

To support these use cases we generalize `ticker` to take a set of
functions with associated numbers. These numbers are the call periods of
their corresponding functions: given {n, f}, `f` will be called each
`n`th tick.

We also modify `basic_test` to use this new approach: we tick Raft
servers once per 10 network ticks (in particular, once per 10 reactor
yields).
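
The `{n, f}` scheme described above can be sketched as follows. This is a
minimal illustration, not the test's actual code: the `ticker` class, its
constructor and `is_due()` are hypothetical names, ticks are assumed to be
counted from 1, and the yield to the Seastar reactor between ticks is only
indicated by a comment.

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// True when a function with call period `period` is due on tick number `tick`.
// Assumption: ticks are counted from 1, so a period of 1 means "every tick".
constexpr bool is_due(std::uint64_t tick, std::uint64_t period) {
    return period != 0 && tick % period == 0;
}

// Hypothetical ticker: holds {period, function} pairs and calls each function
// on every `period`-th tick.
class ticker {
    using entry = std::pair<std::uint64_t, std::function<void()>>;
    std::vector<entry> _calls;
    std::uint64_t _tick = 0;
public:
    explicit ticker(std::vector<entry> calls) : _calls(std::move(calls)) {}

    // Advance by one tick and run every function that is due.
    // The real ticker would yield to the reactor between ticks.
    void tick() {
        ++_tick;
        for (auto& call : _calls) {
            if (is_due(_tick, call.first)) {
                call.second();
            }
        }
    }
};
```

Registering `{1, tick_network}` and `{10, tick_servers}` (both hypothetical
callables) and calling `tick()` 100 times would run the first one 100 times
and the second one 10 times.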

This commit effectively reverts 01b6a2eb38
which caused the ticker to call `on_tick` only when the Seastar reactor
had no work to do. This approach is unfortunately incompatible with the
approach taken there. We *do* want the ticker to race with other work,
potentially producing more work while already scheduled work is executing,
and we want to see in tests what happens when we adjust the ticking ratios
of different subsystems.

The previous approach also had a problem where if there was an infinite task
loop executing, the ticker wouldn't ever tick.

The previous fix was introduced since the ticker caused too much work to
be produced (so the reactor couldn't keep up) due to ticking the Raft
servers too often (after each yield). This commit deals with the problem
in a different way, by ticking the servers rarely, which also resembles
"real-life" scenarios better.

With this change we must also wait a bit longer for the first node to
elect itself as a leader at the beginning of the test.
2021-06-14 16:54:38 +02:00
Kamil Braun
c0b80f1f8a raft: randomized_nemesis_test: split environment::tick into two functions
One for ticking the network and one for ticking the servers.
2021-06-14 16:54:38 +02:00
Kamil Braun
f42776aded raft: randomized_nemesis_test: fix potential use-after-free in basic_test
The test starts by waiting a certain number of ticks for the first node
to elect itself as a leader.

If this wait times out - i.e. the number of ticks passes before the node
manages to elect itself - the future associated with the task which checks
for the leader condition becomes discarded (it is passed to
`with_timeout`) and the task may keep using the `environment` (which it
has a reference to) even after the `environment` is destroyed.

Furthermore, the aforementioned task is a coroutine which uses lambda
captures in its body. Leaving `with_timeout` destroys the lambda object,
causing the coroutine to refer to no-longer-existing captures.

We fix the problems by:
- making `environment` `weakly_referencable` and checking if it's alive
  before it's used inside the task,
- not capturing anything in the lambda but passing whatever's needed as
  function arguments (so these things get allocated inside the coroutine
  frame).
2021-06-14 16:54:38 +02:00
Nadav Har'El
3645c7104b Merge: Wrap alternator start-stop into controller
Merged patch series by Pavel Emelyanov:

Alternator start and stop code sits inside main(), where it is
a big piece of code. Having it all in main complicates reworking
the start-stop sequences; it is much handier to have it in
alternator/.

This set moves the mentioned code into a controller model like the
one used by transport and thrift. While doing so, one more call to
the global storage service goes away.

* 'br-alternator-clientize' of https://github.com/xemul/scylla:
  alternator: Move start-stop code into controller
  alternator: Move the whole starting code into a sched group
  alternator: Dont capture db, use cfg
  alternator: Controller skeleton
  alternator: Controller basement
  alternator: Drop storage service from executor
2021-06-14 15:44:10 +03:00
Michael Livshin
15b0e5c4d2 sstables: count read range tombstones
Refs #7749.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210602152210.17948-2-michael.livshin@scylladb.com>
2021-06-14 14:37:33 +02:00
Michael Livshin
9ef2317248 row_cache: count range tombstones processed during read
Refs #7749.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210602152210.17948-1-michael.livshin@scylladb.com>
2021-06-14 14:29:05 +02:00
Nadav Har'El
6726fe79b6 Merge 'view: fix use-after-move when handling view update failures' from Piotr Sarna
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Fixes #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)

Closes #8834

* github.com:scylladb/scylla:
  view: fix use-after-move when handling view update failures
  db,view: explicitly move the mutation to its helper function
  db,view: pass base token by value to mutate_MV
2021-06-14 13:15:35 +03:00
Alejo Sanchez
5c8092cf42 raft: fix election with disruptive candidate
This patch also fixes rare hangs in debug mode for drops_04 without
prevote.

Branch URL: https://github.com/alecco/scylla/tree/raft-fixes-05-v2-dueling

Tests: unit ({dev}), unit ({debug}), unit ({release})

Changes in v2:
    - Fixed commit message                               @kostja

Without prevote, a node disconnected for long enough becomes candidate.
While disconnected (A) it keeps increasing its term.
When it rejoins it disrupts the current leader (C) which steps down due
to the higher term in (A)'s append_entries_reply and (C) also increases
its term.

Meanwhile followers (B) and (D) don't know (C) stepped down but see it
alive according to the current failure detector implementation, and
also (A) has a shorter log than them.
So they reject (A)'s vote requests (Raft 4.2.3 Disruptive servers).

Then (C) rejects voting for (A) because it has shorter log.
And (C) becomes candidate but even though (A) votes for (C), the
previous followers (B) and (D) ignore a vote request while leader (C) is
still alive and election timeout has not passed.

(A) and (C) alone can't reach quorum 2/4. So elections never succeed.

This patch addresses this problem by making followers not ignore vote
requests from who they think is the current leader even though the
election timeout was not reached.
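
The rule described above can be sketched as a single predicate. This is a
simplified illustration under stated assumptions, not the actual Raft
implementation: `server_id`, `follower_view` and
`should_consider_vote_request()` are hypothetical names, and the term and log
comparisons that decide whether the vote is actually granted are left out.

```cpp
#include <cstdint>

using server_id = std::uint64_t;  // hypothetical identifier type

// Simplified view of what a follower knows when a vote request arrives.
struct follower_view {
    server_id current_leader;       // who the follower believes is the leader
    bool leader_seen_alive;         // failure detector still reports the leader alive
    bool election_timeout_elapsed;  // election timeout passed since last leader contact
};

// Returns true if the follower should consider the vote request at all.
constexpr bool should_consider_vote_request(const follower_view& f, server_id candidate) {
    // The fix: a vote request from the server we believe to be the leader
    // means it has stepped down, so it must not be ignored.
    if (candidate == f.current_leader) {
        return true;
    }
    // Disruptive-server protection: ignore other candidates while the leader
    // looks alive and the election timeout has not elapsed.
    return !f.leader_seen_alive || f.election_timeout_elapsed;
}
```

In the scenario from the message, (B) and (D) would now consider the vote
request from (C), while still ignoring the one from (A).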

As @kostja noted, if the failure detector considered a leader alive only
as long as it sends heartbeats (append requests), this patch would no
longer be needed.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210611172734.254757-1-alejo.sanchez@scylladb.com>
2021-06-14 11:07:38 +02:00
Piotr Jastrzebski
1ed92e37f8 database: Fix warning about deprecated update_shares_for_class usage
This patch fixes the following compilation warning:

database.cc:430:33: warning: 'update_shares_for_class' is deprecated:
Use io_priority_class.update_shares [-Wdeprecated-declarations]
    _inflight_update = engine().update_shares_for_class(_io_priority,
    uint32_t(shares));

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>

Closes #8751
2021-06-14 10:42:22 +03:00
Piotr Sarna
8a049c9116 view: fix use-after-move when handling view update failures
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Refs #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
2021-06-14 09:36:10 +02:00
Piotr Sarna
7cdbb7951a db,view: explicitly move the mutation to its helper function
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value from it,
which is confusing and prone to errors. Since the value is moved-from,
let's pass it to the helper function as rvalue ref explicitly.
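
A minimal sketch of the idea, assuming a stand-in `mutation` type and a
simplified, hypothetical signature for `apply_to_remote_endpoints` (the real
function takes more parameters than shown here). Taking the parameter as an
rvalue reference forces callers to spell out the ownership transfer with
`std::move()`:

```cpp
#include <string>
#include <type_traits>
#include <utility>
#include <vector>

// Stand-in for the real mutation type; only movability matters here.
struct mutation {
    std::string payload;
};

// Hypothetical, simplified signature: the rvalue reference tells callers that
// `mut` is consumed, so they must write std::move() at the call site.
void apply_to_remote_endpoints(mutation&& mut, std::vector<mutation>& sent) {
    sent.push_back(std::move(mut));
}

using helper_t = decltype(&apply_to_remote_endpoints);

// An rvalue argument is accepted; passing a plain lvalue does not compile.
static_assert(std::is_invocable_v<helper_t, mutation&&, std::vector<mutation>&>);
static_assert(!std::is_invocable_v<helper_t, mutation&, std::vector<mutation>&>);
```

With the previous by-reference signature, a call site gave no hint that the
argument would be left in a moved-from state.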
2021-06-14 09:34:40 +02:00
Piotr Sarna
88d4a66e90 db,view: pass base token by value to mutate_MV
The base token is passed across continuations, so the current way
of passing it by const reference probably only works because the token
copying is cheap enough to optimize the reference out.
Fix by explicitly taking the token by value.
2021-06-14 09:30:38 +02:00
Nadav Har'El
6a8441ef03 Update seastar submodule
* seastar 4506b878...813eee3e (12):
  > reactor: fix race with boost::barrier destructor during smp initialialization
  > Merge "Merge io-group and io-queue configs" from Pavel E
  > tests: add test for skipping data from a socket
  > tests: transform socket_test into a test suite
  > .gitignore: Add tags
  > tls: retain handshake error and return original problem on repeated failures
  > iostream: fix skipping from closed sockets
  > gitignore .cooking_memory
  > Merge 'metrics: Fix dtest->ulong conversion error' from Benny Halevy
  > io_priority_class: Make update_shares const
  > Remove <seastar/core/apply.hh>
  > smp: allow having multiple instances of the smp class

The fix to make io_priority::update_shares() const will allow getting
rid of one of the compilation warnings.
2021-06-14 10:27:14 +03:00
Nadav Har'El
061e43e9d4 Merge 'Fix some compilation warnings' from Piotr Jastrzębski
Closes #8850

* github.com:scylladb/scylla:
  priority_manager: Fix warnings about deprecated register_one_priority_class usage
  main: Fix warning about deprecated usage of io_queue::capacity
2021-06-14 10:05:27 +03:00
Piotr Jastrzebski
831a60a6cd priority_manager: Fix warnings about deprecated register_one_priority_class usage
This patch fixes following warnings:
service/priority_manager.cc:30:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
    : _commitlog_priority(engine().register_one_priority_class("commitlog", 1000))

service/priority_manager.cc:31:35: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
    , _mt_flush_priority(engine().register_one_priority_class("memtable_flush", 1000))

service/priority_manager.cc:32:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
    , _streaming_priority(engine().register_one_priority_class("streaming", 200))

service/priority_manager.cc:33:36: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
    , _sstable_query_read(engine().register_one_priority_class("query", 1000))

service/priority_manager.cc:34:37: warning: 'register_one_priority_class' is deprecated: Use io_priority_class::register_one [-Wdeprecated-declarations]
    , _compaction_priority(engine().register_one_priority_class("compaction", 1000))

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-14 08:49:46 +02:00
Piotr Jastrzebski
3ec04433f7 main: Fix warning about deprecated usage of io_queue::capacity
This patch fixes the following warning:

main.cc:307:53: warning: 'capacity' is deprecated: modern I/O queues
should use a property file [-Wdeprecated-declarations]
            auto capacity = engine().get_io_queue().capacity();

It's fine to just check --max-io-requests directly because seastar
sets io_queue::capacity to the value of this parameter anyway.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-14 08:49:42 +02:00
Raphael S. Carvalho
846f0bd16e sstables: Fix incremental selection with compound sstable set
Incremental selection may not work properly for LCS and ICS due to a
use-after-free bug in partitioned set which came into existence after
compound set was introduced.

The use-after-free happens because partitioned set wasn't taking into
account that the next position can become the current position in the
next iteration, which will be used by all selectors managed by
compound set. So if the next position was freed while it was being used
as the current position, subsequent selectors would find the current
position freed, making them produce incorrect results.

Fix this by moving ownership of next pos from incremental_selector_impl
to incremental_selector, which makes it more robust as the latter knows
better when the selection is done with the next pos. incremental_selector
will still return ring_position_view to avoid copies.

Fixes #8802.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>
2021-06-13 16:45:07 +03:00
Kamil Braun
9e85921006 storage_proxy: remove a feedback loop from the speculative retry latency metric
To handle a read request from a client, the coordinator node must send
data and digest requests to replicas, reconcile the obtained results
(by merging the obtained mutations and comparing digests), and possibly
send more requests to replicas if the digests turned out to be different
in order to perform read repair and preserve consistency of observed reads.

In contrast to writes, where coordinators send their mutation write requests
to all replicas in the replica set, for reads the coordinators send
their requests only to as many replicas as is required to achieve
the desired CL.

For example consider RF=3 and a CL=QUORUM read. Then the coordinator sends
its request to a subset of 2 nodes out of the 3 possible replicas. The
choice of the 2-node subset is random; the distribution used for the
random roll is affected by certain things such as the "cache hitrate"
metric. The details are not that relevant for this discussion.

If not all of the initially chosen replicas
answer within a certain time period, the coordinator may send an
additional request to one more replica, hoping that this replica helps
achieve the desired CL so the entire client request succeeds. This
mechanism is called "speculative retry" and is enabled by default.

This time period - call it `T` - is chosen based on keyspace
configuration. The default value is "99.0PERCENTILE", which means that
`T` is roughly equal to the 99th percentile of the latency distribution
of previous requests (or at least the most recent requests; the
algorithm uses an exponential decay strategy to make old requests less
relevant for the metric). The latencies used are the durations of whole
coordinator read requests: each such duration measurement starts before
the first replica request is sent and ends after the last replica
request is answered, among the replica requests whose results were used
for the reconciled result returned to the client (there may be more
requests sent later "in the background" - they don't affect the client
result and are not taken into account for the latency measurement).

This strategy, however, gives an undesired effect which appears
when a significant part of all requests require a speculative retry to
succeed. To explain this effect it's best to consider a scenario which
takes this to the extreme - where *all* requests require a speculative retry.

Consider RF=3 and CL=QUORUM so each read request initially uses 2
replicas. Let {A, B, C} be the set of replicas. We run a uniformly
distributed read workload.

Initially the cluster operates normally. Roughly 1/3 of all requests go
to replicas {A, B}, 1/3 go to {A, C}, and 1/3 go to {B, C}. The 99th
percentile of read request latencies is 50ms. Suppose that the average
round-trip latency between a coordinator and any replica is 10ms.

Suddenly replica C is hard-killed: non-graceful shutdown, e.g. power
outage. This means that other nodes are initially not aware that C is down;
they must wait for the failure detector to convict C as unavailable
which happens after a configurable amount of time. The current default
is 20s, meaning that by default coordinators will still attempt to send
requests to C for 20s after it is hard-killed.

During this period the following happens:
- About 2/3 of all requests - the ones which were routed to {A, C} and
  {B, C} - do not finish within 50ms because C does not answer. For
  these requests to finish, the coordinator performs a speculative retry
  to the third replica which finishes after ~10ms (the average round-trip
  latency). Thus the entire request, from the coordinator's POV, takes ~60ms.
- Eventually (very quickly in fact - assuming there are many concurrent
  requests) the P99 latency rises to 60ms.
- Furthermore, the requests which initially use {A, C} and {B, C} start
  making up more than 2/3 of all requests because they are stuck in the foreground
  longer than the {A, B} requests (since their latencies are higher).
- These requests do not finish within 60ms. Thus coordinators perform
  speculative retries. Thus they finish after ~70ms.
- Eventually the P99 latency rises to 70ms.
- These bad requests make up an even larger portion of all requests.
- These requests do not finish within 70ms. They finish after ~80ms.
- Eventually the P99 latency rises to 80ms.
- And so on.

In metrics, we observe the following:
- Latencies rise roughly linearly. They rise until they hit a certain limit;
  this limit comes from the fact that `T` is upper-bounded by the
  read request timeout parameter divided by 2. Thus if the read request
  timeout is `5s` and P99 latencies are `3s`, `T` will be `2.5s`, not `3s`.
  Thus eventually all requests will take about `2.5s + 10ms` to finish
  (`2.5s` until speculative retry happens, `10ms` for the last round-trip),
  unless the node is marked as DOWN before we reach that limit.
- Throughput decreases roughly proportionally to the y = 1/x function, as
  expected from Little's law.
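
The 1/x relationship follows from Little's law (in-flight requests =
throughput * latency). A small sketch, assuming a closed-loop workload with a
constant number of in-flight requests; that assumption and the function name
`throughput_per_second()` are illustrative and not stated in this message:

```cpp
#include <cstdint>

// Little's law: in_flight = throughput * latency, so with a fixed number of
// in-flight requests, throughput = in_flight / latency.
// Integer division truncates the result towards zero.
constexpr std::int64_t throughput_per_second(std::int64_t in_flight, std::int64_t latency_ms) {
    return in_flight * 1000 / latency_ms;
}
```

With 100 requests in flight, a latency of 50ms gives 2000 requests per
second, and doubling the latency to 100ms halves that to 1000.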

Everything goes back to normal when nodes mark C as DOWN, which happens
after ~20s by default as explained above. Then coordinators start
routing all requests to {A, B} only.

This does not happen for graceful shutdowns, where C announces to the
cluster that it's shutting down before shutting down, causing other
nodes to mark it as DOWN almost immediately.

The root cause of the issue is a feedback loop in the metric used to
calculate `T`: we perform a speculative retry after `T` -> P99 request
latencies rise above `T + 10ms` -> `T` rises above `T + 10ms` -> etc.

We fix the problem by changing the measurements used for calculating
`T`. Instead of measuring the entire coordinator read latency, we
measure each replica request separately and take the maximum over these
measurements. We only take into account the measurements for requests
that actually contributed to the request's result.

The previous statistic would also measure failed requests latencies. Now we
measure only latencies of successful replica requests. Indeed this makes
sense for the speculative retry use case; the idea behind speculative retry
is that we assume that requests usually succeed within a certain time
period, and we should perform the retry if they take longer than that.
To measure this time period, taking failed requests into account doesn't
make much sense.

In the scenario above, for a request that initially goes to {A, C}, the
following would happen after applying the fix:
- We send the requests to A and C.
- After ~10ms A responds. We record the ~10ms measurement.
- After ~50ms we perform speculative retry, sending a request to B.
- After ~10ms B responds. We record the ~10ms measurement.

The maximum over recorded measurements is ~10ms, not ~60ms.
The feedback loop is removed.
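
The difference between the two metrics can be illustrated with a toy model.
This is a sketch under loud assumptions, not the storage_proxy code: all
names are hypothetical, the 10ms and 50ms constants come from the example
above, and the model assumes retried reads are common enough that their
recorded sample sets the P99 whenever it exceeds the healthy baseline.

```cpp
#include <algorithm>
#include <cstdint>

constexpr std::int64_t round_trip_ms = 10;    // replica round trip from the example
constexpr std::int64_t baseline_p99_ms = 50;  // healthy P99 from the example

// Latency sample recorded for a read that needed a speculative retry after
// waiting `threshold_ms` for the dead replica.
constexpr std::int64_t recorded_sample_ms(std::int64_t threshold_ms, bool per_replica_metric) {
    // New metric: maximum over the successful replica requests (A and B),
    // each of which takes one round trip.
    // Old metric: the whole coordinator read, i.e. wait for T, then one more
    // round trip.
    return per_replica_metric
        ? std::max(round_trip_ms /* A */, round_trip_ms /* B */)
        : threshold_ms + round_trip_ms;
}

// Threshold T after `rounds` recomputations of the P99 (toy model, see above).
constexpr std::int64_t threshold_after(int rounds, std::int64_t read_timeout_ms,
                                       bool per_replica_metric) {
    std::int64_t t = baseline_p99_ms;
    for (int i = 0; i < rounds; ++i) {
        const std::int64_t p99 =
            std::max(baseline_p99_ms, recorded_sample_ms(t, per_replica_metric));
        t = std::min(p99, read_timeout_ms / 2);  // T is capped at half the read timeout
    }
    return t;
}
```

In this model, with a 5s read timeout, the old metric moves `T` from 50ms to
60ms, 70ms, 80ms and so on until it hits the 2.5s cap, while the per-replica
metric keeps `T` at 50ms.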

Experiments show that the solution is effective: in scenarios like
above, after C is killed, latencies only rise slightly by a constant
amount and then maintain their level, as expected. Throughput also drops
by a constant amount and maintains its level instead of continuously
dropping with an asymptote at 0.

Fixes #3746.
Fixes #7342.

Closes #8783
2021-06-13 16:19:11 +03:00
Avi Kivity
d6f3a62c13 Merge 'Add option to forbid SimpleStrategy in CREATE/ALTER KEYSPACE' from Nadav Har'El
This series adds a new configuration option -
restrict_replication_simplestrategy - which can be used to restrict the
ability to use SimpleStrategy in a CREATE KEYSPACE or ALTER KEYSPACE
statement. This is part of a new effort (dubbed "safe mode") to allow an
installation to restrict operations which are un-recommended or dangerous
(see issue #8586 for why SimpleStrategy is bad).

The new restrict_replication_simplestrategy option has three values:
"true", "false", and "warn":

For the time being, the default is still "false", which means SimpleStrategy is not
restricted, and can still be used freely.

Setting a value of "true" means that SimpleStrategy *is* restricted -
trying to create a keyspace with it will fail:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    ConfigurationException: SimpleStrategy replication class is not
    recommended, and forbidden by the current configuration. Please use
    NetworkToplogyStrategy instead. You may also override this restriction
    with the restrict_replication_simplestrategy=false configuration
    option.

Trying to ALTER an existing keyspace to use SimpleStrategy will
similarly fail.

The value "warn" allows - like "false" - SimpleStrategy to be used, but
produces a warning when used to create a keyspace. This warning appears
in the CREATE/ALTER KEYSPACE statement's response (an interactive cqlsh
user will see this warning), and also in Scylla's logs. For example:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    Warnings :
    SimpleStrategy replication class is not recommended, but was used for
    keyspace try1. The restrict_replication_simplestrategy configuration
    option can be changed to silence this warning or make it into an error.

Fixes #8586

Closes #8765

* github.com:scylladb/scylla:
  cql: create_keyspace_statement: move logger out of header file
  cql: allow restricting SimpleStrategy in ALTER KEYSPACE
  cql: allow restricting SimpleStrategy in CREATE KEYSPACE
  config: add configuration option restrict_replication_simplestrategy
  config: add "tri_mode_restriction" type of configurable value
  utils/enum_option.hh: add implicit converter to the underlying enum
2021-06-13 15:39:18 +03:00
Nadav Har'El
6f813bd3a1 cql: create_keyspace_statement: move logger out of header file
Move the logger declaration from the header file into the only source
file that uses it.

This is just a small cleanup similar to what the previous patch did in
alter_keyspace_statement.{cc,hh}.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:40 +03:00
Nadav Har'El
dea075c038 cql: allow restricting SimpleStrategy in ALTER KEYSPACE
In the previous patch we made CREATE KEYSPACE honor the
"restrict_replication_simplestrategy" option. In this patch we do the
same for ALTER KEYSPACE.

We use the same function check_restricted_replication_strategy()
used in CREATE KEYSPACE for the logic of what to allow depending on the
configuration, and what errors or warnings to generate.

One of the non-self-explanatory changes in this patch is to execute():
Previously, alter_keyspace_statement inherited its execute() from
schema_altering_statement. Now we need to override it to check if the
operation is forbidden before running schema_altering_statement's execute()
or to warn after it is run. In the previous patch we didn't need to add
a new execute() for create_keyspace_statement because we already had one.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:40 +03:00
Nadav Har'El
b9539d7135 cql: allow restricting SimpleStrategy in CREATE KEYSPACE
This patch uses the configuration option which we added in the previous
patch, "restrict_replication_simplestrategy", to control whether a user
can use the SimpleStrategy replication strategy in a CREATE KEYSPACE
operation. The next patch will do the same for ALTER KEYSPACE.

As a tri_mode_restriction, the restrict_replication_simplestrategy option
has three values - "true", "false", and "warn":

The value "false", which today is still the default, means that
SimpleStrategy is not restricted, and can still be used freely.

The value "true" means that SimpleStrategy *is* restricted - trying to
create a keyspace with it will fail:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    ConfigurationException: SimpleStrategy replication class is not
    recommended, and forbidden by the current configuration. Please use
    NetworkToplogyStrategy instead. You may also override this restriction
    with the restrict_replication_simplestrategy=false configuration
    option.

The value "warn" allows - like "false" - SimpleStrategy to be used, but
produces a warning when used to create a keyspace. This warning appears
in the CREATE KEYSPACE statement's response (an interactive cqlsh
user will see this warning), and also in Scylla's logs. For example:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    Warnings :
    SimpleStrategy replication class is not recommended, but was used for
    keyspace try1. The restrict_replication_simplestrategy configuration
    option can be changed to silence this warning or make it into an error.

Because we plan to use the same checks and the same error messages
also for ALTER KEYSPACE (in the next patch), we encapsulate this logic in
a function check_restricted_replication_strategy() which we will use for
ALTER KEYSPACE as well.
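
The three-way behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the enum `restriction_mode`, the function `check_simple_strategy()` and its signature are hypothetical names, not the actual check_restricted_replication_strategy() from the patch, and the message texts are shortened.

```cpp
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the "true" / "false" / "warn" option values.
enum class restriction_mode { disabled, enabled, warn };

// Hypothetical helper (not the real Scylla function): throws when the
// strategy is forbidden, returns a warning text in "warn" mode, and
// returns std::nullopt when there is nothing to report.
std::optional<std::string> check_simple_strategy(restriction_mode mode,
                                                 const std::string& strategy,
                                                 const std::string& keyspace) {
    if (strategy != "SimpleStrategy") {
        return std::nullopt;  // only SimpleStrategy is subject to the restriction
    }
    switch (mode) {
    case restriction_mode::enabled:
        throw std::invalid_argument(
            "SimpleStrategy replication class is forbidden by the current configuration");
    case restriction_mode::warn:
        return "SimpleStrategy replication class is not recommended, but was used for keyspace "
               + keyspace;
    case restriction_mode::disabled:
        break;
    }
    return std::nullopt;
}
```

A caller shared by both statements would let the exception propagate as the error response and attach any returned text to the response as a warning.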

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:25 +03:00
Nadav Har'El
8a4ac6914a config: add configuration option restrict_replication_simplestrategy
This patch adds a configuration option to choose whether the
SimpleStrategy replication strategy is restricted. It is a
tri_mode_restriction, allowing to restrict this strategy (true), to allow
it (false), or to just warn when it is used (warn).

After this patch, the option exists but doesn't yet do anything.
It will be used in the following two patches to restrict the
CREATE KEYSPACE and ALTER KEYSPACE operations, respectively.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:16 +03:00
Nadav Har'El
a3d6f502ad config: add "tri_mode_restriction" type of configurable value
This patch adds a new type of configurable value for our command-line
and YAML parsers - a "tri_mode_restriction" - which can be set to three
values: "true", "false", or "warn".

We will use this value type for many (but not all) of the restriction
options that we plan to start adding in the following patches.
Restriction options will allow users to ask Scylla to restrict (true),
to not restrict (false) or to warn about (warn) certain dangerous or
undesirable operations.
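
As a rough sketch of what parsing such a three-valued option might look like (the names `restriction_mode` and `parse_restriction_mode()` are assumptions made for illustration and are not taken from the patch):

```cpp
#include <optional>
#include <string_view>

// Hypothetical enum for the three accepted values.
enum class restriction_mode { disabled, enabled, warn };

// Maps the textual option value to a mode. Any other text yields
// std::nullopt so the caller can report a configuration error.
std::optional<restriction_mode> parse_restriction_mode(std::string_view text) {
    if (text == "true") {
        return restriction_mode::enabled;
    }
    if (text == "false") {
        return restriction_mode::disabled;
    }
    if (text == "warn") {
        return restriction_mode::warn;
    }
    return std::nullopt;
}
```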

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:44:20 +03:00
Nadav Har'El
afacffc556 utils/enum_option.hh: add implicit converter to the underlying enum
Add an implicit converter of the enum_option to the underlying enum
it is holding. This is needed for using switch() on an enum_option.
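
To illustrate why the converter matters: switch() accepts a class-type operand only if it converts implicitly to an integral or enumeration type. The sketch below uses a hypothetical wrapper `color_option` and enum `color`; it is not the real enum_option template.

```cpp
#include <string>

enum class color { red, green };

// Hypothetical wrapper standing in for an option that holds an enum.
class color_option {
    color _value;
public:
    color_option(color v) : _value(v) {}
    // Non-explicit conversion operator: this is what lets switch()
    // treat the wrapper as the enum it holds.
    operator color() const { return _value; }
};

std::string describe(const color_option& opt) {
    switch (opt) {  // opt is implicitly converted to `color` here
    case color::red:
        return "red";
    case color::green:
        return "green";
    }
    return "unknown";
}
```

Marking the operator `explicit` would make the switch() above fail to compile, since the conversion must be implicit.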

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 13:18:49 +03:00
Avi Kivity
ec60f44b64 main: improve process file limit handling
We check that the number of open files is sufficient for normal
work (with lots of connections and sstables), but we can improve
it a little. Systemd sets up a low file soft limit by default (so that
select() doesn't break on file descriptors larger than 1023) and
recommends[1] raising the soft limit to the more generous hard limit
if the application doesn't use select(), as ours does not.

Follow the recommendation and bump the limit. Note that this applies
only to scylla started from the command line, as systemd integration
already raises the soft limit.
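
A minimal sketch of raising the soft limit to the hard limit with the POSIX getrlimit()/setrlimit() calls follows. The function name is hypothetical and this is not the code from the patch, which may handle errors and logging differently.

```cpp
#include <sys/resource.h>

// Hypothetical helper: raises the soft RLIMIT_NOFILE to the hard limit.
// Returns true if the soft limit equals the hard limit afterwards.
bool raise_nofile_soft_limit_to_hard() {
    struct rlimit lim;
    if (getrlimit(RLIMIT_NOFILE, &lim) != 0) {
        return false;  // could not read the current limits
    }
    if (lim.rlim_cur == lim.rlim_max) {
        return true;   // nothing to do
    }
    lim.rlim_cur = lim.rlim_max;  // an unprivileged process may raise soft up to hard
    return setrlimit(RLIMIT_NOFILE, &lim) == 0;
}
```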

[1] http://0pointer.net/blog/file-descriptor-limits.html

Closes #8756
2021-06-13 09:19:35 +03:00
Tomasz Grabiec
7521301b72 Merge "raft: add tests for non-voters and fix related bugs" from Kostja
Add test coverage inspired by etcd for non-voter servers,
and fix issues discovered when testing.

* scylla-dev/raft-learner-test-v4:
  raft: (testing) test non-voter can vote
  raft: (testing) test receiving a confchange in a snapshot
  raft: (testing) test voter-non-voter config change loop
  raft: (testing) test non-voter doesn't start election on election timeout
  raft: (testing) test what happens when a learner gets TimeoutNow
  raft: (testing) implement a test for a leader becoming non-voter
  raft: style fix
  raft: step down as a leader if converted to a non-voter
  raft: improve configuration consistency checks
  raft: (testing) test that non-voter stays in PIPELINE mode
  raft: (testing) always return fsm_debug in create_follower()
2021-06-12 21:36:47 +03:00
Botond Dénes
cb208a56f2 docs/guides/debugging.md: expand section on libthread-db
Fix a typo in enabling libthread-db debugging.

Add command line snippet which can enable libthread-db debugging on
startup.

Split the long wall of text about likely problems into separate
per-problem subsections.

Add sub-section about recently found Fedora bug(?)
https://bugzilla.redhat.com/show_bug.cgi?id=1960867.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210603150607.378277-1-bdenes@scylladb.com>
2021-06-12 21:36:47 +03:00
Nadav Har'El
9774c146cc cql-pytest: add test for connecting with different SSL/TLS versions
This is a reproducer for issue #8827, that checks that a client which
tries to connect to Scylla with an unsupported version of SSL or TLS
gets the expected error alert - not some sort of unexpected EOF.

Issue #8827 is still open, so this test is still xfailing. However,
I verified that with a fix for this issue, the test passes.

The test also prints which protocol versions worked - so it also helps
checking issue #8837 (about the ancient SSL protocol being allowed).

Refs #8837
Refs #8827

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210610151714.1746330-1-nyh@scylladb.com>
2021-06-12 21:36:47 +03:00
Pavel Emelyanov
7b1f2d91a5 scylla-gdb: Remove maximum-request-size report
The recent seastar update moved the variable again, so to have a
proper support for it we'd need to have 2 try-catch attempts and
a default. Or 1 try-catch, but make sure the maintainer commits
this patch AND seastar update in one go, so that the intermediate
variable doesn't creep into an intermediate commit. Or accept that the
scylla-gdb test is not bisect-safe for a little while.

Instead of making this complex choice I suggest just dropping the
volatile variable from the script altogether. This thing is actually
a constant derived from the latency goal and io-properties.yaml
file, so it can be calculated without gdb help (unlike run-time
bits like group rovers or numbers of queued/executing resources).
To free developers from doing all this math by hand there's an
"ioinfo" tool that (when run with correct options) prints the
results of this math on the screen.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210610120151.1135-1-xemul@scylladb.com>
2021-06-11 19:06:43 +02:00
Michael Livshin
2bbc293e22 tests: improve error reporting of test_env::reusable_sst()
Distinguish the "no such sstable" case from any reading errors.

While at it, coroutinize the function.

Refs #8785.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210610113304.264922-1-michael.livshin@scylladb.com>
2021-06-11 19:06:43 +02:00
Pavel Emelyanov
fbd98e6292 alternator: Move start-stop code into controller
This move is not "just move", but also includes:

- putting the whole thing into seastar::async()
- switch from locally captured dependencies into controller's
  class members
- making smp_service_groups optional because it doesn't have
  default constructor and should somehow survive on constructed
  controller until its start()
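
The "optional member" pattern from the last item can be sketched as below. The classes `service_group` and `controller` here are simplified, hypothetical stand-ins; the real seastar smp_service_group and alternator controller look different.

```cpp
#include <optional>

// Hypothetical resource without a default constructor.
class service_group {
    int _id;
public:
    explicit service_group(int id) : _id(id) {}
    int id() const { return _id; }
};

// Hypothetical controller: the member stays empty after construction
// and is only created in start(), so no default constructor is needed.
class controller {
    std::optional<service_group> _group;
public:
    void start(int id) { _group.emplace(id); }  // construct in place
    void stop() { _group.reset(); }             // destroy the resource
    bool started() const { return _group.has_value(); }
    int group_id() const { return _group.value().id(); }  // throws if not started
};
```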

Also copy a few bits from main that can be generalized later:

- get_or_default() helper from main
- sharded_parameter lambda for cdc
- net family and preferred thing from main

( this also fixed the indentation broken by previous patch )

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:17:27 +03:00
Pavel Emelyanov
9e2ad77436 alternator: Move the whole starting code into a sched group
The controller won't have the database_config at hand to get
the sched group from. All other client services run the whole
controller start in the needed sched group, so prepare the
alternator controller for that.

To make it compile (and while-at-it) also move up the sharded
server and executor instances and the smp_service_group. All
of these will migrate onto the controller in the next patch.

( the indentation is deliberately left broken )

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:11:02 +03:00
Pavel Emelyanov
f918a75572 alternator: Dont capture db, use cfg
When .init()ing the server one needs to provide the
max_concurrent_requests_per_shard value from config.

Instead of carrying the database around for it -- use the
db::config itself which is at hand. All the shards share
its instance anyway.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:09:16 +03:00
Pavel Emelyanov
4aad618409 alternator: Controller skeleton
Add the controller class with all the needed dependencies. For
now completely unused (thus a bunch of (void)-s here and there).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:08:37 +03:00
Pavel Emelyanov
316e9af234 alternator: Controller basement
Add header and source file for transport- (and thrift-) like controller
that'll do all the bookkeeping needed to start and stop this client
service.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:06:10 +03:00
Pavel Emelyanov
773d2fe2a4 alternator: Drop storage service from executor
It's completely unused in it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-11 18:05:11 +03:00