Commit Graph

26903 Commits

Piotr Sarna
8a049c9116 view: fix use-after-move when handling view update failures
The code was susceptible to use-after-move if both local
and remote updates were going to be sent.
The whole routine for sending view updates is now rewritten
to avoid use-after-move.

Refs #8830
Tests: unit(release),
       dtest(secondary_indexes_test.py:TestSecondaryIndexes.test_remove_node_during_index_build)
2021-06-14 09:36:10 +02:00
Piotr Sarna
7cdbb7951a db,view: explicitly move the mutation to its helper function
The `apply_to_remote_endpoints` helper function used to take
its `mut` parameter by reference, but then moved the value out of it,
which is confusing and error-prone. Since the value is moved from,
let's pass it to the helper function as an rvalue reference explicitly.
2021-06-14 09:34:40 +02:00
Piotr Sarna
88d4a66e90 db,view: pass base token by value to mutate_MV
The base token is passed across continuations, so the current way
of passing it by const reference probably only works because copying
the token is cheap enough for the compiler to optimize the reference out.
Fix by explicitly taking the token by value.
2021-06-14 09:30:38 +02:00
Raphael S. Carvalho
846f0bd16e sstables: Fix incremental selection with compound sstable set
Incremental selection may not work properly for LCS and ICS due to a
use-after-free bug in the partitioned set, which came into existence after
the compound set was introduced.

The use-after-free happens because the partitioned set wasn't taking into
account that the next position can become the current position in the
next iteration, where it is used by all selectors managed by the
compound set. So if the next position was freed while still in use
as the current position, subsequent selectors would find the current
position freed, making them produce incorrect results.

Fix this by moving ownership of next pos from incremental_selector_impl
to incremental_selector, which makes it more robust as the latter knows
better when the selection is done with the next pos. incremental_selector
will still return ring_position_view to avoid copies.

Fixes #8802.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>
2021-06-13 16:45:07 +03:00
Kamil Braun
9e85921006 storage_proxy: remove a feedback loop from the speculative retry latency metric
To handle a read request from a client, the coordinator node must send
data and digest requests to replicas, reconcile the obtained results
(by merging the obtained mutations and comparing digests), and possibly
send more requests to replicas if the digests turned out to be different
in order to perform read repair and preserve consistency of observed reads.

In contrast to writes, where coordinators send their mutation write requests
to all replicas in the replica set, for reads the coordinators send
their requests only to as many replicas as is required to achieve
the desired CL.

For example consider RF=3 and a CL=QUORUM read. Then the coordinator sends
its request to a subset of 2 nodes out of the 3 possible replicas. The
choice of the 2-node subset is random; the distribution used for the
random roll is affected by certain things such as the "cache hitrate"
metric. The details are not that relevant for this discussion.

If not all of the initially chosen replicas
answer within a certain time period, the coordinator may send an
additional request to one more replica, hoping that this replica helps
achieving the desired CL so the entire client request succeeds. This
mechanism is called "speculative retry" and is enabled by default.

This time period - call it `T` - is chosen based on keyspace
configuration. The default value is "99.0PERCENTILE", which means that
`T` is roughly equal to the 99th percentile of the latency distribution
of previous requests (or at least the most recent requests; the
algorithm uses an exponential decay strategy to make old requests less
relevant for the metric). The latencies used are the durations of whole
coordinator read requests: each such duration measurement starts before
the first replica request is sent and ends after the last replica
request is answered, among the replica requests whose results were used
for the reconciled result returned to the client (there may be more
requests sent later "in the background" - they don't affect the client
result and are not taken into account for the latency measurement).

This strategy, however, gives an undesired effect which appears
when a significant part of all requests require a speculative retry to
succeed. To explain this effect it's best to consider a scenario which
takes this to the extreme - where *all* requests require a speculative retry.

Consider RF=3 and CL=QUORUM so each read request initially uses 2
replicas. Let {A, B, C} be the set of replicas. We run a uniformly
distributed read workload.

Initially the cluster operates normally. Roughly 1/3 of all requests go
to replicas {A, B}, 1/3 go to {A, C}, and 1/3 go to {B, C}. The 99th
percentile of read request latencies is 50ms. Suppose that the average
round-trip latency between a coordinator and any replica is 10ms.

Suddenly replica C is hard-killed: non-graceful shutdown, e.g. power
outage. This means that other nodes are initially not aware that C is down,
they must wait for the failure detector to convict C as unavailable
which happens after a configurable amount of time. The current default
is 20s, meaning that by default coordinators will still attempt to send
requests to C for 20s after it is hard-killed.

During this period the following happens:
- About 2/3 of all requests - the ones which were routed to {A, C} and
  {B, C} - do not finish within 50ms because C does not answer. For
  these requests to finish, the coordinator performs a speculative retry
  to the third replica which finishes after ~10ms (the average round-trip
  latency). Thus the entire request, from the coordinator's POV, takes ~60ms.
- Eventually (very quickly in fact - assuming there are many concurrent
  requests) the P99 latency rises to 60ms.
- Furthermore, the requests which initially use {A, C} and {B, C} start
  making up more than 2/3 of all in-flight requests because they are stuck
  in the foreground longer than the {A, B} requests (since their latencies are higher).
- These requests do not finish within 60ms. Thus coordinators perform
  speculative retries. Thus they finish after ~70ms.
- Eventually the P99 latency rises to 70ms.
- These bad requests take an even longer portion of all requests.
- These requests do not finish within 70ms. They finish after ~80ms.
- Eventually the P99 latency rises to 80ms.
- And so on.

In metrics, we observe the following:
- Latencies rise roughly linearly. They rise until they hit a certain limit;
  this limit comes from the fact that `T` is upper-bounded by the
  read request timeout parameter divided by 2. Thus if the read request
  timeout is `5s` and P99 latencies are `3s`, `T` will be `2.5s`, not `3s`.
  Thus eventually all requests will take about `2.5s + 10ms` to finish
  (`2.5s` until speculative retry happens, `10ms` for the last round-trip),
  unless the node is marked as DOWN before we reach that limit.
- Throughput decreases roughly proportionally to the y = 1/x function, as
  expected from Little's law.

Everything goes back to normal when nodes mark C as DOWN, which happens
after ~20s by default as explained above. Then coordinators start
routing all requests to {A, B} only.

This does not happen for graceful shutdowns, where C announces to the
cluster that it's shutting down before shutting down, causing other
nodes to mark it as DOWN almost immediately.

The root cause of the issue is a feedback loop in the metric used to
calculate `T`: we perform a speculative retry after `T` -> P99 request
latencies rise above `T + 10ms` -> `T` rises above `T + 10ms` -> etc.

We fix the problem by changing the measurements used for calculating
`T`. Instead of measuring the entire coordinator read latency, we
measure each replica request separately and take the maximum over these
measurements. We only take into account the measurements for requests
that actually contributed to the request's result.

The previous statistic would also measure the latencies of failed requests.
Now we measure only the latencies of successful replica requests. Indeed this makes
sense for the speculative retry use case; the idea behind speculative retry
is that we assume that requests usually succeed within a certain time
period, and we should perform the retry if they take longer than that.
To measure this time period, taking failed requests into account doesn't
make much sense.

In the scenario above, for a request that initially goes to {A, C}, the
following would happen after applying the fix:
- We send the requests to A and C.
- After ~10ms A responds. We record the ~10ms measurement.
- After ~50ms we perform speculative retry, sending a request to B.
- After ~10ms B responds. We record the ~10ms measurement.

The maximum over recorded measurements is ~10ms, not ~60ms.
The feedback loop is removed.

Experiments show that the solution is effective: in scenarios like
above, after C is killed, latencies only rise slightly by a constant
amount and then maintain their level, as expected. Throughput also drops
by a constant amount and maintains its level instead of continuously
dropping with an asymptote at 0.

Fixes #3746.
Fixes #7342.

Closes #8783
2021-06-13 16:19:11 +03:00
Avi Kivity
d6f3a62c13 Merge 'Add option to forbid SimpleStrategy in CREATE/ALTER KEYSPACE' from Nadav Har'El
This series adds a new configuration option -
restrict_replication_simplestrategy - which can be used to restrict the
ability to use SimpleStrategy in a CREATE KEYSPACE or ALTER KEYSPACE
statement. This is part of a new effort (dubbed "safe mode") to allow an
installation to restrict operations which are un-recommended or dangerous
(see issue #8586 for why SimpleStrategy is bad).

The new restrict_replication_simplestrategy option has three values:
"true", "false", and "warn":

For the time being, the default is still "false", which means SimpleStrategy is not
restricted, and can still be used freely.

Setting a value of "true" means that SimpleStrategy *is* restricted -
trying to create a keyspace with it will fail:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    ConfigurationException: SimpleStrategy replication class is not
    recommended, and forbidden by the current configuration. Please use
    NetworkTopologyStrategy instead. You may also override this restriction
    with the restrict_replication_simplestrategy=false configuration
    option.

Trying to ALTER an existing keyspace to use SimpleStrategy will
similarly fail.

The value "warn" allows - like "false" - SimpleStrategy to be used, but
produces a warning when used to create a keyspace. This warning appears
in the CREATE/ALTER KEYSPACE statement's response (an interactive cqlsh
user will see this warning), and also in Scylla's logs. For example:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    Warnings :
    SimpleStrategy replication class is not recommended, but was used for
    keyspace try1. The restrict_replication_simplestrategy configuration
    option can be changed to silence this warning or make it into an error.

Fixes #8586

Closes #8765

* github.com:scylladb/scylla:
  cql: create_keyspace_statement: move logger out of header file
  cql: allow restricting SimpleStrategy in ALTER KEYSPACE
  cql: allow restricting SimpleStrategy in CREATE KEYSPACE
  config: add configuration option restrict_replication_simplestrategy
  config: add "tri_mode_restriction" type of configurable value
  utils/enum_option.hh: add implicit converter to the underlying enum
2021-06-13 15:39:18 +03:00
Nadav Har'El
6f813bd3a1 cql: create_keyspace_statement: move logger out of header file
Move the logger declaration from the header file into the only source
file that uses it.

This is just a small cleanup similar to what the previous patch did in
alter_keyspace_statement.{cc,hh}.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:40 +03:00
Nadav Har'El
dea075c038 cql: allow restricting SimpleStrategy in ALTER KEYSPACE
In the previous patch we made CREATE KEYSPACE honor the
"restrict_replication_simplestrategy" option. In this patch we do the
same for ALTER KEYSPACE.

We use the same function check_restricted_replication_strategy()
used in CREATE KEYSPACE for the logic of what to allow depending on the
configuration, and what errors or warnings to generate.

One of the non-self-explanatory changes in this patch is to execute():
Previously, alter_keyspace_statement inherited its execute() from
schema_altering_statement. Now we need to override it to check if the
operation is forbidden before running schema_altering_statement's execute()
or to warn after it is run. In the previous patch we didn't need to add
a new execute() for create_keyspace_statement because we already had one.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:40 +03:00
Nadav Har'El
b9539d7135 cql: allow restricting SimpleStrategy in CREATE KEYSPACE
This patch uses the configuration option which we added in the previous
patch, "restrict_replication_simplestrategy", to control whether a user
can use the SimpleStrategy replication strategy in a CREATE KEYSPACE
operation. The next patch will do the same for ALTER KEYSPACE.

As a tri_mode_restriction, the restrict_replication_simplestrategy option
has three values - "true", "false", and "warn":

The value "false", which today is still the default, means that
SimpleStrategy is not restricted, and can still be used freely.

The value "true" means that SimpleStrategy *is* restricted - trying to
create a keyspace with it will fail:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    ConfigurationException: SimpleStrategy replication class is not
    recommended, and forbidden by the current configuration. Please use
    NetworkTopologyStrategy instead. You may also override this restriction
    with the restrict_replication_simplestrategy=false configuration
    option.

The value "warn" allows - like "false" - SimpleStrategy to be used, but
produces a warning when used to create a keyspace. This warning appears
in the CREATE KEYSPACE statement's response (an interactive cqlsh
user will see this warning), and also in Scylla's logs. For example:

    cqlsh> CREATE KEYSPACE try1 WITH REPLICATION = { 'class' :
           'SimpleStrategy', 'replication_factor': 1 };

    Warnings :
    SimpleStrategy replication class is not recommended, but was used for
    keyspace try1. The restrict_replication_simplestrategy configuration
    option can be changed to silence this warning or make it into an error.

Because we plan to use the same checks and the same error messages
also for ALTER KEYSPACE (in the next patch), we encapsulate this logic in
a function check_restricted_replication_strategy() which we will use for
ALTER KEYSPACE as well.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:25 +03:00
Nadav Har'El
8a4ac6914a config: add configuration option restrict_replication_simplestrategy
This patch adds a configuration option to choose whether the
SimpleStrategy replication strategy is restricted. It is a
tri_mode_restriction, allowing users to restrict this strategy (true),
allow it (false), or just warn when it is used (warn).

After this patch, the option exists but doesn't yet do anything.
It will be used in the following two patches to restrict the
CREATE KEYSPACE and ALTER KEYSPACE operations, respectively.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:45:16 +03:00
Nadav Har'El
a3d6f502ad config: add "tri_mode_restriction" type of configurable value
This patch adds a new type of configurable value for our command-line
and YAML parsers - a "tri_mode_restriction" - which can be set to three
values: "true", "false", or "warn".

We will use this value type for many (but not all) of the restriction
options that we plan to start adding in the following patches.
Restriction options will allow users to ask Scylla to restrict (true),
to not restrict (false) or to warn about (warn) certain dangerous or
undesirable operations.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 14:44:20 +03:00
Nadav Har'El
afacffc556 utils/enum_option.hh: add implicit converter to the underlying enum
Add an implicit converter of the enum_option to the underlying enum
it is holding. This is needed for using switch() on an enum_option.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2021-06-13 13:18:49 +03:00
Avi Kivity
ec60f44b64 main: improve process file limit handling
We check that the number of open files is sufficient for normal
work (with lots of connections and sstables), but we can improve
it a little. Systemd sets up a low soft file limit by default (so that
select() doesn't break on file descriptors larger than 1023) and
recommends[1] raising the soft limit to the more generous hard limit
if the application doesn't use select(), as ours does not.

Follow the recommendation and bump the limit. Note that this applies
only to scylla started from the command line, as systemd integration
already raises the soft limit.

[1] http://0pointer.net/blog/file-descriptor-limits.html

Closes #8756
2021-06-13 09:19:35 +03:00
Tomasz Grabiec
7521301b72 Merge "raft: add tests for non-voters and fix related bugs" from Kostja
Add test coverage inspired by etcd for non-voter servers,
and fix issues discovered when testing.

* scylla-dev/raft-learner-test-v4:
  raft: (testing) test non-voter can vote
  raft: (testing) test receiving a confchange in a snapshot
  raft: (testing) test voter-non-voter config change loop
  raft: (testing) test non-voter doesn't start election on election timeout
  raft: (testing) test what happens when a learner gets TimeoutNow
  raft: (testing) implement a test for a leader becoming non-voter
  raft: style fix
  raft: step down as a leader if converted to a non-voter
  raft: improve configuration consistency checks
  raft: (testing) test that non-voter stays in PIPELINE mode
  raft: (testing) always return fsm_debug in create_follower()
2021-06-12 21:36:47 +03:00
Botond Dénes
cb208a56f2 docs/guides/debugging.md: expand section on libthread-db
Fix a typo in enabling libthread-db debugging.

Add command line snippet which can enable libthread-db debugging on
startup.

Split the long wall of text about likely problems into separate
per-problem subsections.

Add sub-section about recently found Fedora bug(?)
https://bugzilla.redhat.com/show_bug.cgi?id=1960867.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210603150607.378277-1-bdenes@scylladb.com>
2021-06-12 21:36:47 +03:00
Nadav Har'El
9774c146cc cql-pytest: add test for connecting with different SSL/TLS versions
This is a reproducer for issue #8827, that checks that a client which
tries to connect to Scylla with an unsupported version of SSL or TLS
gets the expected error alert - not some sort of unexpected EOF.

Issue #8827 is still open, so this test is still xfailing. However,
I verified that with a fix for this issue, the test passes.

The test also prints which protocol versions worked - so it also helps
checking issue #8837 (about the ancient SSL protocol being allowed).

Refs #8837
Refs #8827

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210610151714.1746330-1-nyh@scylladb.com>
2021-06-12 21:36:47 +03:00
Pavel Emelyanov
7b1f2d91a5 scylla-gdb: Remove maximum-request-size report
The recent seastar update moved the variable again, so to have
proper support for it we'd need two try-catch attempts and
a default. Or one try-catch, but make sure the maintainer commits
this patch AND the seastar update in one go, so that the intermediate
variable doesn't creep into an intermediate commit. Or accept that
the scylla-gdb test is not bisect-safe for a little while.

Instead of making this complex choice, I suggest just dropping the
volatile variable from the script altogether. This thing is actually
a constant derived from the latency goal and the io-properties.yaml
file, so it can be calculated without gdb's help (unlike run-time
bits like group rovers or numbers of queued/executing resources).
To free developers from doing all this math by hand, there's an
"ioinfo" tool that (when run with the correct options) prints the
results of this math on the screen.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210610120151.1135-1-xemul@scylladb.com>
2021-06-11 19:06:43 +02:00
Michael Livshin
2bbc293e22 tests: improve error reporting of test_env::reusable_sst()
Distinguish the "no such sstable" case from any reading errors.

While at it, coroutinize the function.

Refs #8785.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20210610113304.264922-1-michael.livshin@scylladb.com>
2021-06-11 19:06:43 +02:00
Konstantin Osipov
2be8a73c34 raft: (testing) test non-voter can vote
When a non-voter is requested a vote, it must vote
to preserve liveness. In Raft, servers respond
to messages without consulting with their current configuration,
and the non-voter may not have the latest configuration
when it is requested to vote.
2021-06-11 17:16:57 +03:00
Konstantin Osipov
eaf32f2c3c raft: (testing) test receiving a confchange in a snapshot 2021-06-11 17:16:56 +03:00
Konstantin Osipov
d08ad76c24 raft: (testing) test voter-non-voter config change loop 2021-06-11 17:16:55 +03:00
Konstantin Osipov
6e4619fe87 raft: (testing) test non-voter doesn't start election on election timeout 2021-06-11 17:16:55 +03:00
Konstantin Osipov
c8ae13a392 raft: (testing) test what happens when a learner gets TimeoutNow
Once a learner receives TimeoutNow, it becomes a candidate, discovers it
can't vote, doesn't increase its term, and converts back to a
follower. Once entries arrive from a new leader, it updates its
term.
2021-06-11 17:16:55 +03:00
Konstantin Osipov
a972269630 raft: (testing) implement a test for a leader becoming non-voter 2021-06-11 17:16:55 +03:00
Konstantin Osipov
ba046ed1ab raft: style fix 2021-06-11 17:16:54 +03:00
Konstantin Osipov
b0a1ebc635 raft: step down as a leader if converted to a non-voter
If the leader becomes a non-voter after a configuration change,
step down and become a follower.

Non-voting members are an extension to Raft, so the protocol spec does
not define whether they can be leaders. I cannot think of a reason
why they can't, yet I also cannot think of a reason why it would be
useful, so let's forbid this.

We already do not allow non-voters to become candidates, and
they ignore timeout_now RPC (leadership transfer), so they
already can not be elected.
2021-06-11 17:16:50 +03:00
Konstantin Osipov
684e0d2a8c raft: improve configuration consistency checks
Isolate the checks for configuration transitions in a static function,
to be able to unit test outside class server.

Split the condition of transitioning to an empty configuration
from the condition of transitioning into a configuration with
no voters, to produce more user-friendly error messages.

*Allow* transferring leadership in a configuration where
the only voter is the leader itself. This would be equivalent
to syncing the leader's log with the learner and converting
the leader itself to a follower. This is safe, since
the leader will re-elect itself quickly after an election timeout,
and may be used to do a rolling restart of a cluster with
only one voter.

A test case follows.
2021-06-11 17:16:47 +03:00
Konstantin Osipov
3e6fd5705b raft: (testing) test that non-voter stays in PIPELINE mode
Test that configuration changes preserve PIPELINE mode.
2021-06-11 17:07:39 +03:00
Konstantin Osipov
1dfe946c91 raft: (testing) always return fsm_debug in create_follower()
create_follower() is a test helper, so it's OK to return
a test-enabled FSM from it.
This will be used in a subsequent patch/test case.
2021-06-11 12:24:43 +03:00
Alejo Sanchez
ff34a6515d raft: replication test: fix elect_new_leader
Recently, the logic of elect_new_leader was changed to allow the old
leader to vote for the new candidate. But the implementation was wrong,
as it re-connected the old leader in all cases, regardless of whether
the nodes were already disconnected.

First check that both the old leader and the requested new leader are
connected; only then can the old leader participate in the election.

There were occasional hangs in the loop of elect_new_leader because
other nodes besides the candidate were ticked.  This patch fixes the
loop by removing ticks inside of it.

The loop is needed to handle prevote corner cases (e.g. 2 nodes).

While there, also wait for the log on all followers, to avoid a previously
dropped leader becoming a dueling candidate.

And update _leader only if it was changed.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210609193945.910592-3-alejo.sanchez@scylladb.com>
2021-06-10 12:36:25 +02:00
Alejo Sanchez
add12d801d raft: log ignored prevote
Add a log line for ignored prevote.

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210609193945.910592-2-alejo.sanchez@scylladb.com>
2021-06-10 12:33:34 +02:00
Benny Halevy
e0622ef461 compaction_manager: stop_ongoing_compactions: print reason for stopping
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210610084704.388215-1-bhalevy@scylladb.com>
2021-06-10 11:52:57 +03:00
Piotr Sarna
7506f44c77 cql3: use existing constant for max result in indexed statements
The original code that introduced enforcing page limits for indexed
statements created a new constant for the max result size in bytes.
Botond reported that we already have such a constant, so it is now
used instead of reinventing it from scratch.

Closes #8839
2021-06-10 11:08:54 +03:00
Nadav Har'El
b26fcf5567 test/alternator: increase timeouts in test_tracing.py
The query tracing tests in test/alternator's test_tracing.py had one
timeout of 30 seconds to find the trace, and one unclearly-coded timeout
for finding the right content for the trace. We recently saw both
timeouts exceeded in tests, but only rarely and only in debug mode,
in a run 100 times slower than normal.

This patch increases both timeouts to 100 seconds. Whatever happens then,
we win: If the test stops failing, we know the new timeout was enough.
If the test continues to fail, we will be able to conclude that we have a
real bug - e.g., perhaps one of the LWT operations has a bug causing it
to hang indefinitely.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210608205026.1600037-1-nyh@scylladb.com>
2021-06-10 09:19:01 +03:00
Benny Halevy
8ecc626c15 queue_reader_handle: mark copy constructor noexcept
It is trivially so, as std::exception_ptr is nothrow default
constructible.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210609135925.270883-2-bhalevy@scylladb.com>
2021-06-09 20:09:01 +03:00
Benny Halevy
3100cdcc65 queue_reader_handle: move-construct also _ex
We were moving only the other's reader, without the other's
exception (as it may already have been abandoned
or aborted).

While at it, mark the constructor noexcept.

Fixes #8833

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210609135925.270883-1-bhalevy@scylladb.com>
2021-06-09 20:09:01 +03:00
Pavel Emelyanov
990db016e9 transport: Untie transport and database
Both the controller and the server only need database to get the config
from. Since controller creation only happens in main() code, which has
the config itself, we can remove the database dependency from transport.

Previous attempt was not to carry the config down to the server
level, but it stepped on an updateable_value landmine -- the u._v.
isn't copyable cross-shard (despite the docs) and to properly
initialize server's max_concurrent_requests we need the config's
named_value member itself.

The db::config that flies through the stack is const reference, but
its named_values do not get copied along the way -- the updateable
value accepts both references and const references to subscribe on.

tests: start-stop in debug mode

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210607135656.18522-1-xemul@scylladb.com>
2021-06-09 20:04:12 +03:00
Eliran Sinvani
9bfb2754eb dist: rpm: Add specific versioning and python3 dependency
The Red Hat packages were missing two things, first the metapackage
wasn't dependant at all in the python3 package and second, the
scylla-server package dependencies didn't contain a version as part
of the dependency which can cause to some problems during upgrade.
Doing both of the things listed here is a bit of an overkill as either
one of them separately would solve the problem described in #XXXX
but both should be applied in order to express the correct concept.

Fixes #8829

Closes #8832
2021-06-09 20:02:43 +03:00
Asias He
0665d9c346 gossip: Handle nodes removed from live endpoints directly
When a node is removed from the _live_endpoints list directly, e.g., a
node being decommissioned, it is possible the node might not be marked
as down in gossiper::failure_detector_loop_for_node loop before the loop
exits. When the gossiper::failure_detector_loop loop starts again, the
node will not be considered because it is not present in _live_endpoints
list any more. As a result, the node will not be marked as down though
gossiper::failure_detector_loop_for_node loop.

To fix, we mark the nodes that are removed from _live_endpoints
lists as down in the gossiper::failure_detector_loop loop.

Fixes #8712

Closes #8770
2021-06-09 15:02:25 +02:00
Tomasz Grabiec
419ee84d86 Merge "sstable: validate first and last keys ordering" from Benny
In #8772, an assert validating first token <= last token
failed in leveled_manifest::overlapping.

It is unclear how we got to that state, so add validation
in sstable::set_first_and_last_keys() that the to-be-set
first and last keys are well ordered.
Otherwise, throw malformed_sstable_exception.

set_first_and_last_keys is called both on the write path
from the sstable writer before the sstable is sealed,
and on the open/load path via update_info_for_opened_data().

This series also fixes issues with unit tests with
regards to first/last keys so they won't fail the
validation.

Refs #8772

Test: unit(dev)
DTest: next-gating(dev), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)

* tag 'validate-first-and-last-keys-ordering-v1':
  sstable: validate first and last keys ordering
  test: lib: reusable_sst: save unexpected errors
  test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
  test: sstable_test: define primary key in schema for compressed sstable
2021-06-09 14:43:02 +02:00
Avi Kivity
a57d8eef49 Merge 'streaming: make_streaming_consumer: close reader on errors' from Benny Halevy
Currently, if e.g. find_column_family throws an error,
as seen in #8776 when the table was dropped during repair,
the reader is not closed.

Use a coroutine to simplify error handling and
close the reader if an exception is caught.

Also, catch an error inside the lambda passed to make_interposer_consumer
when making the shared_sstable for streaming, and close the reader
there and return an exceptional future early, since
the reader will not be moved to sst->write_components, which assumes
ownership over it and closes it in all cases.

Fixes #8776

Test: unit(dev)
DTest: repair_additional_test.py:RepairAdditionalTest.repair_while_table_is_dropped_test (dev, debug) w/ https://github.com/scylladb/scylla/pull/8635#issuecomment-856661138

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #8782

* github.com:scylladb/scylla:
  streaming: make_streaming_consumer: close reader on errors
  streaming: make_streaming_consumer: coroutinize returned function
2021-06-09 15:02:36 +03:00
Tomasz Grabiec
ce7a404f17 Merge "Cleanups/refactoring for Raft Group 0" from Kostja
* scylla-dev/raft-group-0-part-1-rebase:
  raft: (service) pass Raft service into storage_service
  raft: (service) add comments for boot steps
  raft: add ordering for raft::server_address based on id
  raft: (internal) simplify construction of tagged_id
  raft: (internal) tagged_id minor improvements
2021-06-09 10:48:05 +02:00
Avi Kivity
d2157dfea7 Merge 'locator: token_metadata: simplify tokens_iterator' from Michał Chojnowski
`ring_range()`/`tokens_iterator` are more complicated than they need to be. The `include_min` parameter is not used anywhere, and `tokens_iterator` is pimplified without a good reason. Simplify that.

Closes #8805

* github.com:scylladb/scylla:
  locator: token_metadata: depimplify tokens_iterator
  locator: token_metadata: remove _ring_pos from tokens_iterator_impl
  locator: token_metadata: remove tokens_end()
  locator: token_metadata: remove `include_min` from tokens_iterator_impl
  locator: token_metadata: remove the `include_min` parameter from `ring_range()`
2021-06-08 15:42:41 +03:00
Konstantin Osipov
267a8e99ad raft: (service) pass Raft service into storage_service
Raft group 0 initialization and configuration changes
should be integrated with Scylla cluster assembly,
happening when starting the storage service and joining
the cluster. Prepare for this.

Since the Raft service depends on the query processor, and the query
processor depends on the storage service, break the dependency
loop by splitting Raft initialization into two steps: first, an
under-constructed instance of the "sharded" Raft service is started,
accepting an under-constructed instance of the "sharded"
query_processor, and is passed into the storage service start
function; then, once the query processor starts, the local state of
Raft groups is loaded from system tables.

Consistently abbreviate the raft_services instance as raft_svcs, as
is the convention in Scylla.

Update the tests.
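The two-step start-up described above can be sketched as follows (all names and the exact wiring are illustrative assumptions, not Scylla's real interfaces):

```cpp
// Step 1: the Raft service is started against a not-yet-started query
// processor, so other services can already hold a reference to it.
// Step 2: loading Raft group state from system tables is deferred until
// the query processor is actually up, breaking the dependency loop.
struct query_processor {
    bool started = false;
};

struct raft_service {
    query_processor* qp = nullptr;
    bool state_loaded = false;

    // step 1: accept an under-constructed query processor
    void start(query_processor& q) { qp = &q; }

    // step 2: load local Raft group state once the query processor runs
    void load_local_state() {
        if (qp && qp->started) {
            state_loaded = true;
        }
    }
};
```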
2021-06-08 14:52:32 +03:00
Konstantin Osipov
959bd21cdb raft: (service) add comments for boot steps 2021-06-08 14:52:32 +03:00
Konstantin Osipov
b81580f3c6 raft: add ordering for raft::server_address based on id 2021-06-08 14:52:32 +03:00
Konstantin Osipov
d42d5aee8c raft: (internal) simplify construction of tagged_id
Make it easy to construct tagged_id from UUID.
2021-06-08 14:52:32 +03:00
Konstantin Osipov
c9a23e9b8a raft: (internal) tagged_id minor improvements
Introduce a syntax helper tagged_id::create_random_id(),
used to create a new Raft server or group id.

Provide a default ordering for tagged ids, for use
in Raft leader discovery, which selects the smallest
id for leader.
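A sketch of how such an ordering can be used for leader discovery (the uint64_t stand-in for a UUID-based id and the function names are assumptions for brevity):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Stand-in for a tagged id; real Raft server/group ids wrap a UUID.
struct tagged_id {
    uint64_t id;
    bool operator<(const tagged_id& o) const { return id < o.id; }
};

// With a default ordering defined, leader discovery can select the
// smallest id with a single std::min_element call.
tagged_id pick_leader(const std::vector<tagged_id>& servers) {
    return *std::min_element(servers.begin(), servers.end());
}
```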
2021-06-08 14:52:32 +03:00
Benny Halevy
5a8531c4c8 repair: get_sharder_for_tables: throw no_such_column_family
Instead of throwing a std::runtime_error with a message that
resembles no_such_column_family, throw a
no_such_column_family constructed from the keyspace and table uuid.

The latter can be explicitly caught and handled if needed.
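The benefit of the typed exception can be sketched as follows (the constructor signature and message format are assumptions, not Scylla's actual ones):

```cpp
#include <stdexcept>
#include <string>

// A dedicated exception type carrying the keyspace and table uuid can be
// caught explicitly by callers, unlike a plain std::runtime_error.
struct no_such_column_family : std::runtime_error {
    no_such_column_family(const std::string& ks, const std::string& uuid)
        : std::runtime_error("no such column family " + ks + ":" + uuid) {}
};

// Callers can now handle exactly this failure and nothing else.
std::string describe_failure(const std::string& ks, const std::string& uuid) {
    try {
        throw no_such_column_family(ks, uuid);
    } catch (const no_such_column_family& e) {
        return e.what();
    }
}
```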

Refs #8612

Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210608113605.91292-1-bhalevy@scylladb.com>
2021-06-08 14:45:44 +03:00
Nadav Har'El
355dbf2140 test/cql-pytest: option for running the tests over SSL
This patch adds a "--ssl" option to test/cql-pytest's pytest, as well as
to the run script test/cql-pytest/run. When "test/cql-pytest/run --ssl"
is used, Scylla is started listening for encrypted connections on its
standard port (9042) - using a temporary unsigned certificate. Then, the
individual tests connect to this encrypted port using TLSv1.2 (Scylla
doesn't support earlier versions of SSL) instead of plain TCP.

This "--ssl" feature allows writing tests that stress various aspects of
the connection (e.g., oversized requests - see PR #8800), and then
running those tests in both TCP and SSL modes.

Fixes #8811

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210607200329.1536234-1-nyh@scylladb.com>
2021-06-08 11:43:20 +02:00