Commit Graph

1629 Commits

Author SHA1 Message Date
Avi Kivity
8d6e575f59 perf_fast_forward: report instructions per fragment
Use a hardware counter to report instructions per fragment. Results
vary from ~4k insns/f when reading sequentially to more than 1M insns/f.

Instructions per fragment can be a more stable metric than frags/sec.
It would probably be even more stable with a fake file implementation
that works in-memory to eliminate seastar polling instruction variation.

Closes #8660
2021-05-17 11:33:24 +02:00
Tomasz Grabiec
8dddfab5db Merge 'db/virtual tables: Add infrastructure + system.status example table' from Piotr Wojtczak
This is the 1st PR in series with the goal to finish the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). Virtual tables created within this framework are "materialized" in memtables, so current solution is for small tables only. As an example system.status was added. It was checked that DISTINCT and reverse ORDER BY do work.

This PR was created by @jul-stas and @StarostaGit
Fixes #8343

This is the same as #8364, but with a compilation fix (newly added `close()` method was not implemented by the reader)

Closes #8634

* github.com:scylladb/scylla:
  boost/tests: Add virtual_table_test for basic infrastructure
  boost/tests: Test memtable_filling_virtual_table as mutation_source
  db/system_keyspace: Add system.status virtual table
  db/virtual_table: Add a way to specify a range of partitions for virtual table queries.
  db/virtual_table: Introduce memtable_filling_virtual_table
  db: Add virtual tables interface
  db: Introduce chained_delegating_reader
2021-05-17 11:29:37 +02:00
Benny Halevy
f4cfa530cc perf: enable instructions_retired_counter only once per executor::run
Enabling it for each run_worker call will invoke ioctl
PERF_EVENT_IOC_ENABLE in parallel to other workers running
and this may skew the results.

Test: perf_simple_query
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210514130542.301168-1-bhalevy@scylladb.com>
2021-05-16 12:13:27 +03:00
Tomasz Grabiec
28ac8d0f2b Merge "raft: randomized_nemesis_test framework" from Kamil
We introduce `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could
come up with.  Represented by a C++ concept, it consists of: a set of
inputs (represented by the `input_t` type), outputs (`output_t` type),
states (`state_t`), an initial state (`init`) and a transition
function (`delta`) which given a state and an input returns a new
state and an output.

The rest of the testing infrastructure is going to be generic
w.r.t. `PureStateMachine`. This will allow easily implementing tests
using both simple and complex state machines by substituting the
proper definition for this concept.

Next comes `logical_timer`: it is a wrapper around
`raft::logical_clock` that allows scheduling events to happen after a
certain number of logical clock ticks.  For example,
`logical_timer::sleep(20_t)` returns a future that resolves after 20
calls to `logical_timer::tick()`. It will be used to introduce
timeouts in the tests, among other things.

To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.

`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.

The interface also requires maintainance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `persistence`.

Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:

1. Before sending a command to a Raft leader through `server::add_entry`,
   one must first directly contact the instance of `impure_state_machine`
   replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
   (represented by a promise-future pair) and a unique ID; it stores the
   input side of the channel (the promise) with this ID internally and returns
   the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
   and sends it as a command to Raft. Thus commands are (ID, machine input)
   pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
   with the given ID. If it finds one, it sends the output through this
   channel.
5. The command sender waits for the output on the obtained future.

The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.

Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.

A failure may cause the output to never arrive (or even the command to
never be applied) so `call` waits for a limited time. It may also
mistakenly `call` a server which is not currently the leader, but it
is prepared to handle this error.

We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.

The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receival of messages in the
`receive` function. It defines the message type (`message_t`) that will
 be used by the message-passing method.

The actual message passing is implemented with `network` and `delivery_queue`.

The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The message
receival function on the other side, when it receives the snapshot message,
applies the snapshot and sends back a snapshot reply message that contains
the same ID. When we receive a snapshot reply message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.

`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.

`persistence` represents the data that does not get lost between server
crashes and restarts.

We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.

Additionally to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc`.  We ensure that the stored log either ``touches''
the stored snapshot on the right side or intersects it.

In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.

Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.

`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.

`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and payload. The queue uses a logical clock to decide
when to deliver messages; it delivers are messages whose associated
times are smaller than the current time.

The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.

The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.

That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.

`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.

`environment` represents a set of `raft_server`s connected by a `network`.

The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.

Needs to be periodically `tick()`ed which ticks the network
and underlying servers.

`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.

Finally, we add a simple test that serves as an example of using the
implemented framework. We introduce `ExRegister`, an implementation
of `PureStateMachine` that stores an `int32_t` and handles ``exchange''
and ``read'' inputs; an exchange replaces the state with the given value
and returns the previous state, a read does not modify the state and returns
the current state.  In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.

* kbr/randomized-nemesis-test-v5:
  raft: randomized_nemesis_test: basic test
  raft: randomized_nemesis_test: ticker
  raft: randomized_nemesis_test: environment
  raft: randomized_nemesis_test: server
  raft: randomized_nemesis_test: delivery queue
  raft: randomized_nemesis_test: network
  raft: randomized_nemesis_test: heartbeat-based failure detector
  raft: randomized_nemesis_test: memory backed persistence
  raft: randomized_nemesis_test: rpc
  raft: randomized_nemesis_test: impure_state_machine
  raft: randomized_nemesis_test: introduce logical_timer
  raft: randomized_nemesis_test: `PureStateMachine` concept
2021-05-14 17:33:40 +02:00
Tomasz Grabiec
0fdd2f8217 Merge "raft: fsm cleanups" from Gleb
* scylla-dev/raft-cleanup-v1:
  raft: drop _leader_progress tracking from the tracker
  raft: move current_leader into the follower state
  raft: add some precondition checks
2021-05-14 17:24:59 +02:00
Kamil Braun
c21311ecca raft: randomized_nemesis_test: basic test
This is a simple test that serves as an example of using the
framework implemented in the previous commits. We introduce
`ExRegister`, an implementation of `PureStateMachine` that stores
an `int32_t` and handles ``exchange'' and ``read'' inputs;
an exchange replaces the state with the given value and returns
the previous state, a read does not modify the state and returns
the current state.  In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.
2021-05-14 15:11:01 +02:00
Kamil Braun
66b9bc6fe1 raft: randomized_nemesis_test: ticker
`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.

The commit also introduces a `with_env_and_ticker` helper function which
creates an `environment`, a `ticker`, and passes references to them to
the given function. It destroys them after the function finishes
by calling `abort()`.
2021-05-14 15:11:01 +02:00
Kamil Braun
c7cef58797 raft: randomized_nemesis_test: environment
`environment` represents a set of `raft_server`s connected by a `network`.

The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.

Needs to be periodically `tick()`ed which ticks the network
and underlying servers.

New servers can be created in the environment by calling `new_server`.
2021-05-14 15:11:01 +02:00
Kamil Braun
5095a4158e raft: randomized_nemesis_test: server
`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.
2021-05-14 15:11:01 +02:00
Kamil Braun
f139fd4c28 raft: randomized_nemesis_test: delivery queue
The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.

That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.
2021-05-14 15:11:01 +02:00
Kamil Braun
2956f5f76c raft: randomized_nemesis_test: network
`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and payload. The queue uses a logical clock to decide
when to deliver messages; it delivers are messages whose associated
times are smaller than the current time.

The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.
2021-05-14 15:11:01 +02:00
Kamil Braun
3068a0aa70 raft: randomized_nemesis_test: heartbeat-based failure detector
In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.

Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.

`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.
2021-05-14 15:11:01 +02:00
Kamil Braun
51df600478 raft: randomized_nemesis_test: memory backed persistence
`persistence` represents the data that does not get lost between server
crashes and restarts.

We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.

Additionally to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc` coming in a later commit.  We ensure that the stored
log either ``touches'' the stored snapshot on the right side or intersects it.
2021-05-14 15:11:01 +02:00
Kamil Braun
7a1f6e6d7b raft: randomized_nemesis_test: rpc
We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.

The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receival of messages in the
`receive` function. It defines the message type (`message_t`) that will
 be used by the message-passing method.

The actual message passing is implemented with `network` and `delivery_queue`
which are introduced in later commits.

The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The message
receival function on the other side, when it receives the snapshot message,
applies the snapshot and sends back a snapshot reply message that contains
the same ID. When we receive a snapshot reply message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.

`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.
2021-05-14 15:11:01 +02:00
Kamil Braun
905126acc3 raft: randomized_nemesis_test: impure_state_machine
To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.

`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.

The interface also requires maintainance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `raft::persistence`
coming with a later commit.

Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:

1. Before sending a command to a Raft leader through `server::add_entry`,
   one must first directly contact the instance of `impure_state_machine`
   replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
   (represented by a promise-future pair) and a unique ID; it stores the
   input side of the channel (the promise) with this ID internally and returns
   the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
   and sends it as a command to Raft. Thus commands are (ID, machine input)
   pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
   with the given ID. If it finds one, it sends the output through this
   channel.
5. The command sender waits for the output on the obtained future.

The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.

Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.

A failure may cause the output to never arrive (or even the command to
never be applied) so `call` waits for a limited time. It may also
mistakenly `call` a server which is not currently the leader, but it
is prepared to handle this error.
2021-05-14 15:11:01 +02:00
Kamil Braun
3e02befccd raft: randomized_nemesis_test: introduce logical_timer
This is a wrapper around `raft::logical_clock` that allows scheduling
events to happen after a certain number of logical clock ticks.
For example, `logical_timer::sleep(20_t)` returns a future that resolves
after 20 calls to `logical_timer::tick()`.
2021-05-13 11:34:00 +02:00
Kamil Braun
15e3bd2620 raft: randomized_nemesis_test: PureStateMachine concept
The commit introduces `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could come up with.
Represented by a C++ concept, it consists of: a set of inputs
(represented by the `input_t` type), outputs (`output_t` type), states (`state_t`),
an initial state (`init`) and a transition function (`delta`) which
given a state and an input returns a new state and an output.

The rest of the testing infrastructure is going to be
generic w.r.t. `PureStateMachine`. This will allow easily implementing
tests using both simple and complex state machines by substituting the
proper definition for this concept.

One possibility of modifying this definition would be to have `delta`
return `future<pair<state_t, output_t>>` instead of
`pair<state_t, output_t>`. This would lose some ``purity'' but allow
long computations without reactor stalls in the tests. Such modification,
if we decide to do it, is trivial.
2021-05-13 11:34:00 +02:00
Alejo Sanchez
68f69671b5 raft: style: test optionals directly
Avoid using has_value() and test optional directly

Signed-off-by: Alejo Sanchez <alejo.sanchez@scylladb.com>
Message-Id: <20210512142018.297203-2-alejo.sanchez@scylladb.com>
2021-05-12 20:39:52 +02:00
Piotr Wojtczak
e6254acfd3 boost/tests: Add virtual_table_test for basic infrastructure 2021-05-12 17:05:35 +02:00
Piotr Wojtczak
8825ae128d boost/tests: Test memtable_filling_virtual_table as mutation_source
Uses the infrastructure for testing mutation_sources, but only a
subset of it which does not do fast forwarding (since virtual_table
does not support it).
2021-05-12 17:05:35 +02:00
Tomasz Grabiec
a9dd7a295d tests: perf_row_cache_reads: Add scenario for lots of rows covered by a range tombstone
Reproduces #8626.

Output:

    test_scan_with_range_delete_over_rows
    Populating with rows
    Rows: 702710
    Scanning...
    read: 540.007324 [ms], preemption: {count: 2356, 99%: 1.131752 [ms], max: 1.148589 [ms]}, cache: 251/252 [MB]
    read: 651.942688 [ms], preemption: {count: 1176, 99%: 1.131752 [ms], max: 1.009652 [ms]}, cache: 251/252 [MB]
2021-05-12 11:58:36 +02:00
Nadav Har'El
cee4c075d2 Merge 'Fix index name conflicts with regular tables' from Piotr Sarna
When an index is created without an explicit name, a default name
is chosen. However, there was no check if a table with conflicting
name already exists. The check is now in place and if any conflicts
are found, a new index name is chosen instead.
When an index is created *with* an explicit name and a conflicting
regular table is found, index creation should simply fail.

This series comes with a test.

Fixes #8620
Tests: unit(release)

Closes #8632

* github.com:scylladb/scylla:
  cql-pytest: add regression tests for index creation
  cql3: fail to create an index if there is a name conflict
  database: check for conflicting table names for indexes
2021-05-11 18:40:15 +03:00
Benny Halevy
9ba960a388 utils: phased_barrier::operation do not leak gate entry when reassigned
utils::phased_barrier holds a `lw_shared_ptr<gate>` that is
typically `enter()`ed in `phased_barrier::start()`,
and left when the operation is destroyed in `~operation`.

Currently, the operation move-assign implementation is the
default one that just moves the lw_shared gate ptr from the
other operation into this one, without calling `_gate->leave()` first.

This change first destroys *this when move-assigned (if not self)
to call _gate->leave() if engaged, before reassigning the
gate with the other operation::_gate.

A unit test that reproduces the issue before this change
and passes with the fix was added to serialized_action_test.

Fixes #8613

Test: unit(dev), serialized_action_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210510120703.1520328-1-bhalevy@scylladb.com>
2021-05-11 18:39:10 +03:00
Avi Kivity
1d8234f52d Merge "reader_concurrency_semaphore: improve diagnostics printout" from Botond
"
The current printout is has multiple problems:
* It is segregated by state, each having its own sorting criteria;
* Number of permits and count resources is collapsed in to a single
  column, not clear which is the one printed.
* Number of available/initial units of the semaphore are not printed;

This series solves all this problems:
* It merges all states into a single table, sorted by memory
  consumption, in descending order.
* It separates number of permits and count resources into separate
  columns.
* Prints a summary of the semaphore units.
* Provides a cap on the maximum amount of printable lines, to not blow
  up the logs.

The goal of all this is to make it easy to find the culprit a semaphore
problem: easily spot the big memory consumers, then unpack the name
column to determine which table and code path is responsible.
This brings the printout close to the recently `scylla reads`
scylla-gdb.py command, providing a uniform report format across the two
tools.
Example report:
INFO  2021-05-07 09:52:16,806 [shard 0] testlog - With max-lines=4: Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/2147483647 count and 263599186/9223372036854775807 memory resources: user request, dumping permit diagnostics:
permits count   memory  table/description/state
7       2       77M     ks.tbl1/op1/active
6       3       59M     ks.tbl1/op0/active
4       0       36M     ks.tbl1/op2/active
3       1       36M     ks.tbl0/op2/active
11      2       43M     permits omitted for brevity

31      8       251M    total
"

* 'reader-concurrency-semaphore-dump-improvement/v1' of https://github.com/denesb/scylla:
  test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics
  reader_concurrency_semaphore: dump_reader_diagnostics(): print more information in the header
  reader_concurrency_semaphore: dump_reader_diagnostics(): cap number of printed lines
  reader_concurrency_semaphore: dump_reader_diagnostics(): sort lines in descending order
  reader_concurrency_semaphore: dump_reader_diagnostics(): merge all states into a single table
  reader_concurrency_semaphore: dump_reader_diagnostics(): separate number of permits and count resources
2021-05-11 18:39:10 +03:00
Nadav Har'El
af485f5226 secondary index: fix index name in IndexInfo system table
In commit 3e39985c7a we added the Cassandra-compatible system table
system."IndexInfo" (note the capitalized table name) which lists built
indexes. Because we already had a table of built materialized views, and
indexes are implemented as materialized views, the index list was
implemented as a virtual table based on the view list.

However, the *name* of each materialized view listed in the list of
views looks like something_index, with the suffix "_index", while the
name of the table we need to print is "something". We forgot to do this
transformation in the virtual table - and this is what this patch does.

This bug can confuse applications which use this system table to wait for
an index to be built. Several tests translated from Cassandra's unit
tests, in cassandra_tests/validation/entities/secondary_index_test.py fail
in wait_for_index() because of this incompatibility, and pass after this
patch.

This patch also changes the unit test that enshrined the previous, wrong,
behavior, to test for the correct behavior. This problem is typical of
C++ unit tests which cannot be run against Cassandra.

Fixes #8600

Unfortunately, although this patch fixes "typical" applications (including
all tests which I tried) - applications which read from IndexInfo in a
"typical" method to look for a specific index being ready, the
implementation is technically NOT correct: The problem is that index
names are not sorted in the right order, because they are sorted with
the "_index" prefix.
To give an example, the index names "a" should be listed before "a1", but
the view names "a1_index" comes before "a_index" (because in ASCII, 1
comes before underscore). I can't think of any way to fix this bug
without completely reimplementing IndexInfo in a different way - probably
based on a temporary memtable (which is fine as this is not a
performance-critical operation). We'll need to do this rewrite eventually,
and I'll open a new issue.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210509140113.1084497-1-nyh@scylladb.com>
2021-05-11 18:39:10 +03:00
Avi Kivity
61c7f874cc Merge 'Add per-service-level timeouts' from Piotr Sarna
Ref: #7617

This series adds timeout parameters to service levels.

Per-service-level timeouts can be set up in the form of service level parameters, which can in turn be attached to roles. Setting up and modifying role-specific timeouts can be achieved like this:
```cql
CREATE SERVICE LEVEL sl2 WITH read_timeout = 500ms AND write_timeout = 200ms AND cas_timeout = 2s;
ATTACH SERVICE LEVEL sl2 TO cassandra;
ALTER SERVICE LEVEL sl2 WITH write_timeout = null;
```
Per-service-level timeouts take precedence over default timeout values from scylla.yaml, but can still be overridden for a specific query by per-query timeouts (e.g. `SELECT * from t USING TIMEOUT 50ms`).

Closes #7913

* github.com:scylladb/scylla:
  docs: add a paragraph describing service level timeouts
  test: add per-service-level timeout tests
  test: add refreshing client state
  transport: add updating per-service-level params
  client_state: allow updating per service level params
  qos: allow returning combined service level options
  qos: add a way of merging service level options
  cql3: add preserving default values for per-sl timeouts
  qos: make getting service level public
  qos: make finding service level public
  treewide: remove service level controller from query state
  treewide: propagate service level to client state
  sstables: disambiguate boost::find
  cql3: add a timeout column to LIST SERVICE LEVEL statement
  db: add extracting service level info via CQL
  types: add a missing translation for cql_duration
  cql3: allow unsetting service level timeouts
  cql3: add validating service level timeout values
  db: add setting service level params via system_distributed
  cql3: add fetching service level attrs in ALTER and CREATE
  cql3: add timeout to service level params
  qos: add timeout to service level info
  db,sys_dist_ks: add timeout to the service level table
  migration_manager: allow table updates with timestamp
  cql3: allow a null keyword for CQL properties
2021-05-11 18:39:10 +03:00
Michael Livshin
ff7d781988 test: enable scylla-gdb/run
It should pass now.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Michael Livshin
73f9f08df6 test: add a basic test for scylla-gdb.py
(And disable it initially, because it won't pass without subsequent
commits)

Runs only in release mode, to keep things more realistic.

Doesn't exercise Scylla much at present -- just stops it after several
compactions and tries (almost) all "scylla *" commands in order.

Refs #6952.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Michael Livshin
3bff94cd29 test.py: refine test mode control
* Add ability to skip tests in individual modes using "skip_in_<mode>".

* Add ability to allow tests in specific modes using "run_in_<mode>".

* Rename "skip_in_debug_mode" to "skip_in_debug_modes", because there
  is an actual mode named "debug" and this is confusing.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-05-11 18:39:10 +03:00
Avi Kivity
b1f9df279a Merge "Untie cdc, storage service and migration notifier knot" from Pavel E
"
Storage service needs migration notifier reference to pass it to cdc
service via get_local_storage_service(). This set removes

- get_local_storage_service from cdc
- migration notifier from storage service
- db_context::builder from cdc (released nuclear binding energy)

tests: unit(dev)
"

* 'br-cdc-no-storage-service' of https://github.com/xemul/scylla:
  storage_service: Remove migration notifier dependency
  cdc: Remove db_context::builder
  cdc: Provide migration notifier right at once
  cdc: Remove db_context::builder::with_migration_notifier
2021-05-11 18:39:10 +03:00
Piotr Sarna
1cb804f024 cql-pytest: add regression tests for index creation
This commit adds unit tests for an issue with index creation
after a table with malicious name is previously created as well.
The cases cover both indexes with a default name and the ones with
explicit name set.
2021-05-11 17:34:37 +02:00
Botond Dénes
69d04d161e test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics
Not really testing anything, at least not automatically. It just
provides coverage for the diagnostics dump code, as well as allows for
developers to inspect the printout visually when making changes.
2021-05-10 18:06:30 +03:00
Piotr Sarna
570c63d39b test: add per-service-level timeout tests
The test suite checks if per-service-level timeouts
work and validate their input.
2021-05-10 12:39:41 +02:00
Piotr Sarna
43f1f9e445 test: add refreshing client state
With a helper client state refresher, some attributes
which are usually only refreshed after a client disconnects
and then reconnects, can be verified in the test suite.
2021-05-10 12:39:41 +02:00
Piotr Sarna
e257ec11c0 treewide: remove service level controller from query state
... since it's accessible through its member, client state.
2021-05-10 11:48:14 +02:00
Piotr Sarna
d1f2e8b469 treewide: propagate service level to client state
... since it's going to be used to set up per-service-level
timeouts.
2021-05-10 11:48:14 +02:00
Piotr Sarna
e8d271fea9 db: add extracting service level info via CQL 2021-05-10 11:45:09 +02:00
Piotr Sarna
7e6beabf27 migration_manager: allow table updates with timestamp
In order to avoid needless schema disagreements, a way of announcing
a schema change with fixed timestamp is added.
That way, when nodes update schemas of their internal tables (e.g.
during updates), it's possible for all nodes to use an identical
timestamp for this operation, which in turn makes their digests
identical.
2021-05-10 10:10:38 +02:00
Gleb Natapov
78c5a72b32 raft: drop _leader_progress tracking from the tracker
The tracker maintains a separate pointer to current leader progress,
but all this complexity is not needed because the tracker already have
find() function that can either find a leader's progress by id or return
null. Removing the tracking simplifies code and make going out of sync
(which is always a possibility if a state is maintained in two different
places) impossible.
2021-05-09 13:55:55 +03:00
Raphael S. Carvalho
8480839932 LCS/reshape: Don't reshape single sstable in level 0 with strict mode
With strict mode, it could happen that a sstable alone in level 0 is
selected for offstrategy compaction, which means that we could run
into an infinite reshape process.

This is fixed by respecting the offstrategy threshold. Unit test is
added.

Fixes #8573.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210506181324.49636-1-raphaelsc@scylladb.com>
2021-05-09 11:09:54 +03:00
Tomasz Grabiec
abe3d7d7d3 Merge 'storage_proxy: use small_vector for vectors of inet_address' from Avi Kivity
storage_proxy uses std::vector<inet_address> for small lists of nodes - for replication (often 2-3 replicas per operation) and for pending operations (usually 0-1). These vectors require an allocation, sometimes more than one if reserve() is not used correctly.

This series switches storage_proxy to use utils::small_vector instead, removing the allocations in the common case.

Test results (perf_simple_query --smp 1 --task-quota-ms 10):

```
before: median 184810.98 tps ( 91.1 allocs/op,  20.1 tasks/op,   54564 insns/op)
after:  median 192125.99 tps ( 87.1 allocs/op,  20.1 tasks/op,   53673 insns/op)
```

4 allocations and ~900 instructions are removed (the tps figure is also improved, but it is less reliable due to cpu frequency changes).

The type change is unfortunately not contained in storage_proxy - the abstraction leaks to providers of replica sets and topology change vectors. This is sad but IMO the benefits make it worthwhile.

I expect more such changes can be applied in storage_proxy, specifically std::unordered_set<gms::inet_address> and vectors of response handles.

Closes #8592

* github.com:scylladb/scylla:
  storage_proxy, treewide: use utils::small_vector inet_address_vector:s
  storage_proxy, treewide: introduce names for vectors of inet_address
  utils: small_vector: add print operator for std::ostream
  hints: messages.hh: add missing #include
2021-05-06 18:00:54 +02:00
Tomasz Grabiec
6aec8cc447 Merge "raft: fixes and improvements for snapshot transfer" from Gleb
* scylla-dev/raft-snapshot-fixes-v4:
  raft: document that add entry my throw commit_status_unknown
  raft: test: add test of a leadership change during ongoing snapshot transfer
  raft: test: retry submitting an entry if it was dropped
  raft: test: wait for the log to be fully replicated on new leader only
  raft: drop waiters with outdated terms
  raft: make snapshot transfer abortable
  raft: accept snapshots transfer from multiple nodes simultaneously
  raft: do not send probes while transferring snapshot
  raft: handle messages sending errors
  raft: test: return error from rpc module if nodes are disconnected
  raft: fix a typo in a variable name
2021-05-06 17:44:22 +02:00
Gleb Natapov
3a1bff26dd raft: test: add test of a leadership change during ongoing snapshot transfer 2021-05-06 11:34:31 +03:00
Gleb Natapov
612e0f08c4 raft: test: retry submitting an entry if it was dropped 2021-05-06 11:34:31 +03:00
Gleb Natapov
0b2c9c549a raft: test: wait for the log to be fully replicated on new leader only
When forcing new leader it should be enough to wait for log to be fully
replicated to that particular leader.
2021-05-06 11:34:31 +03:00
Gleb Natapov
6abe2772dc raft: make snapshot transfer abortable
A snapshot transfer may take a lot of time and meanwhile a leader doing
it may lose the leadership. If that happens the ongoing snapshot transfer
becomes obsolete since the snapshot will be rejected by the receiving
node as coming from an old leader. Make snapshot transfer abortable and
abort them when leader changes.
2021-05-06 11:34:31 +03:00
Gleb Natapov
d0ebd79deb raft: test: return error from rpc module if nodes are disconnected
Returning an error when nodes are disconnected closer resembles what
will happen in real networking.
2021-05-06 11:34:31 +03:00
Botond Dénes
c872a963b6 test: move reader_concurrency_semaphore related tests into separate file
The mutation_reader_test is already one of our largest test files.
Move the reader concurrency semaphore related tests to a new file,
making them easier to find making the mutation reader test a little bit
smaller too.
2021-05-06 08:59:47 +03:00
Botond Dénes
5f217b6dee test: mutation_reader_test: convert restricted reader tests to semaphore tests
These two tests (restricted_reader_timeout and
restricted_reader_max_queue_length) are testing the semaphore in
reality, but through the restricted reader, which is distracting as it
needlessly brings in an additional layer into the picture. Rewrite them
to test the semaphore directly, getting much lighter in the process.
2021-05-06 08:57:12 +03:00
Avi Kivity
cea5493cb7 storage_proxy, treewide: introduce names for vectors of inet_address
storage_proxy works with vectors of inet_addresses for replica sets
and for topology changes (pending endpoints, dead nodes). This patch
introduces new names for these (without changing the underlying
type - it's still std::vector<gms::inet_address>). This is so that
the following patch, that changes those types to utils::small_vector,
will be less noisy and highlight the real changes that take place.
2021-05-05 18:36:48 +03:00