This is the 1st PR in a series with the goal of finishing the hackathon project authored by @tgrabiec, @kostja, @amnonh and @mmatczuk (improved virtual tables + function call syntax in CQL). Virtual tables created within this framework are "materialized" in memtables, so the current solution is for small tables only. As an example, system.status was added. It was checked that DISTINCT and reverse ORDER BY do work.
This PR was created by @jul-stas and @StarostaGit
Fixes #8343
This is the same as #8364, but with a compilation fix (newly added `close()` method was not implemented by the reader)
Closes #8634
* github.com:scylladb/scylla:
boost/tests: Add virtual_table_test for basic infrastructure
boost/tests: Test memtable_filling_virtual_table as mutation_source
db/system_keyspace: Add system.status virtual table
db/virtual_table: Add a way to specify a range of partitions for virtual table queries.
db/virtual_table: Introduce memtable_filling_virtual_table
db: Add virtual tables interface
db: Introduce chained_delegating_reader
This flag is not really needed, because we can just attempt a resume on
first use which will fail with the default constructed inactive read
handle and the reader will be created via the recreate-after-evicted
path.
This allows the same path to be used for all reader creation cases,
simplifying the logic and more importantly making further patching
easier without the special case.
To make the recreate path (almost) as cheap for the first reader
creation as it was with the special path, `_trim_range_tombstones` and
`_validate_partition_key` are only set when really needed.
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210514141511.127735-1-bdenes@scylladb.com>
We now have close() which is expected to clean up, so there is no need for
cleanup in the destructor and consequently no need for a destructor at all.
Message-Id: <20210514112349.75867-1-bdenes@scylladb.com>
Enabling it for each run_worker call will invoke the
PERF_EVENT_IOC_ENABLE ioctl in parallel with other running workers,
which may skew the results.
Test: perf_simple_query
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210514130542.301168-1-bhalevy@scylladb.com>
We introduce `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could
come up with. Represented by a C++ concept, it consists of: a set of
inputs (represented by the `input_t` type), outputs (`output_t` type),
states (`state_t`), an initial state (`init`) and a transition
function (`delta`) which given a state and an input returns a new
state and an output.
The rest of the testing infrastructure is going to be generic
w.r.t. `PureStateMachine`. This will allow easily implementing tests
using both simple and complex state machines by substituting the
proper definition for this concept.
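To make the shape of this concept concrete, here is a minimal illustrative
sketch (member names follow the description above; the test's actual
definition may differ in details):
```cpp
#include <concepts>
#include <utility>

// Illustrative sketch only: a state machine exposes input/output/state types,
// an initial state `init` and a transition function `delta` that maps
// (state, input) to (new state, output).
template <typename M>
concept PureStateMachine = requires (typename M::state_t s, typename M::input_t i) {
    typename M::input_t;
    typename M::output_t;
    typename M::state_t;
    { M::init } -> std::convertible_to<typename M::state_t>;
    { M::delta(s, i) } -> std::same_as<std::pair<typename M::state_t, typename M::output_t>>;
};
```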
Next comes `logical_timer`: it is a wrapper around
`raft::logical_clock` that allows scheduling events to happen after a
certain number of logical clock ticks. For example,
`logical_timer::sleep(20_t)` returns a future that resolves after 20
calls to `logical_timer::tick()`. It will be used to introduce
timeouts in the tests, among other things.
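The gist of `logical_timer` can be sketched as follows; this is an
illustration only, using `std::promise` in place of Seastar futures and a
plain counter in place of `raft::logical_clock`:
```cpp
#include <cstdint>
#include <future>
#include <map>
#include <utility>

// Illustrative sketch: sleep(d) returns a future that resolves after `d`
// subsequent calls to tick().
struct logical_timer_sketch {
    uint64_t _now = 0;
    std::multimap<uint64_t, std::promise<void>> _scheduled; // deadline -> waiter

    std::future<void> sleep(uint64_t ticks) {
        std::promise<void> p;
        auto f = p.get_future();
        _scheduled.emplace(_now + ticks, std::move(p));
        return f;
    }

    void tick() {
        ++_now;
        auto due_end = _scheduled.upper_bound(_now); // waiters whose deadline has passed
        for (auto it = _scheduled.begin(); it != due_end; ++it) {
            it->second.set_value();
        }
        _scheduled.erase(_scheduled.begin(), due_end);
    }
};
```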
To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.
`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.
The interface also requires maintenance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `persistence`.
Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:
1. Before sending a command to a Raft leader through `server::add_entry`,
one must first directly contact the instance of `impure_state_machine`
replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
(represented by a promise-future pair) and a unique ID; it stores the
input side of the channel (the promise) with this ID internally and returns
the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
and sends it as a command to Raft. Thus commands are (ID, machine input)
pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
with the given ID. If it finds one, it sends the output through this
channel.
5. The command sender waits for the output on the obtained future.
The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.
Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.
A failure may cause the output to never arrive (or even the command to
never be applied) so `call` waits for a limited time. It may also
mistakenly `call` a server which is not currently the leader, but it
is prepared to handle this error.
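The channel bookkeeping behind this technique can be sketched like so (an
illustration only, with `std::promise` standing in for Seastar promises and
`int32_t` standing in for `output_t`; the real code wraps allocation and
deallocation in `with_output_channel`):
```cpp
#include <cstdint>
#include <future>
#include <unordered_map>
#include <utility>

// Illustrative sketch of the ``output channel'' bookkeeping.
struct output_channels_sketch {
    using channel_id = uint64_t;
    channel_id _next_id = 0;
    std::unordered_map<channel_id, std::promise<int32_t>> _channels;

    // Step 2: allocate a channel, return its ID and the output side (the future).
    std::pair<channel_id, std::future<int32_t>> allocate() {
        auto id = _next_id++;
        std::promise<int32_t> p;
        auto f = p.get_future();
        _channels.emplace(id, std::move(p));
        return {id, std::move(f)};
    }

    // Step 4: on apply, deliver the output if we allocated a channel for this ID;
    // replicas other than the contacted one simply won't find it.
    void deliver(channel_id id, int32_t output) {
        if (auto it = _channels.find(id); it != _channels.end()) {
            it->second.set_value(output);
            _channels.erase(it);
        }
    }
};
```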
We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.
The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receipt of messages in the
`receive` function. It defines the message type (`message_t`) that will
be used by the message-passing method.
The actual message passing is implemented with `network` and `delivery_queue`.
The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The message
receiving function on the other side, when it receives the snapshot message,
applies the snapshot and sends back a snapshot reply message that contains
the same ID. When we receive a snapshot reply message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.
`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.
`persistence` represents the data that does not get lost between server
crashes and restarts.
We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.
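A sketch of how such a batch append could maintain the invariant (types and
names are invented for the illustration; the real class implements
`raft::persistence`):
```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <vector>

struct entry_sketch { uint64_t idx; /* term, command, ... */ };

struct stored_log_sketch {
    std::deque<entry_sketch> _stored_entries;

    void store(const std::vector<entry_sketch>& batch) {
        if (batch.empty()) {
            return;
        }
        // Drop any stored suffix that would conflict with the new batch,
        // so indexes stay consecutive after the append below.
        while (!_stored_entries.empty() && _stored_entries.back().idx >= batch.front().idx) {
            _stored_entries.pop_back();
        }
        // The caller guarantees strictly increasing indexes without gaps.
        assert(_stored_entries.empty() || _stored_entries.back().idx + 1 == batch.front().idx);
        _stored_entries.insert(_stored_entries.end(), batch.begin(), batch.end());
    }
};
```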
In addition to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc`. We ensure that the stored log either ``touches''
the stored snapshot on the right side or intersects it.
In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.
Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.
`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.
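Only the conviction side is sketched below; server IDs and the threshold are
made up for the illustration, and the real detector also sends heartbeats
through the provided `send_heartbeat_t` function:
```cpp
#include <cstdint>
#include <unordered_map>

// Illustrative sketch: convict a server if no heartbeat arrived for too long,
// measured on a logical clock.
struct failure_detector_sketch {
    static constexpr uint64_t convict_threshold = 50; // logical ticks without a heartbeat
    uint64_t _now = 0;
    std::unordered_map<uint64_t, uint64_t> _last_heard; // server id -> last heartbeat time

    void add_server(uint64_t id) { _last_heard[id] = _now; }
    void remove_server(uint64_t id) { _last_heard.erase(id); }
    void receive_heartbeat(uint64_t from) { _last_heard[from] = _now; }
    void tick() { ++_now; }

    bool is_alive(uint64_t id) const {
        auto it = _last_heard.find(id);
        return it != _last_heard.end() && _now - it->second <= convict_threshold;
    }
};
```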
`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and payload. The queue uses a logical clock to decide
when to deliver messages; it delivers all messages whose associated
times are smaller than the current time.
The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.
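The event queue can be sketched as below; node IDs, the per-message delay and
the payload type are assumptions made for the illustration:
```cpp
#include <cstdint>
#include <functional>
#include <map>
#include <utility>

// Illustrative sketch of `network`: events are keyed by logical delivery time
// and handed to a user-supplied deliver function once the clock reaches it.
template <typename Payload>
struct network_sketch {
    using node_id = uint64_t;
    using deliver_t = std::function<void(node_id src, node_id dst, const Payload&)>;

    struct event { node_id src; node_id dst; Payload payload; };

    deliver_t _deliver;
    uint64_t _now = 0;
    std::multimap<uint64_t, event> _events; // delivery time -> event

    explicit network_sketch(deliver_t f) : _deliver(std::move(f)) {}

    void send(node_id src, node_id dst, Payload p, uint64_t delay) {
        _events.emplace(_now + delay, event{src, dst, std::move(p)});
    }

    void tick() {
        ++_now;
        auto due_end = _events.upper_bound(_now); // everything due by now
        for (auto it = _events.begin(); it != due_end; ++it) {
            _deliver(it->second.src, it->second.dst, it->second.payload);
        }
        _events.erase(_events.begin(), due_end);
    }
};
```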
The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.
That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.
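Conceptually (and only conceptually; the real queue runs its processing as a
separate fiber and `rpc::receive` returns a `future<>`):
```cpp
#include <deque>
#include <functional>
#include <utility>

// Illustrative sketch of the role of delivery_queue: `network` only pushes
// messages here (``delivery'' is instantaneous), while a separate loop pops
// them and runs the potentially long processing step.
template <typename Message>
struct delivery_queue_sketch {
    std::deque<Message> _queue;
    std::function<void(Message)> _process; // stands in for rpc::receive

    void deliver(Message m) { _queue.push_back(std::move(m)); }

    // In the real framework this is receive_fiber(), which awaits each
    // rpc::receive; here it simply drains synchronously for illustration.
    void receive_some() {
        while (!_queue.empty()) {
            Message m = std::move(_queue.front());
            _queue.pop_front();
            _process(std::move(m));
        }
    }
};
```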
`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.
`environment` represents a set of `raft_server`s connected by a `network`.
The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.
The environment needs to be periodically `tick()`ed, which ticks the network
and the underlying servers.
`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.
Finally, we add a simple test that serves as an example of using the
implemented framework. We introduce `ExRegister`, an implementation
of `PureStateMachine` that stores an `int32_t` and handles ``exchange''
and ``read'' inputs; an exchange replaces the state with the given value
and returns the previous state, a read does not modify the state and returns
the current state. In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.
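The register itself is small; here is an illustrative definition shaped to
satisfy the `PureStateMachine` sketch shown earlier (the test's actual `ExReg`
may differ in details such as serialization):
```cpp
#include <cstdint>
#include <utility>
#include <variant>

struct ExRegSketch {
    struct exchange { int32_t new_value; };
    struct read {};

    using input_t  = std::variant<exchange, read>;
    using output_t = int32_t;
    using state_t  = int32_t;

    static constexpr state_t init = 0;

    static std::pair<state_t, output_t> delta(state_t s, input_t in) {
        if (auto* ex = std::get_if<exchange>(&in)) {
            return {ex->new_value, s}; // exchange: new state, output is the old value
        }
        return {s, s};                 // read: state unchanged, output is the current value
    }
};

// Assuming the PureStateMachine concept sketched earlier is in scope:
// static_assert(PureStateMachine<ExRegSketch>);
```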
* kbr/randomized-nemesis-test-v5:
raft: randomized_nemesis_test: basic test
raft: randomized_nemesis_test: ticker
raft: randomized_nemesis_test: environment
raft: randomized_nemesis_test: server
raft: randomized_nemesis_test: delivery queue
raft: randomized_nemesis_test: network
raft: randomized_nemesis_test: heartbeat-based failure detector
raft: randomized_nemesis_test: memory backed persistence
raft: randomized_nemesis_test: rpc
raft: randomized_nemesis_test: impure_state_machine
raft: randomized_nemesis_test: introduce logical_timer
raft: randomized_nemesis_test: `PureStateMachine` concept
* scylla-dev/raft-cleanup-v1:
raft: drop _leader_progress tracking from the tracker
raft: move current_leader into the follower state
raft: add some precondition checks
In commit c82250e0cf (gossip: Allow deferring
advertise of local node to be up), the replacing node was changed to postpone
responding to the gossip echo message, to avoid other nodes sending read
requests to the replacing node. It works as follows:
1) the replacing node does not respond to the echo message, so that other
nodes do not mark it as alive
2) the replacing node advertises the hibernate state, so other nodes know
it is replacing
3) the replacing node responds to the echo message, so other nodes can mark
it as alive
This is problematic because after step 2, the existing nodes in the
cluster will start to send writes to the replacing node, but at this
time it is possible that existing nodes haven't marked the replacing
node as alive, thus failing the write request unnecessarily.
For instance, we saw the following errors in issue #8013 (Cassandra
stress fails to achieve consistency when only one of the nodes is down)
```
scylla:
[shard 1] consistency - Live nodes 2 do not satisfy ConsistencyLevel (2
required, 1 pending, live_endpoints={127.0.0.2, 127.0.0.1},
pending_endpoints={127.0.0.3}) [shard 0] gossip - Fail to send
EchoMessage to 127.0.0.3: std::runtime_error (Not ready to respond
gossip echo message)
c-s:
java.io.IOException: Operation x10 on key(s) [4c4f4d37324c35304c30]:
Error executing: (UnavailableException): Not enough replicas available
for query at consistency QUORUM (2 required but only 1 alive
```
To solve this problem for older releases without the patch "repair:
Switch to use NODE_OPS_CMD for replace operation", a minimal fix is
implemented in this patch. Once existing nodes learn that the replacing node
is in HIBERNATE state, they record it as replacing, but add it to the
pending endpoints list only after the replacing node is
marked as alive.
With this patch, when the existing nodes start to write to the replacing
node, the replacing node is already alive.
Tests: replace_address_test.py:TestReplaceAddress.replace_node_same_ip_test + manual test
Fixes: #8013
Closes #8614
Too many or overly resource-hungry reads often lie at the heart of issues
that require an investigation with gdb. Therefore it is very useful to
have a way to summarize all reads found on a shard, with their states and
resource consumption. This is exactly what this new command does. To do
this it uses the reader concurrency semaphores and their permits,
which are now arranged in an intrusive list and are therefore
enumerable.
Example output:
```
(gdb) scylla read-stats
Semaphore _read_concurrency_sem with: 1/100 count and 14334414/14302576 memory resources, queued: 0, inactive=1
permits count   memory  table/description/state
      1     1 14279738  multishard_mutation_query_test.fuzzy_test/fuzzy-test/active
     16     0    53532  multishard_mutation_query_test.fuzzy_test/shard-reader/active
      1     0     1144  multishard_mutation_query_test.fuzzy_test/shard-reader/inactive
      1     0        0  *.*/view_builder/active
      1     0        0  multishard_mutation_query_test.fuzzy_test/multishard-mutation-query/active
     20     1 14334414  Total
```
* botond/scylla-gdb.py-scylla-reads/v5:
scylla-gdb.py: introduce scylla read-stats
scylla-gdb.py: add pretty printer for std::string_view
scylla-gdb.py: std_map() add __len__()
scylla-gdb.py: prevent infinite recursion in intrusive_list.__len__()
The repair parallelism is calculated from the amount of memory allocated to
repair and the memory usage per repair instance. Currently, it does not
consider memory bloat issues (e.g., issue #8640) which cause repair to
use more memory and lead to std::bad_alloc.
Be more conservative when calculating the parallelism to avoid repair
using too much memory.
Fixes #8641
Closes #8652
The auth initialization path contains a fixed 15s delay,
which used to work around a couple of issues (#3320, #3850),
but is right now quite useless, because a retry mechanism
is already in place anyway.
This patch speeds up the boot process if authentication is enabled.
In particular, for a single-node cluster, common for test setups,
auth initialization now takes a couple of milliseconds instead
of the whole 15 seconds.
Fixes #8648
Closes #8649
This is a simple test that serves as an example of using the
framework implemented in the previous commits. We introduce
`ExRegister`, an implementation of `PureStateMachine` that stores
an `int32_t` and handles ``exchange'' and ``read'' inputs;
an exchange replaces the state with the given value and returns
the previous state, a read does not modify the state and returns
the current state. In order to pass the inputs to Raft we must
serialize them into commands so we implement instances of `ser::serializer`
for `ExReg`'s input types.
`ticker` calls the given function as fast as the Seastar reactor
allows and yields between each call. It may be provided a limit
for the number of calls; it crashes the test if the limit is reached
before the ticker is `abort()`ed.
The commit also introduces a `with_env_and_ticker` helper function which
creates an `environment`, a `ticker`, and passes references to them to
the given function. It destroys them after the function finishes
by calling `abort()`.
`environment` represents a set of `raft_server`s connected by a `network`.
The `network` inside is initialized with a message delivery function
which notifies the destination server's failure detector on each message
and if the message contains an RPC payload, pushes it into the destination's
`delivery_queue`.
The environment needs to be periodically `tick()`ed, which ticks the network
and the underlying servers.
New servers can be created in the environment by calling `new_server`.
`raft_server` is a package that contains `raft::server` and other
facilities needed for the server to communicate with its environment:
the delivery queue, the set of snapshots (shared by
`impure_state_machine`, `rpc` and `persistence`) and references to the
`impure_state_machine` and `rpc` instances of this server.
The fact that `network` has delivered a message does not mean the
message was processed by the receiver. In fact, `network` assumes that
delivery is instantaneous, while processing a message may be a long,
complex computation, or even require IO. Thus, after a message is
delivered, something else must ensure that it is processed by the
destination server.
That something in our framework is `delivery_queue`. It will be the
bridge between `network` and `rpc`. While `network` is shared by all
servers - it represents the ``environment'' in which the servers live -
each server has its own private `delivery_queue`. When `network`
delivers an RPC message it will end up inside `delivery_queue`. A
separate fiber, `delivery_queue::receive_fiber()`, will process those
messages by calling `rpc::receive` (which is a potentially long
operation, thus returns a `future<>`) on the `rpc` of the destination
server.
`network` is a simple priority queue of "events", where an event is a
message associated with delivery time. Each message contains a source,
a destination, and payload. The queue uses a logical clock to decide
when to deliver messages; it delivers all messages whose associated
times are smaller than the current time.
The exact delivery method is unknown to `network` but passed as a
`deliver_t` function in the constructor. The type of payload is generic.
In order to simulate a production environment as closely as possible, we
implement a failure detector which uses heartbeats for deciding whether
to convict a server as failed. We convict a server if we don't receive a
heartbeat for a long enough time.
Similarly to `rpc`, `failure_detector` assumes a message passing method
given by a `send_heartbeat_t` function through the constructor.
`failure_detector` uses the knowledge about existing servers to decide
who to send heartbeats to. Updating this knowledge happens through
`add_server` and `remove_server` functions.
`persistence` represents the data that does not get lost between server
crashes and restarts.
We store a log of commands in `_stored_entries`. It is invariably
``contiguous'', meaning that the index of each entry except the first is
equal to the index of the previous entry plus one at all times (i.e.
after each yield). We assume that the caller provides log entries
in strictly increasing index order and without gaps.
In addition to storing log entries, `persistence` can be asked to store
or load a snapshot. To implement this it takes a reference to a set of snapshots
(`snapshots_t&`) which it will share with `impure_state_machine` and an
implementation of `rpc` coming in a later commit. We ensure that the stored
log either ``touches'' the stored snapshot on the right side or intersects it.
We implement the `raft::rpc` interface, allowing Raft servers to
communicate with other Raft servers.
The implementation is mostly boilerplate. It assumes that there exists a
method of message passing, given by a `send_message_t` function passed
in the constructor. It also handles the receipt of messages in the
`receive` function. It defines the message type (`message_t`) that will
be used by the message-passing method.
The actual message passing is implemented with `network` and `delivery_queue`
which are introduced in later commits.
The only slightly complex thing in `rpc` is the implementation of `send_snapshot`
which is the only function in the `raft::rpc` interface that actually
expects a response. To implement this, before sending the snapshot
message we allocate a promise-future pair and assign to it a unique ID;
we store the promise and the ID in a data structure. We then send the
snapshot together with the ID and wait on the future. The message
receiving function on the other side, when it receives the snapshot message,
applies the snapshot and sends back a snapshot reply message that contains
the same ID. When we receive a snapshot reply message we look up the ID in the
data structure and if we find a promise, we push the reply through that
promise.
`rpc` also keeps a reference to `snapshots_t` - it will refer to the
same set of snapshots as the `impure_state_machine` on the same server.
It accesses the set when it receives or sends a snapshot message.
To replicate a state machine, our Raft implementation requires it to
be represented with the `raft::state_machine` interface.
`impure_state_machine` is an implementation of `raft::state_machine`
that wraps a `PureStateMachine`. It keeps a variable of type `state_t`
representing the current state. In `apply` it deserializes the given
command into `input_t`, uses the transition (`delta`) function to
produce the next state and output, replaces its current state with the
obtained state and returns the output (more on that below); it does so
sequentially for every given command. We can think of `PureStateMachine`
as the actual state machine - the business logic, and
`impure_state_machine` as the ``boilerplate'' that allows the pure machine
to be replicated by Raft and communicate with the external world.
The interface also requires maintenance of snapshots. We introduce the
`snapshots_t` type representing a set of snapshots known by a state
machine. `impure_state_machine` keeps a reference to `snapshots_t`
because it will share it with an implementation of `raft::persistence`
coming with a later commit.
Returning outputs is a bit tricky because apply is ``write-only'' - it
returns `future<>`. We use the following technique:
1. Before sending a command to a Raft leader through `server::add_entry`,
one must first directly contact the instance of `impure_state_machine`
replicated by the leader, asking it to allocate an ``output channel''.
2. On such a request, `impure_state_machine` creates a channel
(represented by a promise-future pair) and a unique ID; it stores the
input side of the channel (the promise) with this ID internally and returns
the ID and the output side of the channel (the future) to the requester.
3. After obtaining the ID, one serializes the ID together with the input
and sends it as a command to Raft. Thus commands are (ID, machine input)
pairs.
4. When `impure_state_machine` applies a command, it looks for a promise
with the given ID. If it finds one, it sends the output through this
channel.
5. The command sender waits for the output on the obtained future.
The allocation and deallocation of channels is done using the
`impure_state_machine::with_output_channel` function. The `call`
function is an implementation of the above technique.
Note that only the leader will attempt to send the output - other
replicas won't find the ID in their internal data structure. The set of
IDs and channels is not a part of the replicated state.
A failure may cause the output to never arrive (or even the command to
never be applied) so `call` waits for a limited time. It may also
mistakenly `call` a server which is not currently the leader, but it
is prepared to handle this error.
This is a wrapper around `raft::logical_clock` that allows scheduling
events to happen after a certain number of logical clock ticks.
For example, `logical_timer::sleep(20_t)` returns a future that resolves
after 20 calls to `logical_timer::tick()`.
The commit introduces `PureStateMachine`, which is the most direct translation
of the mathematical definition of a state machine to C++ that I could come up with.
Represented by a C++ concept, it consists of: a set of inputs
(represented by the `input_t` type), outputs (`output_t` type), states (`state_t`),
an initial state (`init`) and a transition function (`delta`) which
given a state and an input returns a new state and an output.
The rest of the testing infrastructure is going to be
generic w.r.t. `PureStateMachine`. This will allow easily implementing
tests using both simple and complex state machines by substituting the
proper definition for this concept.
One possibility of modifying this definition would be to have `delta`
return `future<pair<state_t, output_t>>` instead of
`pair<state_t, output_t>`. This would lose some ``purity'' but allow
long computations without reactor stalls in the tests. Such a modification,
if we decide to do it, would be trivial.
Uses the infrastructure for testing mutation_sources, but only a
subset of it which does not do fast forwarding (since virtual_table
does not support it).
This change introduces a query_restrictions object into the virtual
table infrastructure, for now only holding a restriction on partition
ranges.
That partition range is then implemented into
memtable_filling_virtual_table.
This change adds a more specific implementation of the virtual table
called memtable_filling_virtual_table. It produces results by filling
a memtable on each read.
This change introduces the basic interface we expect each virtual
table to implement. More specific implementations will then expand
upon it if needed.
This change adds a new type of mutation reader whose purpose
is to allow inserting operations before an invocation of the proper
reader. It takes a future to wait on and only after it resolves will
it forward the execution to the underlying flat_mutation_reader
implementation.
We remove an error-severity log message about an exception that is
later thrown, caught a few lines below, and then printed out as
a warning.
Fixes #8616
Closes #8617
Refs #251.
Closes #8630
* github.com:scylladb/scylla:
statistics: add global bloom filter memory gauge
statistics: add some sstable management metrics
sstables: make the `_open` field more useful
sstables: stats: noexcept all accessors
"
Row cache reader can produce overlapping range tombstones in the mutation
fragment stream even if there is only a single range tombstone in sstables,
due to #2581. For every range between two rows, the row cache reader queries
for tombstones relevant for that range. The result of the query is trimmed to
the current position of the reader (=position of the previous row) to satisfy
key monotonicity. The end position of range tombstones is left unchanged. So
cache reader will split a single range tombstone around rows. Those range
tombstones are transient, they will be only materialized in the reader's
stream, they are not persisted anywhere.
That is not a problem in itself, but it interacts badly with mutation
compactor due to #8625. The range_tombstone_accumulator which is used to
compact the mutation fragment stream needs to accumulate all tombstones which
are relevant for the current clustering position in the stream. Adding a new
range tombstone is O(N) in the number of currently active tombstones. This
means that producing N rows will be O(N^2).
In a unit test introduced in this series, I saw reading 137'248 rows which
overlap with a range tombstone take 245 seconds. Almost all of CPU time is in
drop_unneeded_tombstones().
The solution is to make the cache reader trim range tombstone end to the
currently emitted sub-range, so that it emits non-overlapping range tombstones.
Fixes #8626.
Tests:
- row_cache_test (release)
- perf_row_cache_reads (release)
"
* tag 'fix-perf-many-rows-covered-by-range-tombstone-v2' of github.com:tgrabiec/scylla:
tests: perf_row_cache_reads: Add scenario for lots of rows covered by a range tombstone
row_cache: Avoid generating overlapping range tombstones
range_tombstone_accumulator: Avoid update_current_tombstone() when nothing changed
Add the following metrics, as part of #251:
- open for writing (a.k.a. "created", unless I'm missing something?)
- open for reading
- deleted
- currently open for reading/writing (gauges)
Refs #251.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
As a function returning a future, simplify
its interface by handling any exceptions and
returning an exceptional future instead of
propagating the exception.
In this specific case, throwing from advance_and_await()
will propagate through table::await_pending_* calls
short-circuiting a .finally clause in table::stop().
Also, mark as noexcept the methods of class table calling
advance_and_await and table::await_pending_ops that depend on them.
Fixes #8636
A followup patch will convert advance_and_await to a coroutine.
This is done separately to facilitate backporting of this patch.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210511161407.218402-1-bhalevy@scylladb.com>
Row cache reader can produce overlapping range tombstones in the
mutation fragment stream even if there is only a single range
tombstone in sstables, due to #2581. For every range between two rows,
the row cache reader queries for tombstones relevant for that
range. The result of the query is trimmed to the current position of
the reader (=position of the previous row) to satisfy key
monotonicity. The end position of range tombstones is left
unchanged. So cache reader will split a single range tombstone around
rows. Those range tombstones are transient, they will be only
materialized in the reader's stream, they are not persisted anywhere.
That is not a problem in itself, but it interacts badly with mutation
compactor due to #8625. The range_tombstone_accumulator which is used
to compact the mutation fragment stream needs to accumulate all
tombstones which are relevant for the current clustering position in
the stream. Adding a new range tombstone is O(N) in the number of
currently active tombstones. This means that producing N rows will be
O(N^2).
In a unit test, I saw reading 137'248 rows which overlap with a range
tombstone take 245 seconds. Almost all of CPU time is in
drop_unneeded_tombstones().
The solution is to make the cache reader trim range tombstone end to
the currently emitted sub-range, so that it emits non-overlapping range
tombstones.
Fixes #8626.
Recalculation of the current tombstone is O(N) in the number of active
range tombstones. This can be a significant overhead, so better avoid it.
Solves the problem of quadratic complexity when producing lots of
overlapping range tombstones with a common end bound.
Refs #8625
Refs #8626
When an index is created without an explicit name, a default name
is chosen. However, there was no check whether a table with a conflicting
name already exists. The check is now in place and, if any conflicts
are found, a new index name is chosen instead.
When an index is created *with* an explicit name and a conflicting
regular table is found, index creation should simply fail.
This series comes with a test.
Fixes #8620
Tests: unit(release)
Closes #8632
* github.com:scylladb/scylla:
cql-pytest: add regression tests for index creation
cql3: fail to create an index if there is a name conflict
database: check for conflicting table names for indexes
The operator== of enum_option<> (which we use to hold multi-valued
Scylla options) makes it easy to compare to another enum_option
wrapper, but ugly to compare the actual value held. So this patch
adds a nicer way to compare the value held.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210511120222.1167686-1-nyh@scylladb.com>
utils::phased_barrier holds a `lw_shared_ptr<gate>` that is
typically `enter()`ed in `phased_barrier::start()`,
and left when the operation is destroyed in `~operation`.
Currently, the operation move-assign implementation is the
default one that just moves the lw_shared gate ptr from the
other operation into this one, without calling `_gate->leave()` first.
This change first destroys *this when move-assigned (if not self)
to call _gate->leave() if engaged, before reassigning the
gate with the other operation::_gate.
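A sketch of the fixed move-assignment pattern described above (the gate type
is simplified for the illustration; the real operation holds an
`lw_shared_ptr<gate>`):
```cpp
#include <new>
#include <utility>

// Simplified stand-in for the gate, for illustration only.
struct gate_sketch {
    int _count = 0;
    void enter() { ++_count; }
    void leave() { --_count; }
};

struct operation_sketch {
    gate_sketch* _gate = nullptr; // stands in for lw_shared_ptr<gate>

    operation_sketch() = default;
    explicit operation_sketch(gate_sketch& g) : _gate(&g) { _gate->enter(); }
    operation_sketch(operation_sketch&& o) noexcept : _gate(std::exchange(o._gate, nullptr)) {}

    operation_sketch& operator=(operation_sketch&& o) noexcept {
        if (this != &o) {
            // Leave the currently held gate first (the part the default
            // move-assignment was missing), then take over the other gate.
            this->~operation_sketch();
            new (this) operation_sketch(std::move(o));
        }
        return *this;
    }

    ~operation_sketch() {
        if (_gate) {
            _gate->leave();
        }
    }
};
```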
A unit test that reproduces the issue before this change
and passes with the fix was added to serialized_action_test.
Fixes #8613
Test: unit(dev), serialized_action_test(debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210510120703.1520328-1-bhalevy@scylladb.com>
"
The current printout has multiple problems:
* It is segregated by state, each state having its own sorting criteria;
* The number of permits and the count resources are collapsed into a single
column, so it is not clear which one is printed;
* The number of available/initial units of the semaphore is not printed.
This series solves all these problems:
* It merges all states into a single table, sorted by memory
consumption, in descending order.
* It separates number of permits and count resources into separate
columns.
* Prints a summary of the semaphore units.
* Provides a cap on the maximum number of printable lines, to not blow
up the logs.
The goal of all this is to make it easy to find the culprit of a semaphore
problem: easily spot the big memory consumers, then unpack the name
column to determine which table and code path is responsible.
This brings the printout close to the recently added `scylla reads`
scylla-gdb.py command, providing a uniform report format across the two
tools.
Example report:
```
INFO 2021-05-07 09:52:16,806 [shard 0] testlog - With max-lines=4: Semaphore reader_concurrency_semaphore_dump_reader_diganostics with 8/2147483647 count and 263599186/9223372036854775807 memory resources: user request, dumping permit diagnostics:
permits count memory table/description/state
      7     2    77M ks.tbl1/op1/active
      6     3    59M ks.tbl1/op0/active
      4     0    36M ks.tbl1/op2/active
      3     1    36M ks.tbl0/op2/active
     11     2    43M permits omitted for brevity
     31     8   251M total
```
"
* 'reader-concurrency-semaphore-dump-improvement/v1' of https://github.com/denesb/scylla:
test: reader_concurrency_test: add reader_concurrency_semaphore_dump_reader_diganostics
reader_concurrency_semaphore: dump_reader_diagnostics(): print more information in the header
reader_concurrency_semaphore: dump_reader_diagnostics(): cap number of printed lines
reader_concurrency_semaphore: dump_reader_diagnostics(): sort lines in descending order
reader_concurrency_semaphore: dump_reader_diagnostics(): merge all states into a single table
reader_concurrency_semaphore: dump_reader_diagnostics(): separate number of permits and count resources
In commit 3e39985c7a we added the Cassandra-compatible system table
system."IndexInfo" (note the capitalized table name) which lists built
indexes. Because we already had a table of built materialized views, and
indexes are implemented as materialized views, the index list was
implemented as a virtual table based on the view list.
However, the *name* of each materialized view listed in the list of
views looks like something_index, with the suffix "_index", while the
name of the table we need to print is "something". We forgot to do this
transformation in the virtual table - and this is what this patch does.
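The transformation itself amounts to stripping that suffix; a hypothetical
helper (name and exact behavior are assumptions for the illustration, not the
patch's actual code):
```cpp
#include <string>
#include <string_view>

// Map a view name like "something_index" back to the index name "something".
std::string index_name_from_view_name(std::string_view view_name) {
    constexpr std::string_view suffix = "_index";
    if (view_name.size() > suffix.size() && view_name.ends_with(suffix)) {
        view_name.remove_suffix(suffix.size());
    }
    return std::string(view_name);
}
```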
This bug can confuse applications which use this system table to wait for
an index to be built. Several tests translated from Cassandra's unit
tests, in cassandra_tests/validation/entities/secondary_index_test.py fail
in wait_for_index() because of this incompatibility, and pass after this
patch.
This patch also changes the unit test that enshrined the previous, wrong,
behavior, to test for the correct behavior. This problem is typical of
C++ unit tests which cannot be run against Cassandra.
Fixes #8600
Unfortunately, although this patch fixes "typical" applications (including
all tests which I tried) - applications which read from IndexInfo in a
"typical" method to look for a specific index being ready, the
implementation is technically NOT correct: The problem is that index
names are not sorted in the right order, because they are sorted with
the "_index" prefix.
To give an example, the index name "a" should be listed before "a1", but
the view names "a1_index" comes before "a_index" (because in ASCII, 1
comes before underscore). I can't think of any way to fix this bug
without completely reimplementing IndexInfo in a different way - probably
based on a temporary memtable (which is fine as this is not a
performance-critical operation). We'll need to do this rewrite eventually,
and I'll open a new issue.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210509140113.1084497-1-nyh@scylladb.com>
Ref: #7617
This series adds timeout parameters to service levels.
Per-service-level timeouts can be set up in the form of service level parameters, which can in turn be attached to roles. Setting up and modifying role-specific timeouts can be achieved like this:
```cql
CREATE SERVICE LEVEL sl2 WITH read_timeout = 500ms AND write_timeout = 200ms AND cas_timeout = 2s;
ATTACH SERVICE LEVEL sl2 TO cassandra;
ALTER SERVICE LEVEL sl2 WITH write_timeout = null;
```
Per-service-level timeouts take precedence over default timeout values from scylla.yaml, but can still be overridden for a specific query by per-query timeouts (e.g. `SELECT * from t USING TIMEOUT 50ms`).
Closes #7913
* github.com:scylladb/scylla:
docs: add a paragraph describing service level timeouts
test: add per-service-level timeout tests
test: add refreshing client state
transport: add updating per-service-level params
client_state: allow updating per service level params
qos: allow returning combined service level options
qos: add a way of merging service level options
cql3: add preserving default values for per-sl timeouts
qos: make getting service level public
qos: make finding service level public
treewide: remove service level controller from query state
treewide: propagate service level to client state
sstables: disambiguate boost::find
cql3: add a timeout column to LIST SERVICE LEVEL statement
db: add extracting service level info via CQL
types: add a missing translation for cql_duration
cql3: allow unsetting service level timeouts
cql3: add validating service level timeout values
db: add setting service level params via system_distributed
cql3: add fetching service level attrs in ALTER and CREATE
cql3: add timeout to service level params
qos: add timeout to service level info
db,sys_dist_ks: add timeout to the service level table
migration_manager: allow table updates with timestamp
cql3: allow a null keyword for CQL properties