This patch changes the signatures of `test_assignment` and
`test_all` functions to accept `cql3::column_specification` by
const reference instead of shared pointer.
Mostly a cosmetic change reducing overall shared_ptr bloat in
cql3 code.
Tests: unit(dev, debug)
Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
Message-Id: <20200529195249.767346-1-pa.solodovnikov@scylladb.com>
Compaction is checking for abortion whenever it's consuming a new partition.
The problem with this approach is that the abortion can take too long if
compaction is working with really large partitions. If the current partition
takes minutes to be compacted, it means that abortion may be delayed by
a factor of minutes as well.
Truncate, for example, relies on this abortion mechanism, so it could happen
that the operation would take much longer than expected due to this
ineffiency, probably result in timeouts in the user side.
To fix this, it's clear that we need to increase the frequency at which
we check for abortion requests. More precisely, we need to do it not only
on partition granularity, but also on row granularity.
Fixes#6309.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20200529172847.44444-1-raphaelsc@scylladb.com>
The timer.stop() call, that reports not only the time-taken, but also
the reclaimation rate, was unintentionally dropped while expanding its
scope (c70ebc7c).
Take it back (and mark the compact_and_evict_locked as private while
at it).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200528185331.10537-1-xemul@scylladb.com>
Currently test.py has three different places it checks whether stdout is
a tty. This patch centralizes these into a single global variable. This
ensures consistency and makes it easier to override it later with a
command-line switch, should we want to.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200529101124.123925-1-bdenes@scylladb.com>
Instead of doing 3 smp::invoke_on_all-s and duplicating
tracker::impl API for the tracker itself, introduce the
tracker::configure, simplify the tracker configuration
and narrow down the public tracker API.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200528185442.10682-1-xemul@scylladb.com>
"
Currently we classify queries as "system" or "user" based on the table
they target. The class of a query determines how the query is treated,
currently: timeout, limits for reverse queries and the concurrency
semaphore. The catch is that users are also allowed to query system
tables and when doing so they will bypass the limits intended for user
queries. This has caused performance problems in the past, yet the
reason we decided to finally address this is that we want to introduce a
memory limit for unpaged queries. Internal (system) queries are all
unpaged and we don't want to impose the same limit on them.
This series uses scheduling groups to distinguish user and system
workloads, based on the assumption that user workloads will run in the
statement scheduling group, while system workloads will run in the main
(or default) scheduling group, or perhaps something else, but in any
case not in the statement one. Currently the scheduling group of reads
and writes is lost when going through the messaging service, so to be
able to use scheduling groups to distinguish user and system reads this
series refactors the messaging service to retain this distinction across
verb calls. Furthermore, we execute some system reads/writes as part of
user reads/writes, such as auth and schema sync. These processes are
tagged to run in the main group.
This series also centralises query classification on the replica and
moves it to a higher level. More specifically, queries are now
classified -- the scheduling group they run in is translated to the
appropriate query class specific configuration -- on the database level
and the configuration is propagated down to the lower layers.
Currently this query class specific configuration consists of the reader
concurrency semaphore and the max memory limit for otherwise unlimited
queries. A corollary of the semaphore begin selected on the database
level is that the read permit is now created before the read starts. A
valid permit is now available during all stages of the read, enabling
tracking the memory consumption of e.g. the memtable and cache readers.
This change aligns nicely with the needs of more accurate reader memory
tracking, which also wants a valid permit that is available in every layer.
The series can be divided roughly into the following distinct patch
groups:
* 01-02: Give system read concurrency a boost during startup.
* 03-06: Introduce user/system statement isolation to messaging service.
* 07-13: Various infrastructure changes to prepare for using read
permits in all stages of reads.
* 14-19: Propagate the semaphore and the permit from database to the
various table methods that currently create the permit.
* 20-23: Migrate away from using the reader concurrency semaphore for
waiting for admission, use the permit instead.
* 24: Introduce `database::make_query_config()` and switch the database
methods needing such a config to use it.
* 25-31: Get rid of all uses of `no_reader_permit()`.
* 32-33: Ban empty permits for good.
* 34: querier_cache: use the queriers' permits to obtain the semaphore.
Fixes: #5919
Tests: unit(dev, release, debug),
dtest(bootstrap_test.py:TestBootstrap.start_stop_test_node), manual
testing with a 2 node mixed cluster with extra logging.
"
* 'query-class/v6' of https://github.com/denesb/scylla: (34 commits)
querier_cache: get semaphore from querier
reader_permit: forbid empty permits
reader_permit: fix reader_resources::operator bool
treewide: remove all uses of no_reader_permit()
database: make_multishard_streaming_reader: pass valid permit to multi range reader
sstables: pass valid permits to all internal reads
compaction: pass a valid permit to sstable reads
database: add compaction read concurrency semaphore
view: use valid permits for reads from the base table
database: use valid permit for counter read-before-write
database: introduce make_query_class_config()
reader_concurrency_semaphore: remove wait_admission and consume_resources()
test: move away from reader_concurrency_semaphore::wait_admission()
reader_permit: resource_units: introduce add()
mutation_reader: restricted_reader: work in terms of reader_permit
row_cache: pass a valid permit to underlying read
memtable: pass a valid permit to the delegate reader
table: require a valid permit to be passed to most read methods
multishard_mutation_query: pass a valid permit to shard mutation sources
querier: add reader_permit parameter and forward it to the mutation_source
...
GC writer, used for incremental compaction, cannot be currently used if interposer
consumer is used. That's because compaction assumes that GC writer will be operated
only by a single compaction writer at a given point in time.
With interposer consumer, multiple writers will concurrently operate on the same
GC writer, leading to race condition which potentially result in use-after-free.
Let's disable GC writer if interposer consumer is enabled. We're not losing anything
because GC writer is currently only needed on strategies which don't implement an
interposer consumer. Resharding will always disable GC writer, which is the expected
behavior because it doesn't support incremental compaction yet.
The proper fix, which allows GC writer and interposer consumer to work together,
will require more time to implement and test, and for that reason, I am postponing
it as #6472 is a showstopper for the current release.
Fixes#6472.
tests: mode(dev).
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reviewed-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200526195428.230472-1-raphaelsc@scylladb.com>
The ScanFilter and QueryFilter features are only partially implemented.
Most of their unimplemented features cause clear errors telling the user
of the unimplemented feature, but one exception is the ConditionalOperator
parameter, which can be used to "OR", instead of the default "AND", of
several conditions. Before this patch, we simply ignored this parameter -
causing wrong results to be returned instead of an error.
In this patch, ScanFilter and QueryFilter parse, instead of ignoring, the
ConditionalOperator. The common implementation, get_filtering_restrictions(),
still does not implement the OR case, but returns an error if we reach
this case instead of just ignoring it.
There is no new test. The existing test_query_filter.py::test_query_filter_or
xfailed before this patch, and continues to xfail after it, but the failure
is different (you can see it by running the test with "--runxfail"):
Before this patch, the failure was because of different results. After this
patch, the failure is because of an "unimplemented" error message.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200528214721.230587-2-nyh@scylladb.com>
The code for parsing the ConditionalOperator attribute was used once in
for the "Expected" case, but we will also need it for the "QueryFilter" and
"ScanFilter" cases, so let's extract it into a function,
get_conditional_operator().
While doing this extraction, I also noticed a bug: when Expected is missing,
ConditionalOperator should not be allowed. We correctly checked the case
of an empty Expected, but forgot to also check the case of a missing
Expected. So the new code also fixes this corner case, and we include
a new test case for it (which passes on DynamoDB and used to fail in
Alternator but passes after this patch).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20200528214721.230587-1-nyh@scylladb.com>
I just hit a circularity in header inclusion that I traced back to the
fact that schema.hh includes compaction_strategy.hh. schema.hh is in
turn included in lots of places, so a circularity is not hard to come
by.
The schema header really only needs to know about the compaction_type,
so it can inform schema users about it. Following the trend in header
clenups, I am moving that to a separate header which will both break
the circularity and make sure we are included less stuff that is not
needed.
With this change, Scylla fails to compile due to a new missing forward
declaration at index/secondary_index_manager.hh, so this is fixed.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20200527172203.915936-1-glauber@scylladb.com>
Until now, view updates were generated with a bunch of random
time points, because the interface was not adjusted for passing
a single time point. The time points were used to determine
whether cells were alive (e.g. because of TTL), so it's better
to unify the process:
1. when generating view updates from user writes, a single time point
is used for the whole operation
2. when generating view updates via the view building process,
a single time point is used for each build step
NOTE: I don't see any reliable and deterministic way of writing
test scenarios which trigger problems with the old code.
After #6488 is resolved and error injection is integrated
into view.cc, tests can be added.
Fixes#6429
Tests: unit(dev)
Message-Id: <f864e965eb2e27ffc13d50359ad1e228894f7121.1590070130.git.sarna@scylladb.com>
Currently the `querier_cache` is passed a semaphore during its
construction and it uses this semaphore to do all the inactive reader
registering/unregistering. This is inaccurate as in theory cached reads
could belong to different semaphores (although currently this is not yet
the case). As all queriers store a valid permit now, use this
permit to obtain the semaphore the querier is associated with, and
register the inactive read with this semaphore.
Remove `no_reader_permit()` and all ways to create empty (invalid)
permits. All permits are guaranteed to be valid now and are only
obtainable from a semaphore.
`reader_permit::semaphore()` now returns a reference, as it is
guaranteed to always have a valid semaphore reference.
We will soon require a valid permit for all reads, including low level
index reads. The sstable layer has several internal reads which can not
be associated with either the user or the system read semaphores or it
would be very hard to obtain the correct semaphore, for limited/no gain.
To be able to pass a valid permit still, we either expose a permit
parameter so upper layers can pass down one, or create a local semaphore
for these reads and use that to obtain a permit.
The following methods now require a permit to be passed to them:
* `sstables::sstabe::read_data()`: only used in tests.
The following methods use internal semaphores:
* `sstables::sstable::generate_summary()` used when loading an sstable.
* `sstables::sstable::has_partition_key()`: used by a REST API method.
All reads will soon require a valid permit, including those done during
compaction. To allow creating valid permits for these reads create a
compaction specific semaphore. This semaphore is unlimited as compaction
concurrency is managed by higher level layer, we use just for resource
usage accounting.
View update generation involves reading existing values from the base
table, which will soon require a valid permit to be passed to it, so
make sure we create and pass a valid permit to these reads.
We use `database::make_query_class_config()` to obtain the semaphore for
the read which selects the appropriate user/system semaphore based on
the scheduling group the base table write is running in.
Counter writes involve a read-before-write, which will soon require a
valid permit to be passed to it, so make sure we create and pass a valid
permit to this read. We use `database::make_query_class_config()` to
obtain the semaphore for the read which selects the appropriate
user/system semaphore based on the scheduling group the counter write is
running in.
And use it to obtain any query-class specific configuration that was
obtained from `table::config` before, such as the read concurrency
semaphore and the max memory limit for unlimited queries. As all users
of these items get these from the query class config now, we can remove
them from `table::config`.
Permits are now created with `make_permit()` and code is using the
permit to do all resource consumption tracking and admission waiting, so
we can remove these from the semaphore. This allows us to remove some
now unused code from the permit as well, namely the `base_cost` which
was used to track the resource amount the permit was created with. Now
this amount is also tracked with a `resource_units` RAII object, returned
from `reader_permit::wait_admission()`, so it can be removed. Curiously,
this reduces the reader permit to be glorified semaphore pointer. Still,
the permit abstraction is worth keeping, because it allows us to make
changes to how the resource tracking part of the semaphore works,
without having to change the huge amount of code sites passing around
the permit.
And use the reader_permit for this instead. This refactoring has
revealed a pre-existing bug in the `test_lifecycle_policy`, which is
also addressed in this patch. The bug is that said policy executes
reader destructions in the background, and these are not waited for. For
some reason, the semaphore -> permit transition pushes these races over
the edge and we start seeing some of these destruction fibers still
being unfinished when test scopes are exited, causing all sorts of
trouble. The solution is to introduce a special gate that tests can use
to wait for all background work to finish, before the test scope is
exited.
All reader are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the underlying reader when
creating it. This means `row_cache::make_reader()` now also requires
a permit to be passed to it.
All reader are soon going to require a valid permit, so make sure we
have a valid permit which we can pass to the delegate reader when
creating it. This means `memtable::make_flat_reader()` now also requires
a permit to be passed to it.
Internally the permit is stored in `scanning_reader`, which is used both
for flushes and normal reads. In the former case a permit is not
required.
Now that the most prevalent users (range scan and single partition
reads) all pass valid permits we require all users to do so and
propagate the permit down towards `make_sstable_reader()`. The plan is
to use this permit for restricting the sstable readers, instead of the
semaphore the table is configured with. The various
`make_streaming_*reader()` overloads keep using the internal semaphores
as but they also create the permit before the read starts and pass it to
`make_sstable_reader()`.
In preparation of a valid permit being required to be passed to all
mutation sources, create a permit before creating the shard readers and
pass it to the mutation source when doing so. The permit is also
persisted in the `shard_mutation_querier` object when saving the reader,
which is another forward looking change, to allow the querier-cache to
use it to obtain the semaphore the read is actually registered with.
In preparation of a valid permit being required to be passed to all
mutation sources, also add a permit to the querier object, which is then
passed to the source when it is used to create a reader.
We want to move away from the current practice of selecting the relevant
read concurrency semaphore inside `table` and instead want to pass it
down from `database` so that we can pass down a semaphore that is
appropriate for the class of the query. Use the recently created
`query_class_config` struct for this. This is added as a parameter to
`data_query`, `mutation_query` and propagated down to the point where we
create the `querier` to execute the read. We are already propagating
down a parameter down the same route -- max_memory_reverse_query --
which also happens to be part of `query_class_config`, so simply replace
this parameter with a `query_class_config` one. As the lower layers are
not prepared for a semaphore passed from above, make sure this semaphore
is the same that is selected inside `table`. After the lower layers are
prepared for a semaphore arriving from above, we will switch it to be
the appropriate one for the class of the query.
This struct will serve as a container of all the query-class
dependent configuration such as the semaphore to be used and the memory
limit for unlimited queries. As there is no good place to put this, we
create a separate header for it.
Mutation sources will soon require a valid permit so make sure we have
one and pass it to the mutation sources when creating the underlying
readers.
For now, pass no_reader_permit() on call sites, deferring the obtaining
of a valid permit to later patches.
This contains a reader concurrency semaphore for the tests, that they
can use to obtain a valid permit for reads. Soon we are going to start
working towards a point where all APIs taking a permit will require a
valid one. Before we start this work we must ensure test code is able to
obtain a valid permit.
We want to make `read_permit` the single interface through which reads
interact with the concurrency limiting mechanism. So far it was only
usable to track memory consumption. Add the missing `wait_admission()`
and `consume_resources()` to the permit API. As opposed to
`reader_concurrency_semaphore::` equivalents which returned a
permit, the `reader_permit::` variants jut return
`reader_permit::resource_units` which is an RAII holder for the acquired
units. This also allows for the permit to be created earlier, before the
reader is admitted, allowing for tracking pre-admission memory usage as
well. In fact this is what we are going to do in the next patches.
This patch also introduces a `broken()` method on the reader concurrency
semaphore which resolves waiters with an exception. This method is also
called internally from the semaphore's destructor. This is needed
because the semaphore can now have external waiters, who has to be
resolved before the semaphore itself is destroyed.
We want to refactor reader_permit::memory_units to work in terms of
reader_resources, as we are planning to use it for guarding count
resources as well. This patch makes the first step: renames it from
memory_units to resources_units. Since this is a very noisy change, we
do it in a separate patch, the semantic change is in the next patch.
Tenants get their own connections for statement verbs and are further
isolated from each other by different scheduling groups. A tenant is
identified by a scheduling group and a name. When selecting the client
index for a statement verb, we look up the tenant whose scheduling group
matches the current one. This scheduling group is persisted across the
RPC call, using the name to identify the tenant on the remote end, where
a reverse lookup (name -> scheduling group) happens.
Instead of a single scheduling group to be used for all statement verbs,
messaging_service::scheduling_config now contains a list of tenants. The
first among these is the default tenant, the one we use when the current
scheduling group doesn't match that of any configured tenant.
To make this mapping easier, we reshuffle the client index assignment,
such that statement and statement-ack verbs have the idx 2 and 3
respectively, instead of 0 and 3.
The tenant configuration is configured at message service construction
time and cannot be changed after. Adding such capability should be easy
but is not needed for query classification, the current user of the
tenant concept.
Currently two tenants are configured: $user (default tenant) and
$system.
Per-user SLA means we have connection classifications determined dynamically,
as SLAs are added or removed. This means the classification information cannot
be static.
Fix by making it a non-static vector (instead of a static array), allowing it
to be extended. The scheduling group member pointer is replaced by a scheduling
group as a member pointer won't work anymore - we won't have a member to refer
to.
On the client side, we supply an isolation cookie based on the connection index
On the server side, we convert an isolation cookie back to a scheduling_group.
This has two advantages:
- rpc processes the entire connection using the scheduling group, so that code
is also isolated and accounted for
- we can later add per-user connections; the previous approach of looking at the
verb to decide the scheduling_group doesn't help because we don't have a set of
verbs per user
With this, the main group sees <0.1% usage under simple read and write loads.
Move it from a function-local static to a class static variable. We will want
to extend it in two ways:
- add more information per connection index (like the rpc isolation cookie)
- support adding more connections for per-user SLA
As a first step, make it an array of structures and make it accessible to all
of messaging_service.
In the next patches we will match reads to the appropriate reader
concurrency semaphore based on the scheduling group they run in. This
will result in a lot of system reads that are executed during startup
and that were up to now (incorrectly) using the user read semaphore to
switch to the system read semaphore. This latter has a much more
constrained concurrency, which was observed to cause system reads to
saturate and block on the semaphore, slowing down startup.
To solve this, boost the concurrency of the system read semaphore during
startup to match that of the user semaphore. This is ok, as during
startup there are no user reads to compete with. After startup, before
we start serving user reads the concurrency is reverted back to the
normal value.