Not a completely mechanical transition. The consumer has to generate its
mutation via a mutation_rebuilder_v2 as mutation fragment v2 cannot be
applied to mutations directly yet.
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
First, it's to fix the discarded future during the register. The
future is not actually such, as it's always the no-op ready one as
at that stage the view_update_generator is neither aborted nor is
in throttling state.
Second, this change is to keep database start-up code in main
shorter and cleaner. Registering staging sstables belongs to the
view_update_generator start code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Prepare for updating seastar submodule to a change
that requires deferred actions to be noexcept
(and return void).
Test: unit(dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Get rid of unused includes of seastar/util/{defer,closeable}.hh
and add a few that are missing from source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch flips two "switches":
1) It switches admission to be up-front.
2) It changes the admission algorithm.
(1) by now all permits are obtained up-front, so this patch just yanks
out the restricted reader from all reader stacks and simultaneously
switches all `obtain_permit_nowait()` calls to `obtain_permit()`. By
doing this admission is now waited on when creating the permit.
(2) we switch to an admission algorithm that adds a new aspect to the
existing resource availability: the number of used/blocked reads. Namely
it only admits new reads if in addition to the necessary amount of
resources being available, all currently used readers are blocked. In
other words we only admit new reads if all currently admitted reads
requires something other than CPU to progress. They are either waiting
on I/O, a remote shard, or attention from their consumers (not used
currently).
We flip these two switches at the same time because up-front admission
means cache reads now need to obtain a permit too. For cache reads the
optimal concurrency is 1. Anything above that just increases latency
(without increasing throughput). So we want to make sure that if a cache
reader hits it doesn't get any competition for CPU and it can run to
completion. We admit new reads only if the read misses and has to go to
disk.
Another change made to accommodate this switch is the replacement of the
replica side read execution stages which the reader concurrency
semaphore as an execution stage. This replacement is needed because with
the introduction of up-front admission, reads are not independent of
each other any-more. One read executed can influence whether later reads
executed will be admitted or not, and execution stages require
independent operations to work well. By moving the execution stage into
the semaphore, we have an execution stage which is in control of both
admission and running the operations in batches, avoiding the bad
interaction between the two.
Adding an non-empty set of sstables as the set of all sstables in
an sstable_set could cause inconsistencies with the values returned
by select_sstable_runs because the _all_runs map would still be
initialized empty. For similar reasons, the provided sstable_set_impl
should also be empty.
Dispel doubts by removing the unordered_set from the constructor, and
adding a check of emptiness of the sstable_set_impl.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
This reverts commit dc77d128e9. It was reverted
due to a strange and unexplained diff, which is now explained. The
HEAD on the working directory being pulled from was set back, so git
thought it was merging the intended commits, plus all the work that was
committed from HEAD to master. So it is safe to restore it.
This reverts commit 0aa1f7c70a, reversing
changes made to 72c59e8000. The diff is
strange, including unrelated commits. There is no understanding of the
cause, so to be safe, revert and try again.
Lower level functions such as `create_single_key_sstable_reader`
were made methods of `sstable_set`.
The motivation is that each concrete sstable_set
may decide to use a better sstable reading algorithm specific to the
data structures used by this sstable_set. For this it needs to access
the set's internals.
A nice side effect is that we moved some code out of table.cc
and database.hh which are huge files.
Require a schema and an operation name to be given to each permit when
created. The schema is of the table the read is executed against, and
the operation name, which is some name identifying the operation the
permit is part of. Ideally this should be different for each site the
permit is created at, to be able to discern not only different kind of
reads, but different code paths the read took.
As not all read can be associated with one schema, the schema is allowed
to be null.
The name will be used for debugging purposes, both for coredump
debugging and runtime logging of permit-related diagnostics.
We want to start tracking the memory consumption of mutation fragments.
For this we need schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and pass to
the permit.
In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
sstable_manager will soon wait for all sstables under its
control to be deleted (if so marked), but that can't happen
if someone is holding on to references to those sstables.
To allow sstables_manager::stop() to work, drop remaining
queued work when terminating.
fea83f6 introduced a race between processing (and hence removing)
sstables from `_sstables_with_tables` and registering new ones. This
manifested in sstables that were added concurrently with processing a
batch for the same sstables being dropped and the semaphore units
associated with them not returned. This resulted in repairs being
blocked indefinitely as the units of the semaphore were effectively
leaked.
This patch fixes this by moving the contents of `_sstables_with_tables`
to a local variable before starting the processing. A unit test
reproducing the problem is also added.
Fixes: #6892
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20200817160913.2296444-1-bdenes@scylladb.com>
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:
template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);
template <typename T>
shared_ptr<T> make_shared(T&& a);
The second variant doesn't exist in std::make_shared.
This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"
* 'espindola/make_shared' of https://github.com/espindola/scylla:
Everywhere: Explicitly instantiate make_lw_shared
Everywhere: Add a make_shared_schema helper
Everywhere: Explicitly instantiate make_shared
cql3: Add a create_multi_column_relation helper
main: Return a shared_ptr from defer_verbose_shutdown
It is important that all replicas participating in a read use the same
memory limits to avoid artificial differences due to different amount of
results. The coordinator now passes down its own memory limit for reads,
in the form of max_result_size (or max_size). For unpaged or reverse
queries this has to be used now instead of the locally set
max_memory_unlimited_query configuration item.
To avoid the replicas accidentally using the local limit contained in
the `query_class_config` returned from
`database::make_query_class_config()`, we refactor the latter into
`database::get_reader_concurrency_semaphore()`. Most of its callers were
only interested in the semaphore only anyway and those that were
interested in the limit as well should get it from the coordinator
instead, so this refactoring is a win-win.
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_lw_shared(T(...)
to
make_lw_shared<T>(...)
If we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.
This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
And pass it to `make_range_sstable_reader()` when creating the reader,
thus allowing the incremental selector created therein to exploit the
fact that staging sstables are disjoint (in the case of repair and
streaming at least). This should reduce the memory consumption of the
staging reader considerably when reading from a lot of sstables.
Streaming is handled by just once group for CPU scheduling, so
separating it into read and write classes for I/O is artificial, and
inflates the resources we allow for streaming if both reads and writes
happen at the same time.
Merge both classes into one class ("streaming") and adjust callers. The
merged class has 200 shares, so it reduces streaming bandwidth if both
directions are active at the same time (which is rare; I think it only
happens in view building).
View update generation involves reading existing values from the base
table, which will soon require a valid permit to be passed to it, so
make sure we create and pass a valid permit to these reads.
We use `database::make_query_class_config()` to obtain the semaphore for
the read which selects the appropriate user/system semaphore based on
the scheduling group the base table write is running in.
There is no reason to read a single SSTable at a time from the staging
directory. Moving SSTables from staging directory essentially involves
scanning input SSTables and creating new SSTables (albeit in a different
directory).
We have a mechanism that does that: compactions. In a follow up patch, I
will introduce a new specialization of compaction that moves SSTables
from staging (potentially compacting them if there are plenty).
In preparation for that, some signatures have to be changed and the
view_updating_consumer has to be more compaction friendly. Meaning:
- Operating with an sstable vector
- taking a table reference, not a database
Because this code is a bit fragile and the reviewer set is fundamentally
different from anything compaction related, I am sending this separately
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The view_update_generator acceps (and keeps) database and storage_proxy,
the latter is only needed to initialize the view_updating_consumer which,
in turn, only needs it to get database from (to find column family).
This can be relaxed by providing the database from _generator to _consumer
directly, without using the storage_proxy in between.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20200207112427.18419-1-xemul@scylladb.com>
The former was never really more than a reader_permit with one
additional method. Currently using it doesn't even save one from any
includes. Now that readers will be using reader_permit we would have to
pass down both to mutation_source. Instead get rid of
reader_resource_tracker and just use reader_permit. Instead of making it
a last and optional parameter that is easy to ignore, make it a
first class parameter, right after schema, to signify that permits are
now a prominent part of the reader API.
This -- mostly mechanical -- patch essentially refactors mutation_source
to ask for the reader_permit instead of reader_resource_tracking and
updates all usage sites.
Hold the _sstable_deletion_sem while moving sstables from the staging directory
so not to move them under the feet of table::snapshot.
Fixes#5340
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Consumer may throw, in this case, break from the loop and retry.
move_sstable_from_staging_in_thread may theoretically throw too,
ignore the error in this case since the sstable was already processed,
individual move failures are already ignored and moving from staging
will be retried upon restart.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>