We want to start tracking the memory consumption of mutation fragments.
For this we need schema and permit during construction, and on each
modification, so the memory consumption can be recalculated and pass to
the permit.
In this patch we just add the new parameters and go through the insane
churn of updating all call sites. They will be used in the next patch.
We will soon want to update the memory consumption of mutation fragment
after each modification done to it, to do that safely we have to forbid
direct access to the underlying data and instead have callers pass a
lambda doing their modifications.
Uses where this method was just used to move the fragment away are
converted to use `as_clustering_row() &&`.
Not used yet, this patch does all the churn of propagating a permit
to each impl.
In the next patch we will use it to track to track the memory
consumption of `_buffer`.
We use class template argument deduction (CTAD) in a few places, but it appears
not to work for alias templates in clang. While it looks like a clang bug, using
the class name is an improvement, so let's do that.
The lockup:
When view_builder starts all shards at some point get to a
barrier waiting for each other to pass. If any shard misses
this checkpoint, all others stuck forever. As this barrier
lives inside the _started future, which in turn is waited
on stop, the stop stucks as well.
Reasons to miss the barrier -- exception in the middle of the
fun^w start or explicit abort request while waiting for the
schema agreement.
Fix the "exception" case by unlocking the barrier promise with
exception and fix the "abort request" case by turning it into
an exception.
The bug can be reproduced by hands if making one shard never
see the schema agreement and continue looping until the abort
request.
The crash:
If the background start up fails, then the _started future is
resolved into exception. The view_builder::stop then turns this
future into a real exception caught-and-rethrown by main.cc.
This seems wrong that a failure in a background fiber aborts
the regular shutdown that may proceed otherwise.
tests: unit(dev), manual start-stop
branch: https://github.com/xemul/scylla/tree/br-view-builder-shutdown-fix-3fixes: #7077
Patch #5 leaves the seastar::async() in the 1-st phase of the
start() although can also be tuned not to produce a thread.
However, there's one more (painless) issue with the _sem usage,
so this change appears too large for the part of the bug-fix
and will come as a followup.
* 'br-view-builder-shutdown-fix-3' of git://github.com/xemul/scylla:
view_builder: Add comment about builder instances life-times
view_builder: Do sleep abortable
view_builder: Wakeup barrier on exception
view_builder: Always resolve started future to success
view_builder: Re-futurize start
view_builder: Split calculate_shard_build_step into two
view_builder: Populate the view_builder_init_state
view_builder: Fix indentation after previous patch
view_builder: Introduce view_builder_init_state
If one shard delays in seeing the schema agreement and returns on
abort request, other shards may get stuck waiting for it on the
status read barrier. Luckily with the previous patch the barrier
is exception-proof, so we may abort the waiting loop with exception
and handle the lock-up.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If an exception pops up during the view_builder::start while some
shards wait for the status-read barrier, these shards are not woken
up, thus causing the shutdown to stuck.
Fix this by setting exception on the barrier promise, resolving all
pending and on-going futures.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
If the view builder background start fails, the _started future resolves
to exceptional state. In turn, stopping the view builder keeps this state
through .finally() and aborts the shutdown very early, while it may and
should proceed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Step two turning the view_builder::start() into a chain of lambdas --
rewrite (most of) the seastar::async()'s lambda into a more "classical"
form.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The calculate_shard_build_step() has a cross-shard barrier in the middle and
passing the barrier is broken wrt exceptions that may happen before it. The
intention is to prepare this barrier passing for exception handling by turning
the view_builder::start() into a dedicated continuation lambda.
Step one in this campaign -- split the calculate_shard_build_step() into
steps called by view_builder::start():
- before the barrier
- barrier
- after the barrier
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Keep the internal calculate_shard_build_step()'s stuff on the init helper
struct, as the method in question is about to be split into a chain of
continuation lambdas.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This is the helper initialization struct that will carry the needed
objects accross continuation lambdas.
The indentation in ::start() will be fixed in the next patch.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It is no longer called once for a given view_info, so the name
"initialize" is not appropriate.
This patch splits the "initialize" method into the "make" part, which
makes a new base_info object, and the "set" part, which changes the
current base_info object attached to the view.
The view building process was accessing mutation fragments using
current table's schema. This is not correct, fragments must be
accessed using the schema of the generating reader.
This could lead to undefined behavior when the column set of the base
table changes. out_of_range exceptions could be observed, or data in
the view ending up in the wrong column.
Refs #7061.
The fix has two parts. First, we always use the reader's schema to
access fragments generated by the reader.
Second, when calling populate_views() we upgrade the fragment-wrapping
reader's schema to the base table schema so that it matches the base
table schema of view_and_base snapshots passed to populate_views().
The view_info object, which is attached to the schema object of the
view, contains a data structure called
"base_non_pk_columns_in_view_pk". This data structure contains column
ids of the base table so is valid only for a particular version of the
base table schema. This data structure is used by materialized view
code to interpret mutations of the base table, those coming from base
table writes, or reads of the base table done as part of view updates
or view building.
The base table schema version of that data structure must match the
schema version of the mutation fragments, otherwise we hit undefined
behavior. This may include aborts, exceptions, segfaults, or data
corruption (e.g. writes landing in the wrong column in the view).
Before this patch, we could get schema version mismatch here after the
base table was altered. That's because the view schema does not change
when the base table is altered.
Part of the fix is to extract base_non_pk_columns_in_view_pk into a
third entitiy called base_dependent_view_info, which changes both on
base table schema changes and view schema changes.
It is managed by a shared pointer so that we can take immutable
snapshots of it, just like with schema_ptr. When starting the view
update, the base table schema_ptr and the corresponding
base_dependent_view_info have to match. So we must obtain them
atomically, and base_dependent_view_info cannot change during update.
Also, whenever the base table schema changes, we must update
base_dependent_view_infos of all attached views (atomically) so that
it matches the base table schema.
Refs #7061.
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
`count` function was often used in various ways.
`contains` does not only express the intend of the code better but also
does it in more unified way.
This commit replaces all the occurences of the `count` with the
`contains`.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <b4ef3b4bc24f49abe04a2aba0ddd946009c9fcb2.1597314640.git.piotr@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/7018
by Piotr Sarna:
This series addresses various issues with metrics and semaphores - it mainly adds missing metrics, which makes it possible to see the length of the queues attached to the semaphores. In case of view building and view update generation, metrics was not present in these services at all, so a first, basic implementation is added.
More precise semaphore metrics would ease the testing and development of load shedding and admission control.
view_builder: add metrics
db, view: add view update generator metrics
hints: track resource_manager sending queue length
hints: add drain queue length to metrics
table: add metrics for sstable deletion semaphore
database: remove unused semaphore
C++20 introduced `contains` member functions for maps and sets for
checking whether an element is present in the collection. Previously
the code pattern looked like:
<collection>.find(<element>) != <collection>.end()
In C++20 the same can be expressed with:
<collection>.contains(<element>)
This is not only more concise but also expresses the intend of the code
more clearly.
This commit replaces all the occurences of the old pattern with the new
approach.
Tests: unit(dev)
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f001bbc356224f0c38f06ee2a90fb60a6e8e1980.1597132302.git.piotr@scylladb.com>
Move the classes representing CQL expressions (and utility functions
on them) from the `restrictions` namespace to a new namespace `expr`.
Most of the restriction.hh content was moved verbatim to
expression.hh. Similarly, all expression-related code was moved from
statement_restrictions.cc verbatim to expression.cc.
As suggested in #5763 feedback
https://github.com/scylladb/scylla/pull/5763#discussion_r443210498
Tests: dev (unit)
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
"
While working on another patch I was getting odd compiler errors
saying that a call to ::make_shared was ambiguous. The reason was that
seastar has both:
template <typename T, typename... A>
shared_ptr<T> make_shared(A&&... a);
template <typename T>
shared_ptr<T> make_shared(T&& a);
The second variant doesn't exist in std::make_shared.
This series drops the dependency in scylla, so that a future change
can make seastar::make_shared a bit more like std::make_shared.
"
* 'espindola/make_shared' of https://github.com/espindola/scylla:
Everywhere: Explicitly instantiate make_lw_shared
Everywhere: Add a make_shared_schema helper
Everywhere: Explicitly instantiate make_shared
cql3: Add a create_multi_column_relation helper
main: Return a shared_ptr from defer_verbose_shutdown
It is important that all replicas participating in a read use the same
memory limits to avoid artificial differences due to different amount of
results. The coordinator now passes down its own memory limit for reads,
in the form of max_result_size (or max_size). For unpaged or reverse
queries this has to be used now instead of the locally set
max_memory_unlimited_query configuration item.
To avoid the replicas accidentally using the local limit contained in
the `query_class_config` returned from
`database::make_query_class_config()`, we refactor the latter into
`database::get_reader_concurrency_semaphore()`. Most of its callers were
only interested in the semaphore only anyway and those that were
interested in the limit as well should get it from the coordinator
instead, so this refactoring is a win-win.
seastar::make_lw_shared has a constructor taking a T&&. There is no
such constructor in std::make_shared:
https://en.cppreference.com/w/cpp/memory/shared_ptr/make_shared
This means that we have to move from
make_lw_shared(T(...)
to
make_lw_shared<T>(...)
If we don't want to depend on the idiosyncrasies of
seastar::make_lw_shared.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
So that tests can test the `view_update_consumer` in isolation, without
having to set up the whole database machinery. In addition to less
infrastructure setup, this allows more direct checking of mutations
pushed for view generation.
The view update generation process creates two readers. One is used to
read the staging sstables, the data which needs view updates to be
generated for, and another reader for each processed mutation, which
reads the current value (pre-image) of each row in said mutation. The
staging reader is created first and is kept alive until all staging data
is processed. The pre-image reader is created separately for each
processed mutation. The staging reader is not restricted, meaning it
does not wait for admission on the relevant reader concurrency
semaphore, but it does register its resource usage on it. The pre-image
reader however *is* restricted. This creates a situation, where the
staging reader possibly consumes all resources from the semaphore,
leaving none for the later created pre-image reader, which will not be
able to start reading. This will block the view building process meaning
that the staging reader will not be destroyed, causing a deadlock.
This patch solves this by making the staging reader restricted and
making it evictable. To prevent thrashing -- evicting the staging reader
after reading only a really small partition -- we only make the staging
reader evictable after we have read at least 1MB worth of data from it.
The schema_tables.hh -> migration_manager.hh couple seems to work as one
of "single header for everyhing" creating big blot for many seemingly
unrelated .hh's.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
"
This is the first stage of replacing the existing restrictions code with a new representation. It adds a new class `expression` to replace the existing class `restriction`. Lots of the old code is deleted, though not all -- that will come in subsequent stages.
Tests: unit (dev, debug restrictions_test), dtest (next-gating)
"
* dekimir-restrictions-rewrite:
cql3/restrictions: Drop dead code
cql3/restrictions: Use free functions instead of methods
cql3/restrictions: Create expression objects
cql3/restrictions: Add free functions over new classes
cql3/restrictions: Add new representation
Instead of `restriction` class methods, use the new free functions.
Specific replacement actions are listed below.
Note that class `restrictions` (plural) remains intact -- both its
methods and its type hierarchy remain intact for now.
Ensure full test coverage of the replacement code with new file
test/boost/restrictions_test.cc and some extra testcases in
test/cql/*.
Drop some existing tests because they codify buggy behaviour
(reference #6369, #6382). Drop others because they forbid relation
combinations that are now allowed (eg, mixing equality and
inequality, comparing to NULL, etc.).
Here are some specific categories of what was replaced:
- restriction::is_foo predicates are replaced by using the free
function find_if; sometimes it is used transitively (see, eg,
has_slice)
- restriction::is_multi_column is replaced by dynamic casts (recall
that the `restrictions` class hierarchy still exists)
- utility methods is_satisfied_by, is_supported_by, to_string, and
uses_function are replaced by eponymous free functions; note that
restrictions::uses_function still exists
- restriction::apply_to is replaced by free function
replace_column_def
- when checking infinite_bound_range_deletions, the has_bound is
replaced by local free function bounded_ck
- restriction::bounds and restriction::value are replaced by the more
general free function possible_lhs_values
- using free functions allows us to simplify the
multi_column_restriction and token_restriction hierarchies; their
methods merge_with and uses_function became identical in all
subclasses, so they were moved to the base class
- single_column_primary_key_restrictions<clustering_key>::needs_filtering
was changed to reuse num_prefix_columns_that_need_not_be_filtered,
which uses free functions
Fixes#5799.
Fixes#6369.
Fixes#6371.
Fixes#6372.
Fixes#6382.
Signed-off-by: Dejan Mircevski <dejan@scylladb.com>
Modified log message in view_builder::calculate_shard_build_step to make it distinct from the one in view_builder::execute, changed their logging level to warning, since we're continuing even if we handle an exception.
Fixes#4600
The seastar api v4 changes the return type of when_all_succeed. This
patch adds discard_result when that is best solution to handle the
change.
This doesn't do the actual update to v4 since there are still a few
issues left to fix in seastar. A patch doing just the update will
follow.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20200617233150.918110-1-espindola@scylladb.com>
Merged pull request https://github.com/scylladb/scylla/pull/6516 from
Piotr Sarna:
This series adds error injection points to materialized view paths:
view update generation from staging sstables;
view building;
generating view updates from user writes.
This series comes with a corresponding dtest pull request which adds some
test cases based on error injection.
Fixes#6488
In current mutate_MV() code it's possible for a local endpoint
to become a target for a network operation. That's the source
of occasional `broken promise` benign error messages appearing,
since the mutation is actually applied locally, so there's no point
in creating a write response handler - the node will not send a response
to itself via network.
While at it, the code is deduplicated a little bit - with the paths
simplified, it's easier to ensure that a local endpoint is never
listed as a target for remote network operations.
Fixes#5459
Tests: unit(dev),
dtest(materialized_views_test.TestMaterializedViews.add_dc_during_mv_insert_test)
"
Currently we classify queries as "system" or "user" based on the table
they target. The class of a query determines how the query is treated,
currently: timeout, limits for reverse queries and the concurrency
semaphore. The catch is that users are also allowed to query system
tables and when doing so they will bypass the limits intended for user
queries. This has caused performance problems in the past, yet the
reason we decided to finally address this is that we want to introduce a
memory limit for unpaged queries. Internal (system) queries are all
unpaged and we don't want to impose the same limit on them.
This series uses scheduling groups to distinguish user and system
workloads, based on the assumption that user workloads will run in the
statement scheduling group, while system workloads will run in the main
(or default) scheduling group, or perhaps something else, but in any
case not in the statement one. Currently the scheduling group of reads
and writes is lost when going through the messaging service, so to be
able to use scheduling groups to distinguish user and system reads this
series refactors the messaging service to retain this distinction across
verb calls. Furthermore, we execute some system reads/writes as part of
user reads/writes, such as auth and schema sync. These processes are
tagged to run in the main group.
This series also centralises query classification on the replica and
moves it to a higher level. More specifically, queries are now
classified -- the scheduling group they run in is translated to the
appropriate query class specific configuration -- on the database level
and the configuration is propagated down to the lower layers.
Currently this query class specific configuration consists of the reader
concurrency semaphore and the max memory limit for otherwise unlimited
queries. A corollary of the semaphore begin selected on the database
level is that the read permit is now created before the read starts. A
valid permit is now available during all stages of the read, enabling
tracking the memory consumption of e.g. the memtable and cache readers.
This change aligns nicely with the needs of more accurate reader memory
tracking, which also wants a valid permit that is available in every layer.
The series can be divided roughly into the following distinct patch
groups:
* 01-02: Give system read concurrency a boost during startup.
* 03-06: Introduce user/system statement isolation to messaging service.
* 07-13: Various infrastructure changes to prepare for using read
permits in all stages of reads.
* 14-19: Propagate the semaphore and the permit from database to the
various table methods that currently create the permit.
* 20-23: Migrate away from using the reader concurrency semaphore for
waiting for admission, use the permit instead.
* 24: Introduce `database::make_query_config()` and switch the database
methods needing such a config to use it.
* 25-31: Get rid of all uses of `no_reader_permit()`.
* 32-33: Ban empty permits for good.
* 34: querier_cache: use the queriers' permits to obtain the semaphore.
Fixes: #5919
Tests: unit(dev, release, debug),
dtest(bootstrap_test.py:TestBootstrap.start_stop_test_node), manual
testing with a 2 node mixed cluster with extra logging.
"
* 'query-class/v6' of https://github.com/denesb/scylla: (34 commits)
querier_cache: get semaphore from querier
reader_permit: forbid empty permits
reader_permit: fix reader_resources::operator bool
treewide: remove all uses of no_reader_permit()
database: make_multishard_streaming_reader: pass valid permit to multi range reader
sstables: pass valid permits to all internal reads
compaction: pass a valid permit to sstable reads
database: add compaction read concurrency semaphore
view: use valid permits for reads from the base table
database: use valid permit for counter read-before-write
database: introduce make_query_class_config()
reader_concurrency_semaphore: remove wait_admission and consume_resources()
test: move away from reader_concurrency_semaphore::wait_admission()
reader_permit: resource_units: introduce add()
mutation_reader: restricted_reader: work in terms of reader_permit
row_cache: pass a valid permit to underlying read
memtable: pass a valid permit to the delegate reader
table: require a valid permit to be passed to most read methods
multishard_mutation_query: pass a valid permit to shard mutation sources
querier: add reader_permit parameter and forward it to the mutation_source
...
Until now, view updates were generated with a bunch of random
time points, because the interface was not adjusted for passing
a single time point. The time points were used to determine
whether cells were alive (e.g. because of TTL), so it's better
to unify the process:
1. when generating view updates from user writes, a single time point
is used for the whole operation
2. when generating view updates via the view building process,
a single time point is used for each build step
NOTE: I don't see any reliable and deterministic way of writing
test scenarios which trigger problems with the old code.
After #6488 is resolved and error injection is integrated
into view.cc, tests can be added.
Fixes#6429
Tests: unit(dev)
Message-Id: <f864e965eb2e27ffc13d50359ad1e228894f7121.1590070130.git.sarna@scylladb.com>
View update generation involves reading existing values from the base
table, which will soon require a valid permit to be passed to it, so
make sure we create and pass a valid permit to these reads.
We use `database::make_query_class_config()` to obtain the semaphore for
the read which selects the appropriate user/system semaphore based on
the scheduling group the base table write is running in.
In order to add tracing to places where it can be useful,
e.g. materialized view updates and hinted handoff, tracing state
is propagated to all applicable call sites.
When generating view updates, an endpoint can appear both
as a primary paired endpoint for the view update, and as a pending
endpoint (due to range movements). In order not to generate
the same update twice for the same endpoint, the paired endpoint
is removed from the list of pending endpoints if present.
Fixes#5459
Tests: unit(dev),
dtest(TestMaterializedViews.add_dc_during_mv_insert_test)
The startup routine performs some bookkeeping operations on views,
and so do these events:
- on_create_view;
- on_drop_view;
- on_update_view.
Since the above events are guarded with a semaphore, the startup
routine should also take the same semaphore - in order to ensure
that all bookkeeping operations are serialized.
Refs #6094
The future was marked with a `FIXME: discarded future`, but there's really
no reason not to wait for it, and it was probably meant to be waited for
since its implementation.