Currently, when generating and propagating view updates, if we notice
that we've already exceeded the time limit, we throw an exception
inheriting from `request_timeout_exception`, to later catch and
log it when finishing request handling. However, when catching, we
only check timeouts by matching the `timed_out_error` exception,
so the exception thrown in the view update code is not registered
as a timeout exception, but an unknown one. This can cause tests
which were based on the log output to start failing, as in the past
we were noticing the timeout at the end of the request handling
and using the `timed_out_error` to keep processing it and now, even
though we do notice the timeout even earlier, due to it's type we
log an error to the log, instead of treating it as a regular timeout.
In this patch we make the error thrown on timeout during view updates
inherit from `timed_out_error` instead of the `request_timeout_exception`
(it is also moved from the "exceptions" directory, where we define
exceptions returned to the user).
Aside from helping with the issue described above, we also improve our
metrics, as the `request_timeout_exception` is also not checked for
in the `is_timeout_exception` method, and because we're using it to
check whether we should update write timeout metrics, they will only
start getting updated after this patch.
Fixes#19261
(cherry picked from commit 4aa7ada771)
Closesscylladb/scylladb#19262
This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed.
See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions.
This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh.
The existing mechanism works in the following way:
* Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes
* Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking
* We keep track of the percent of consumed units on each node, this is called `view update backlog`.
* Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level.
This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates.
To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`.
The new algorithm of view update generation looks something like this:
```c++
for(;;) {
auto updates = generate_updates_batch_with_max_100_rows();
co_await seastar::sleep(calculate_sleep_time_from_backlogs());
spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379Closesscylladb/scylladb#16819
* github.com:scylladb/scylladb:
test: add test for bad_allocs during large mv queries
mv: throttle view update generation for large queries
exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception
db/view: extract view throttling delay calculation to a global function
view_update_generator: add get_storage_proxy()
storage_proxy: make view backlog getters public
This patch adds a test for reproducing issue #12379, which is
being fixed in #16819.
The test case works by creating a table with a materialized
view, and then performing a partition delete query on it.
At the same time, it uses injections to limit the memory
to a level lower than usual, in order to increase the
consistency of the test, and to limit its runtime.
Before #16819, the test would exceed the limit and fail,
and now the next allocation is throttled using a sleep.
For every mutation applied to the base table we have to
generate the corresponding materialized view table updates.
In case of simple requests, like INSERT or UPDATE, the number
of view updates generated per base table mutation is limited
to at most a few view table updates per base table update.
The situation is different for DELETE queries, which can delete
the whole partitions or clustering ranges. Range deletions are
fast on the base table, but for the view table the situation
is different. Deleting a single partition in the base table
will generate as many singular view updates as there are rows
in the deleted partition, which could potentially be in the millions.
To prevent OOM view updates are generated in batches of at most 100 rows.
There is a loop which generates the next batch of updates, spawns tasks
to send them to remote nodes, generates another batch and so on.
The problem is that there is no concurrency control - each batch is scheduled
to be sent in the background, but the following batch is generated without
waiting for the previously generated updates to be sent. This can lead to
unbounded concurrency and OOM.
To protect against this view update generation should be limited somehow.
There is an existing mechanism for limiting view updates - throttling.
We keep track of how many pending view updates there are, in the view backlog,
and delay responses to the client based on this backlog's fullness.
For a well behaved client with limited concurrency this will slow down
the amount of incoming requests until it reaches an optimal point.
This works for simple queries (INSERT, UPDATE, ...), but it doesn't do anything
for range DELETEs. A DELETE is a single request that generates millions of view
updates, delaying client response doesn't help.
The throttling mechanism could be extend to cover this case - we could treat the
DELETE request like any other client and force it to wait before sending more updates.
This commit implements this approach - before sending the next batch of updates
the generator is forced to sleep for a bit of time, calculated using the exisiting
throttling equation.
The more full the backlog gets the more the generator will have to sleep for,
and hopefully this will prevent overloading the system with view updates.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When view builder is drained (it now happens very early, but next patch
moves this into regular drain) it waits for all on-going view build
steps to complete. This includes waiting for any outstanding proxy view
writes to complete as well.
View writes in proxy have very high timeout of 5 minutes but they are
cancellable. However, canecelling of such writes happens in proxy's
drain_on_shutdown() call which, in turn, happens pretty late on
shutdown. Effectively, by the time it happens all view writes mush have
completed already, so stop-time cancelling doesn't really work nowadays.
Next patch makes view builder drain happen a bit later during shutdown,
namely -- _after_ shutting down messaging service. When it happen that
late, non-working view writes cancellation becomes critical, as view
builder drain hangs for aforementioned 5 minutes. This patch explicitly
cancels all view writes when view builder stops.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The _view_update_concurrency_sem field on database propagates itself via
keyspace config down to table config and view_update_generator then
grabs one via table:: helper. That's an overkil, view_update_generator
has direct reference on the database and can get this semaphore from
there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now when the two methods belong to another class, move the code itself
to db/view , where the class itself resides.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.
Replace with seastar::future::get(), which does the same thing.
This patch reverts commit 10f8f13b90 from
November 2022. That commit added to the "view update generator", the code
which builds view updates for staging sstables, a filter that ignores
ranges that do not belong to this node. However,
1. I believe this filter was never necessary, because the view update
code already silently ignores base updates which do not belong to
this replica (see get_view_natural_endpoint()). After all, the view
update needs to know that this replica is the Nth owner of the base
update to send its update to the Nth view replica, but if no such
N exists, no view update is sent.
2. The code introduced for that filter used a per-keyspace replication
map, which was ok for vnodes but no longer works for tablets, and
causes the operation using it to fail.
3. The filter was used every time the "view update generator" was used,
regardless of whether any cleanup is necessary or not, so every
such operation would fail with tablets. So for example the dtest
test_mvs_populating_from_existing_data fails with tablets:
* This test has view building in parallel with automatic tablet
movement.
* Tablet movement is streaming.
* When streaming happens before view building has finished, the
streamed sstables get "view update generator" run on them.
This causes the problematic code to be called.
Before this patch, the dtest test_mvs_populating_from_existing_data
fails when tablets are enabled. After this patch, it passes.
Fixes#16598
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The "view update generator" is responsible for generating view updates
for staging sstables (such as coming from repair). If the processing
fails, the code retries - immediately. If there is some persistent bug,
such as issue #16598, we will have a tight loop of error messages,
potentially a gigabyte of identical messages every second.
In this patch we simply add a sleep of one second after view update
generation fails before retrying. We can still get many identical
error messages if there is some bug, but not more than one per second.
Refs #16598.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.
Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Patch 967ebacaa4 (view_update_generator: Move abort kicking to
do_abort()) moved unplugging v.u.g from database from .stop() to
.do_abort(). The latter call happens very early on stop -- once scylla
receives SIGINT. However, database may still need v.u.g. plugged to
flush views.
This patch moves unplug to later, namely to .stop() method of v.u.g.
which happens after database is drained and should no longer continue
view updates.
fixes: #16001
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#16091
When v.u.g. stops is first aborts the generation background fiber by
requesting abort on the internal abort source and signalling the fiber
in case it's waiting. Right now v.u.g.::stop() is defer-scheduled last
in main(), so this move doesn't change much -- when stop_signal fires,
it will kick the v.u.g.::do_abort() just a bit earlier, there's nothing
that would happen after it before real ::stop() is called that depends
on it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
As a preparation for ensuring access safety for column families
related maps, add tables_metadata, access to members of which
would be protected by rwlock.
Very helpful for user to understand how fast view update generation
is processing the staging sstables. Today, logs are completely
silent on that. It's not uncommon for operators to peek into
staging dir and deduce the throughput based on removal of files,
which is terrible.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).
So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command
The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields
Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)
Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile
The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#13963
Now the mutate_MV is the method of v.u.generator which has reference to
the sharded<storage_proxy>. Few helper static wrappers are patched to
get the needed proxy or database reference from the mutate_MV call.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The consumer is in fact pushing the updates and _that_'s the component
that would really need the view_update_generator at hand. The consumer
is created from the generator itself so no troubles getting the pointer.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The database is low-level service and currently view update generator
implicitly depend on it via storage proxy. However, database does need
to push view updates with the help of mutate_MV helper, thus adding the
dependency loop.
This patch exploits the fact that view updates start being pushed late
enough, by that time all other service, including proxy and view update
generator, seem to be up and running. This allows a "weak dependency"
from database to view update generator, like there's one from database
to system keyspace already.
So in this patch the v.u.g. puts the shared-from-this pointer onto the
database at the time it starts. On stop it removes this pointer after
database is drained and (hopefully) all view updates are pushed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The generator will be responsible for spreading view updates with the
help of mutate_MV helper. The latter needs storage proxy to operate, so
the generator gets this dependency in advance.
There's no need to change start/stop order at the moment, generator
already starts after and stops before proxy. Also, services that have
generator as dependency are not required by proxy (even indirectly) so
no circular dependency is produced at this point.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
And propagate it down to where it is created. This will be used to add
trace points for semaphore related events, but this will come in the
next patches.
The initial intent was to reduce the fanout of shared_sstable.hh through
v.u.g.hh -> cql_test_env.hh chain, but it also resulted in some shots
around v.u.g.hh -> database.hh inclusion.
By and large:
- v.u.g.hh doesn't need database.hh
- cql_test_env.hh doesn't need v.u.g.hh (and thus -- the
shared_sstable.hh) but needs database.hh instead
- few other .cc files need v.u.g.hh directly as they pulled it via
cql_test_env.hh before
- add forward declarations in few other places
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes#12952
Since they are currently not cleaned up by cleanup compaction
filter their tokens, processing only tokens owned by the
current node (based on the keyspace replication strategy).
Refs #9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Today, we're completely blind about the progress of view updating
on Staging files. We don't know how long it will take, nor how much
progress we've made.
This patch adds visibility with a new metric that will inform
the number of bytes to be processed from Staging files.
Before any work is done, the metric tell us the total size to be
processed. As view updating progresses, the metric value is
expected to decrease, unless work is being produced faster than
we can consume them.
We're piggybacking on sstables::read_monitor, which allows the
progress metric to be updated whenever the SSTable reader makes
progress.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes#11751
clone staging sstables so their content may be compacted while
views are built. When done, the hard-linked copy in the staging
subdirectory will be simply unlinked.
Fixes#9559
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We don't have to go over all sstables in the table to select the
staging sstables out of them, we can get it directly from the
_sstables_staging map.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
It's potentially a bit more efficient since
t.get_sstables is called only once, while
t.shared_from_this() is called per staging sstable.
Also, prepare for the following patches that modify
this function further.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The sstable set param isn't being used anywhere, and it's also buggy
as sstable run list isn't being updated accordingly. so it could happen
that set contains sstables but run list is empty, introducing
inconsistency.
we're fortunate that the bug wasn't activated as it would've been
a hard one to catch. found this while auditting the code.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220617203438.74336-1-raphaelsc@scylladb.com>
Not a completely mechanical transition. The consumer has to generate its
mutation via a mutation_rebuilder_v2 as mutation fragment v2 cannot be
applied to mutations directly yet.
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes we applied mechanically with a script, except to
licenses/README.md.
Closes#9937
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.
References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.
scylla-gdb.py is adjusted to look for both the new and old names.
First, it's to fix the discarded future during the register. The
future is not actually such, as it's always the no-op ready one as
at that stage the view_update_generator is neither aborted nor is
in throttling state.
Second, this change is to keep database start-up code in main
shorter and cleaner. Registering staging sstables belongs to the
view_update_generator start code.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Prepare for updating seastar submodule to a change
that requires deferred actions to be noexcept
(and return void).
Test: unit(dev, debug)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Get rid of unused includes of seastar/util/{defer,closeable}.hh
and add a few that are missing from source files.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch flips two "switches":
1) It switches admission to be up-front.
2) It changes the admission algorithm.
(1) by now all permits are obtained up-front, so this patch just yanks
out the restricted reader from all reader stacks and simultaneously
switches all `obtain_permit_nowait()` calls to `obtain_permit()`. By
doing this admission is now waited on when creating the permit.
(2) we switch to an admission algorithm that adds a new aspect to the
existing resource availability: the number of used/blocked reads. Namely
it only admits new reads if in addition to the necessary amount of
resources being available, all currently used readers are blocked. In
other words we only admit new reads if all currently admitted reads
requires something other than CPU to progress. They are either waiting
on I/O, a remote shard, or attention from their consumers (not used
currently).
We flip these two switches at the same time because up-front admission
means cache reads now need to obtain a permit too. For cache reads the
optimal concurrency is 1. Anything above that just increases latency
(without increasing throughput). So we want to make sure that if a cache
reader hits it doesn't get any competition for CPU and it can run to
completion. We admit new reads only if the read misses and has to go to
disk.
Another change made to accommodate this switch is the replacement of the
replica side read execution stages which the reader concurrency
semaphore as an execution stage. This replacement is needed because with
the introduction of up-front admission, reads are not independent of
each other any-more. One read executed can influence whether later reads
executed will be admitted or not, and execution stages require
independent operations to work well. By moving the execution stage into
the semaphore, we have an execution stage which is in control of both
admission and running the operations in batches, avoiding the bad
interaction between the two.