The following command was executed to get the
list of headers that did not contain '#pragma once':
'grep -rnw . -e "#pragma once" --include *.hh -L'
This change adds a missing include guard to the headers
that did not contain any guard.
Signed-off-by: Patryk Wrobel <patryk.wrobel@scylladb.com>
Closes scylladb/scylladb#19626
This series is another approach to https://github.com/scylladb/scylladb/pull/18646 and https://github.com/scylladb/scylladb/pull/19181. In this series we only change where the view backlog gets
updated - we do not ensure that the view update backlog returned in a response is necessarily the backlog
that increased due to the corresponding write; the returned backlog may be outdated by up to 10ms. Because
this series does not include that change, it's considerably less complex and it doesn't modify the common
write path, so no particular performance considerations were needed in that context. The issue being fixed
is still the same; the full description can be seen below.
When a replica applies a write on a table which has a materialized view,
it generates view updates. These updates take memory, which is tracked
by `database::_view_update_concurrency_sem`, separately on each shard.
The ratio of units taken from the semaphore to the semaphore limit
is the shard's view update backlog. Based on these backlogs, we want
to estimate how busy a node is with its view update work. We do that
by taking the max backlog across all shards.
To avoid excessive cross-shard operations, the node's (max) backlog isn't
recalculated each time we need it, but at most once per 10ms (the `_interval`), with an optimization: the backlog of the calculating shard is always immediately up to date (we don't need cross-shard operations for it):
```
update_backlog node_update_backlog::fetch() {
    auto now = clock::now();
    if (now >= _last_update.load(std::memory_order_relaxed) + _interval) {
        _last_update.store(now, std::memory_order_relaxed);
        auto new_max = boost::accumulate(
                _backlogs,
                update_backlog::no_backlog(),
                [] (const update_backlog& lhs, const per_shard_backlog& rhs) {
                    return std::max(lhs, rhs.load());
                });
        _max.store(new_max, std::memory_order_relaxed);
        return new_max;
    }
    return std::max(fetch_shard(this_shard_id()), _max.load(std::memory_order_relaxed));
}
```
For the same reason, even when we do calculate the new node's backlog,
we don't read from the `_view_update_concurrency_sem`. Instead, for
each shard we also store an `update_backlog` atomic, which we use for
the calculation:
```
struct per_shard_backlog {
    // Multiply by 2 to defeat the prefetcher
    alignas(seastar::cache_line_size * 2) std::atomic<update_backlog> backlog = update_backlog::no_backlog();
    need_publishing need_publishing = need_publishing::no;

    update_backlog load() const {
        return backlog.load(std::memory_order_relaxed);
    }
};

std::vector<per_shard_backlog> _backlogs;
```
Due to this distinction, the `update_backlog` atomic needs to be updated
separately whenever the `_view_update_concurrency_sem` changes.
This is done by calling `storage_proxy::update_view_update_backlog`, which reads the `_view_update_concurrency_sem` of the shard (in `database::get_view_update_backlog`)
and then calls `node_update_backlog::add`, where the read backlog
is stored in the atomic:
```
void storage_proxy::update_view_update_backlog() {
    _max_view_update_backlog.add(get_db().local().get_view_update_backlog());
}

void node_update_backlog::add(update_backlog backlog) {
    _backlogs[this_shard_id()].backlog.store(backlog, std::memory_order_relaxed);
    _backlogs[this_shard_id()].need_publishing = need_publishing::yes;
}
```
For this implementation of calculating the node's view update backlog to work,
we need the atomics to be updated correctly when the semaphores of corresponding
shards change.
The main event where the view update backlog changes is an incoming write
request. That's why, when handling the request and preparing a response,
we update the backlog by calling `storage_proxy::get_view_update_backlog` (also
because we want to read the backlog and send it in the response):
Backlog update after local view updates (`storage_proxy::send_to_live_endpoints` in `mutate_begin`):
```
auto lmutate = [handler_ptr, response_id, this, my_address, timeout] () mutable {
    return handler_ptr->apply_locally(timeout, handler_ptr->get_trace_state())
            .then([response_id, this, my_address, h = std::move(handler_ptr), p = shared_from_this()] {
        // make mutation alive until it is processed locally, otherwise it
        // may disappear if write timeouts before this future is ready
        got_response(response_id, my_address, get_view_update_backlog());
    });
};
```
Backlog update after remote view updates (`storage_proxy::remote::handle_write`):
```
auto f = co_await coroutine::as_future(send_mutation_done(netw::messaging_service::msg_addr{reply_to, shard}, trace_state_ptr,
        shard, response_id, p->get_view_update_backlog()));
```
Now assume that on a certain node we have a write request received on shard A,
which updates a row on shard B (A!=B). As a result, shard B will generate view
updates and consume units from its `_view_update_concurrency_sem`, but will
not update its atomic in `_backlogs` yet. Because both shards in the example
are on the same node, shard A will perform a local write, calling `lmutate` shown
above. In the `lmutate` call, `apply_locally` will initiate the actual write on
shard B, and `storage_proxy::update_view_update_backlog` will be called back
on shard A. Nowhere will the backlog atomic on shard B get updated, even
though it grew due to the view updates generated there.
Currently, what we calculate there doesn't matter much - it's only used for the
MV flow control delays, so in this scenario we may only overload
a replica, causing failed replica writes which will later be retried as hints. However,
when we add MV admission control, the calculated backlog will be the difference
between an accepted and a rejected request.
Fixes: https://github.com/scylladb/scylladb/issues/18542
Without admission control (https://github.com/scylladb/scylladb/pull/18334), this patch doesn't have much effect, so I'm marking it as backport/none
Closes scylladb/scylladb#19341
* github.com:scylladb/scylladb:
test: add test for view backlog not being updated on correct shard
test: move auxiliary methods for waiting until a view is built to util
mv: update view update backlog when it increases on correct shard
This patch adds a test reproducing issue https://github.com/scylladb/scylladb/issues/18542.
The test performs writes on a table with a materialized view and
checks that the view backlog increases. To get the current view
update backlog, a new metric "view_update_backlog" is added to
the `storage_proxy` metrics. It differs from the `database` metric
with the same name in that it takes the backlog from the
max_view_update_backlog, which keeps the (possibly slightly outdated)
view update backlogs from all shards, instead of reading the
view_update_semaphore on which the backlog is directly based.
When performing a write, we should update the view update backlog
on the shard where the mutation is actually applied. Currently,
we only update it on the shard that initially received
the write request (where it didn't change at all), and as a result,
the backlog on the correct shard and the aggregated max view update
backlog are not updated at all.
This patch enables updating the backlog on the correct shard. The
update is now performed just after the view generation and propagation
finishes, so that all backlog increases are noted and the backlog is
ready to be used in the write response.
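A rough sketch of the changed flow (the function name `apply_mutation_and_view_updates` is a placeholder for the real call site; only `update_view_update_backlog` is the name used in this series):
```
future<> apply_on_owning_shard(storage_proxy& sp, mutation m) {
    // Applying the mutation generates and propagates the view updates,
    // consuming units from this shard's _view_update_concurrency_sem.
    co_await apply_mutation_and_view_updates(std::move(m));
    // Refresh this shard's entry in node_update_backlog right here, on the
    // shard that actually did the work, so the response (and the aggregated
    // max) sees the increase even when the coordinator shard is different.
    sp.update_view_update_backlog();
}
```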
Additionally, after this patch, we no longer (falsely) assume that
the backlog is modified on the same shard where we later read it
to attach to a response. However, when retrieving the max, we still
compare the aggregated backlog from all shards with the backlog of
the local shard, as with a shard-aware driver it's likely the exact
shard whose backlog changed.
Currently, a pending replica that applies a write on a table that has
materialized views will build all the view updates as a normal replica,
only to realize at a late point, in db::view::get_view_natural_endpoint(),
that it doesn't have a paired view replica to send the updates to. It will
then either drop the view updates, or send them to a pending view
replica, if such exists.
This work is unnecessary since it may be dropped, and even if there is a
pending view replica to send the updates to, the updates that are built
by the pending replica may be wrong since it may have incomplete
information.
This commit fixes the inefficiency by skipping the view update building
step when applying an update on a pending replica.
The metric total_view_updates_on_wrong_node is added to count the cases
where a view update is determined to be unnecessary.
The test reproduces the scenario of writing to a table and applying
the update on a pending replica, and verifies that the pending replica
doesn't try to build view updates.
Fixes scylladb/scylladb#19152
Closes scylladb/scylladb#19488
flat_mutation_reader_v2 was introduced in a pair of commits in 2021:
e3309322c3 "Clone flat_mutation_reader related classes into v2 variants"
08b5773c12 "Adapt flat_mutation_reader_v2 to the new version of the API"
as a replacement for flat_mutation_reader, using range_tombstone_change
instead of range_tombstone to represent range tombstones. See
those commits for more information.
The transition was incremental; the last use of the original
flat_mutation_reader was removed in 2022 in commit
026f8cc1e7 "db: Use mutation_partition_v2 in mvcc"
In turn, flat_mutation_reader was introduced in 2017 in commit
748205ca75 "Introduce flat_mutation_reader",
to transition from a mutation_reader that nested rows within
a partition in a separate stream to a flat reader that streamed
partitions and rows in the same stream.
Here, we reclaim the original name and rename the awkward
flat_mutation_reader_v2 to mutation_reader.
Note that mutation_fragment_v2 remains, since we still sometimes use
the original for compatibility.
Some notes about the transition:
- files were also renamed. In one case (flat_mutation_reader_test.cc), the
rename target already existed, so we rename to
mutation_reader_another_test.cc.
- a namespace 'mutation_reader' with two definitions existed (in
mutation_reader_fwd.hh). Its contents were folded into the mutation_reader
class. As a result, a few #includes had to be adjusted.
Closes scylladb/scylladb#19356
Currently, a view update backlog may reach an invalid state, when
its max is 0 and its relative_size() is NaN as a result. This can
be achieved either by constructing the backlog with a 0 max or by
modifying the max of an existing backlog. In particular, this
happens when creating the backlog using the default constructor.
In this patch the default constructor is deleted, and a check
that the max is different from 0 is added
to the remaining constructor - if the check fails, we construct an empty
backlog instead, to handle the possibility of getting an invalid
backlog sent from a node with a version that's missing this check.
Additionally, we make the backlog's members private, exposing them
only through const getters.
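A self-contained sketch of the resulting invariant (member and getter names are illustrative, not the actual ScyllaDB definition):
```
#include <algorithm>
#include <cstddef>

class update_backlog {
    size_t _current;
    size_t _max;
public:
    update_backlog() = delete;  // no more 0/0 backlogs whose relative_size() is NaN
    update_backlog(size_t current, size_t max) {
        if (max == 0) {
            // A node running an older version may still send max == 0;
            // degrade to an empty backlog instead of producing NaN.
            _current = 0;
            _max = 1;
        } else {
            _current = std::min(current, max);
            _max = max;
        }
    }
    static update_backlog no_backlog() { return update_backlog(0, 1); }
    float relative_size() const { return float(_current) / float(_max); }
    bool operator<(const update_backlog& other) const {
        return relative_size() < other.relative_size();
    }
    size_t current() const { return _current; }  // members are private, only const getters
    size_t max() const { return _max; }
};
```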
Currently, when calculating the view update backlog for gossip,
we start with `db::view::update_backlog()` and compare it to backlogs
from all shards. However, this backlog can't be compared to other
backlogs - it has size 0 and we compare the fraction current/size
when comparing backlogs, causing us to compare with `NaN`.
This patch fixes it by starting the comparisons with an empty backlog.
For various reasons, a view building write may fail. When that
happens, the view building should not finish until these writes
are successfully retried and they should not interfere with any
writes that are performed to the base table while the view is
building.
The test introduced in this patch confirms that this is the case.
Refs scylladb/scylladb#19261
Closes scylladb/scylladb#19263
Currently, there are 2 ways of sharing a backlog with other nodes: through
a gossip mechanism, and with responses to replica writes. In gossip, we
check each second if the backlog changed, and if it did we update other
nodes with it. However, if another node's view of this node's backlog changed
via a write response, the gossiped backlog is currently not updated,
so if, after the response, the backlog goes back to the value from the previous
gossip round, it will not get sent and the other node will be left with an
outdated backlog. This can be observed in the following scenario:
1. Cluster starts, all nodes gossip their empty view update backlog to one another
2. On node N, `view_update_backlog_broker` (the backlog gossiper) performs an iteration of its backlog update loop, sees no change (backlog has been empty since the start), schedules the next iteration after 1s
3. Within the next 1s, coordinator (different than N) sends a write to N causing a remote view update (which we do not wait for). As a result, node N replies immediately with an increased view update backlog, which is then noted by the coordinator.
4. Still within the 1s, node N finishes the view update in the background, dropping its view update backlog to 0.
5. In the next and following iterations of `view_update_backlog_broker` on N, backlog is empty, as it was in step 2, so no change is seen and no update is sent due to the check
```
auto backlog = _sp.local().get_view_update_backlog();
if (backlog_published && *backlog_published == backlog) {
    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
    continue;
}
```
After this scenario happens, the coordinator keeps information about an increased view update backlog on N even though it's actually already empty.
This patch fixes the issue by notifying the gossiper that a different backlog
was sent in a response, causing it to send the backlog to other
nodes in the following gossip round even though it is unchanged.
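A sketch of the adjusted check in the `view_update_backlog_broker` loop, extending the snippet above (the helper used to ask whether a different backlog went out in a response is an assumed name, not necessarily the real one):
```
auto backlog = _sp.local().get_view_update_backlog();
// Also publish when a (possibly different) backlog was piggybacked on a write
// response since the last round, even if the current value equals what we
// gossiped before - otherwise the other nodes keep the piggybacked value forever.
bool sent_in_response = _sp.local().view_update_backlog_needs_publishing();  // assumed helper
if (backlog_published && *backlog_published == backlog && !sent_in_response) {
    sleep_abortable(gms::gossiper::INTERVAL, _as).get();
    continue;
}
```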
Fixes: https://github.com/scylladb/scylladb/issues/18461
Similarly to https://github.com/scylladb/scylladb/pull/18646, without admission control (https://github.com/scylladb/scylladb/pull/18334), this patch doesn't have much effect, so I'm marking it as backport/none
Tests: manual. Currently this patch only affects the length of MV flow control delay, which is not reliable to base a test on. A proper test will be added when MV admission control is added, so we'll be able to base the test on rejected requests
Closes scylladb/scylladb#18663
* github.com:scylladb/scylladb:
mv: gossip the same backlog if a different backlog was sent in a response
node_update_backlog: divide adding and fetching backlogs
Currently, when generating and propagating view updates, if we notice
that we've already exceeded the time limit, we throw an exception
inheriting from `request_timeout_exception`, to later catch and
log it when finishing request handling. However, when catching, we
only check for timeouts by matching the `timed_out_error` exception,
so the exception thrown in the view update code is not registered
as a timeout exception, but as an unknown one. This can cause tests
based on the log output to start failing: in the past we noticed
the timeout at the end of the request handling and used
`timed_out_error` to keep processing it, whereas now, even
though we notice the timeout even earlier, due to its type we
log an error instead of treating it as a regular timeout.
In this patch we make the error thrown on timeout during view updates
inherit from `timed_out_error` instead of `request_timeout_exception`
(it is also moved out of the "exceptions" directory, where we define
exceptions returned to the user).
Aside from helping with the issue described above, we also improve our
metrics: `request_timeout_exception` is also not checked for
in the `is_timeout_exception` method, and because that method is used
to decide whether to update the write timeout metrics, those metrics
will only start getting updated after this patch.
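A minimal sketch of the type change (the class name is illustrative; the real definition has moved out of the "exceptions" directory as described above):
```
#include <seastar/core/timed_out_error.hh>

// Thrown when view update generation/propagation notices the deadline has passed.
// Deriving from seastar::timed_out_error (instead of exceptions::request_timeout_exception)
// lets the existing catch blocks and is_timeout_exception() treat it as a regular timeout.
class view_update_timed_out : public seastar::timed_out_error {
public:
    const char* what() const noexcept override {
        return "view update timed out";
    }
};
```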
Closes scylladb/scylladb#19102
Currently, there are 2 ways of sharing a backlog with other nodes: through
a gossip mechanism, and with responses to replica writes. In gossip, we
check each second if the backlog changed, and if it did we update other
nodes with it. However, if another node's view of this node's backlog changed
via a write response, the gossiped backlog is currently not updated,
so if after the response the backlog goes back to the value from the previous
gossip round, it will not get sent and the other node will be left with an
outdated backlog.
This patch changes this by notifying the gossiper that the backlog changed
since the last gossip round, so a different backlog could have been sent
through the response piggyback mechanism. With that information, gossip
will send the backlog to other nodes in the following gossip round even
though it is unchanged.
Fixes: https://github.com/scylladb/scylladb/issues/18461
Currently, we only update the backlogs in node_update_backlog at the
same time as we fetch them. This is done using storage_proxy's
method get_view_update_backlog, which is confusing because it's a getter
with side effects. Additionally, we don't always want to update the
backlog when we're reading it (as in gossip, which runs only on shard 0),
and we don't always want to read it when we're updating it (when we're
not handling any writes but the backlog drops because background work
finishes).
This patch divides node_update_backlog::add_fetch as well as
storage_proxy::get_view_update_backlog into two methods each: one
for updating and one for reading the backlog. This patch only replaces
the places where we're currently using the view backlog getter; more
situations where we should get/update the backlog should be considered
in a following patch.
Currently it gets the streaming/maintenance one from the database, but it
can just as well assume that it's already running in the correct one,
and the main code fulfils this assumption.
This removes one more place that uses the database as a sched groups provider.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#19078
Currently, the backlog used for MV flow control is only updated
after we generate view updates as a result of a write request.
However, when the resources are no longer used, we should also
notice that, to prevent excessive slowdowns caused by the MV
flow control calculating the delays based on an outdated, large
backlog.
This patch makes the backlogs get updated every time
a view update finishes, and not only when the updates start.
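A conceptual sketch of where the refresh now happens (simplified; only `update_view_update_backlog` is a name used elsewhere in this series):
```
// The background future that sends the view updates is chained with a finally(),
// so the published backlog shrinks as soon as the tracking resources are released,
// not only the next time a write request happens to refresh it.
future<> propagate_and_refresh(storage_proxy& sp, future<> view_update_work) {
    return view_update_work.finally([&sp] {
        sp.update_view_update_backlog();
    });
}
```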
Fixes #18783
Closes scylladb/scylladb#18804
When calculating the base-view mapping while the topology
is changing, we may encounter a situation where the base
table noticed the change in its effective replication map
while the view table hasn't, or vice-versa. This can happen
because the ERM update may be performed during the preemption
between taking the base ERM and view ERM, or, due to f2ff701,
the update may have just been performed partially when we are
taking the ERMs.
Until now, we assumed that the ERMs are synchronized when
finding the base-view endpoint mapping, so in particular, we were
using the topology from the base's ERM to check the datacenters of
all endpoints. Now that the ERMs are more likely to not be the same,
we may try to get the datacenter of a view endpoint that doesn't
exist in the base's topology, causing us to crash.
This is fixed in this patch by using the view table's topology for
endpoints coming from the view ERM. The mapping resulting from the
call might now be a temporary mapping between endpoints in different
topologies, but it still maps base and view replicas 1-to-1.
Fixes: #17786
Fixes: #18709
Closes scylladb/scylladb#18816
Callers of it had just checked if an sstable still has some views
building, so they should talk to the view-builder to register the sstable
that's now considered to be staging.
Effectively, this is to hide the view-update-generator from other
services and make them communicate with the builder only.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This helper checks if there's an ongoing build of a view, and it's in
fact internal to the view-builder, which keeps its status in one of its
system tables.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The existing inet_address::to_string() calls fmt::format("{}", *this)
anyway. However, the to_string() method is defined in a .cc file, while
the fmt formatter is in the header and is equipped with constexprs so
that converting an address to a string is done at compile time as much
as possible.
Also, though minor, fmt::to_string(foo) is believed to be even faster
than fmt::format("{}", foo).
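An illustrative stand-in showing the pattern (not the real `gms::inet_address`):
```
#include <cstdint>
#include <string>
#include <fmt/format.h>

struct inet_address_like {
    uint32_t addr;
    std::string to_string() const;  // previously defined in a .cc via fmt::format("{}", *this)
};

// The formatter lives in the header, so formatting can be resolved at the call site.
template <> struct fmt::formatter<inet_address_like> : fmt::formatter<std::string_view> {
    auto format(const inet_address_like& a, fmt::format_context& ctx) const {
        return fmt::format_to(ctx.out(), "{}.{}.{}.{}",
                              (a.addr >> 24) & 0xff, (a.addr >> 16) & 0xff,
                              (a.addr >> 8) & 0xff, a.addr & 0xff);
    }
};

// fmt::to_string() avoids going through a "{}" format string at all.
inline std::string inet_address_like::to_string() const { return fmt::to_string(*this); }
```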
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#18712
This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed.
See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions.
This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh.
The existing mechanism works in the following way:
* Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes
* Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking
* We keep track of the percent of consumed units on each node, this is called `view update backlog`.
* Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level.
This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates.
To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`.
The new algorithm of view update generation looks something like this:
```c++
for (;;) {
    auto updates = generate_updates_batch_with_max_100_rows();
    co_await seastar::sleep(calculate_sleep_time_from_backlogs());
    spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379
Closes scylladb/scylladb#16819
* github.com:scylladb/scylladb:
test: add test for bad_allocs during large mv queries
mv: throttle view update generation for large queries
exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception
db/view: extract view throttling delay calculation to a global function
view_update_generator: add get_storage_proxy()
storage_proxy: make view backlog getters public
This patch adds a test for reproducing issue #12379, which is
being fixed in #16819.
The test case works by creating a table with a materialized
view, and then performing a partition delete query on it.
At the same time, it uses injections to limit the memory
to a level lower than usual, in order to increase the
consistency of the test, and to limit its runtime.
Before #16819, the test would exceed the limit and fail,
and now the next allocation is throttled using a sleep.
For every mutation applied to the base table we have to
generate the corresponding materialized view table updates.
In case of simple requests, like INSERT or UPDATE, the number
of view updates generated per base table mutation is limited
to at most a few view table updates per base table update.
The situation is different for DELETE queries, which can delete
the whole partitions or clustering ranges. Range deletions are
fast on the base table, but for the view table the situation
is different. Deleting a single partition in the base table
will generate as many singular view updates as there are rows
in the deleted partition, which could potentially be in the millions.
To prevent OOM, view updates are generated in batches of at most 100 rows.
There is a loop which generates the next batch of updates, spawns tasks
to send them to remote nodes, generates another batch and so on.
The problem is that there is no concurrency control - each batch is scheduled
to be sent in the background, but the following batch is generated without
waiting for the previously generated updates to be sent. This can lead to
unbounded concurrency and OOM.
To protect against this, view update generation should be limited somehow.
There is an existing mechanism for limiting view updates - throttling.
We keep track of how many pending view updates there are, in the view backlog,
and delay responses to the client based on this backlog's fullness.
For a well behaved client with limited concurrency this will slow down
the amount of incoming requests until it reaches an optimal point.
This works for simple queries (INSERT, UPDATE, ...), but it doesn't do anything
for range DELETEs. A DELETE is a single request that generates millions of view
updates; delaying the client response doesn't help.
The throttling mechanism can be extended to cover this case - we could treat the
DELETE request like any other client and force it to wait before sending more updates.
This commit implements this approach - before sending the next batch of updates,
the generator is forced to sleep for a bit of time, calculated using the existing
throttling equation.
The fuller the backlog gets, the longer the generator will have to sleep,
and hopefully this will prevent overloading the system with view updates.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
In order to prevent overload caused by too many view updates,
their number is limited by delaying client responses.
The amount of time to delay for is calculated based on the
fullness of the view update backlog.
Currently this is done in the function calculate_delay,
used by abstract_write_response_handler.
In the following commits I will introduce another throttling
mechanism that uses the same equation to calculate the wait time,
so it would be good to reuse the existing function.
Let's make the function globally accessible.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
During view generation we would like to be able
to access information about the current state
of view update backlogs, but this information
is kept inside storage_proxy.
A reference to storage_proxy is kept inside view_update_generator,
so the easiest way to get access to it from the view update code
is by adding a public getter there.
There's already a similar getter for replica::database: get_db(),
so it's in line with the rest of the code.
Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
When the view builder is drained (it now happens very early, but the next patch
moves this into the regular drain) it waits for all on-going view build
steps to complete. This includes waiting for any outstanding proxy view
writes to complete as well.
View writes in the proxy have a very high timeout of 5 minutes, but they are
cancellable. However, cancelling of such writes happens in the proxy's
drain_on_shutdown() call which, in turn, happens pretty late on
shutdown. Effectively, by the time it happens all view writes must have
completed already, so stop-time cancelling doesn't really work nowadays.
The next patch makes the view builder drain happen a bit later during shutdown,
namely -- _after_ shutting down the messaging service. When it happens that
late, the non-working view writes cancellation becomes critical, as the view
builder drain hangs for the aforementioned 5 minutes. This patch explicitly
cancels all view writes when the view builder stops.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two places that work around the db.column_family_exists() call with a fancy
exception-catching lambda.
This PR makes things simpler.
Closes scylladb/scylladb#18441
* github.com:scylladb/scylladb:
view: Open-code one line lambda checking if table exists
view: Use non-throwing check if a table exists
Continuation of the previous patch. The lambda in question used to be
heavyweight(-ish) code, but now it's a one-liner. And it's only called once,
so there's no more point in keeping it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Two places in view code check if a table exists by finding its schema ID
and catching the no_such_column_family exception. That's a bit heavyweight;
the database has a column_family_exists() method for such cases.
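A before/after sketch of the pattern being replaced (the lambda body is illustrative, not a verbatim quote of the two call sites):
```
// before: probe for the table by catching the exception
auto exists = [&db] (const table_id& id) {
    try {
        db.find_column_family(id);
        return true;
    } catch (const replica::no_such_column_family&) {
        return false;
    }
};

// after: ask the database directly
bool table_exists = db.column_family_exists(id);
```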
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The populate_views() and generate_and_propagate_view_updates() both naturally belong to view_update_generator -- they don't need anything special from the table itself, but rather depend on some internals of the v.u.generator itself.
Moving them there lets us remove the view concurrency semaphore from keyspace and table, thus reducing the cross-component dependencies.
Closes scylladb/scylladb#18421
* github.com:scylladb/scylladb:
replica: Do not carry view concurrency semaphore pointer around
view: Get concurrency semaphore via database, not table
view_update_generator: Mark mutate_MV() private
view: Move view_update_generator methods' code
view: Move table::generate_and_propagate_view_updates into view code
view: Move table::populate_views() into view_update_generator class
The _view_update_concurrency_sem field on database propagates itself via
the keyspace config down to the table config, and view_update_generator then
grabs one via a table:: helper. That's overkill: view_update_generator
has a direct reference to the database and can get this semaphore from
there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now that the two methods belong to another class, move the code itself
to db/view, where the class itself resides.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to the populate_views() method, this one also naturally belongs to
the view_update_generator class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question has little to do with table; effectively it only
needs the stats and the concurrency semaphore. And the semaphore in question is
obtained from the table indirectly; it really resides on the database. On the
other hand, the method carries lots of bits from db::view, e.g. the
view_update_builder class, the memory_usage_of() helper and a bit more.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Commit 23c891923e (main: make sure view_builder doesn't propagate
semaphore errors) ignored some exceptions that could pop up from the
_build_step/do_build_step() serialized action, since they are "benign"
on stop.
Later there came b56b10a4bb (view_builder: do_build_step: handle
unexpected exceptions) that plugged any exception from the action in
question, regardless of whether it happens on stop or at run time.
Apparently, the latter commit supersedes the former.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.
in this change, we include `fmt/ranges.h` and/or `fmt/std.h`
for formatting the container types, like vector, map,
optional and variant, using {fmt} instead of the homebrew
formatter based on operator<<.
together with the changes adding fmt::formatter and
the changes explicitly using the ostream formatter, this change
allows us to drop the `FMT_DEPRECATED_OSTREAM` macro.
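for example, with the new includes container values can be passed to {fmt} directly:
```
#include <fmt/ranges.h>   // formatters for containers and tuples
#include <fmt/std.h>      // formatters for std::optional, std::variant, std::filesystem::path, ...
#include <map>
#include <optional>
#include <string>
#include <vector>

int main() {
    std::vector<int> v{1, 2, 3};
    std::map<std::string, int> m{{"a", 1}};
    std::optional<int> o{42};
    fmt::print("{} {} {}\n", v, m, o);  // e.g. [1, 2, 3] {"a": 1} optional(42)
}
```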
Refs scylladb#13245
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
When a view update has both a local and remote target endpoint,
it extends the lifetime of its memory tracking semaphore units
only until the end of the local update, while the resources are
actually used until the remote update finishes.
This patch changes the transfer of the semaphore units so that, when there
are both local and remote endpoints, both view updates share the
units, causing them to be released only after the longer of the
two updates finishes.
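A sketch of the sharing scheme using Seastar primitives (the surrounding view-update plumbing is elided; the function and its parameters are placeholders):
```
#include <seastar/core/future.hh>
#include <seastar/core/semaphore.hh>
#include <seastar/core/shared_ptr.hh>
#include <seastar/core/when_all.hh>

seastar::future<> propagate_update(seastar::semaphore_units<> units,
                                   seastar::future<> local_update,
                                   seastar::future<> remote_update) {
    // Park the units in a shared holder captured by both continuations; they are
    // released when the last holder goes away, i.e. after the slower of the two updates.
    auto shared_units = seastar::make_lw_shared<seastar::semaphore_units<>>(std::move(units));
    return seastar::when_all_succeed(
            local_update.finally([shared_units] {}),
            remote_update.finally([shared_units] {})).discard_result();
}
```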
Fixes #17890
Closes scylladb/scylladb#17891
This PR fixes a problem with replacing a node with tablets when
RF=N. Currently, this will fail because tablet replica allocation for
rebuild will not be able to find a viable destination, as the replacing node
is not considered to be a candidate. It cannot be a candidate because
replace rolls back on failure and we cannot roll back after tablets
were migrated.
The solution taken here is to not drain tablet replicas from the replaced
node during the topology request, but to leave it to happen later, after the
replaced node is in the left state and the replacing node is in the normal state.
The replacing node waits on boot for this draining to complete
before it is considered booted.
Fixes https://github.com/scylladb/scylladb/issues/17025
Nodes in the left state will be kept in tablet replica sets for a while after node
replace is done, until the new replica is rebuilt. So we need to know
those nodes' locations (dc, rack) for two reasons:
1) algorithms which work with replica sets filter nodes based on their location. For example materialized views code which pairs base replicas with view replicas filters by datacenter first.
2) tablet scheduler needs to identify each node's location in order to make decisions about new replica placement.
It's ok to not know the IP, and we don't keep it. Those nodes will not
be present in the IP-based replica sets, e.g. those returned by
get_natural_endpoints(), only in host_id-based replica
sets. storage_proxy request coordination is not affected.
Nodes in the left state are still not present in the token ring, and not
considered to be members of the ring (datacenter endpoints exclude them).
In the future we could make the change even more transparent by only
loading locator::node* for those nodes and keeping node* in tablet replica sets.
Currently, left nodes are never removed from topology, so they will
accumulate in memory. We could garbage-collect them from the topology
coordinator when a left node is absent from every replica set. That means we
need a new state - left_for_real.
Closes scylladb/scylladb#17388
* github.com:scylladb/scylladb:
test: py: Add test for view replica pairing after replace
raft, api: Add RESTful API to query current leader of a raft group
test: test_tablets_removenode: Verify replacing when there is no spare node
doc: topology-on-raft: Document replace behavior with tablets
tablets, raft topology: Rebuild tablets after replacing node is normal
tablets: load_balancer: Access node attributes via node struct
tablets: load_balancer: Extract ensure_node()
mv: Switch to using host_id-based replica set
effective_replication_map: Introduce host_id-based get_replicas()
raft topology: Keep nodes in the left state to topology
tablets: Introduce read_required_hosts()
Currently, when dividing the memory tracked for a batch of updates,
we do not take into account the overhead that we have for processing
every single update. This patch adds the overhead for single updates
and unifies the memory calculation path for batches and their parts
so that both use the same overhead.
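A sketch of the accounting idea (the overhead constant and the names are illustrative, not the real values):
```
#include <cstddef>
#include <vector>

// Assumed fixed processing overhead added per individual view update.
constexpr size_t view_update_overhead = 256;

// One calculation path, used both for a whole batch and for each update taken
// out of it, so the sum tracked for the parts matches what was tracked for the batch.
size_t tracked_memory(const std::vector<size_t>& update_sizes) {
    size_t total = 0;
    for (auto s : update_sizes) {
        total += s + view_update_overhead;
    }
    return total;
}
```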
Fixes #17854
Closes scylladb/scylladb#17855
This is necessary to not break replica pairing between base and
view. After replacing a node, the tablet replica set contains, for a while,
the replaced node, which is in the left state. This node is not
returned by the IP-based get_natural_endpoints(), so the replica
indexes would shift, changing the pairing with the view.
The host_id-based replica set always has stable indexes for replicas.
The builder works in "steps". Each step runs for a given base table; when a
new view is created it either initiates a step or appends to the currently
running step.
Running a step means reading mutations from a local sstables reader and
applying them to all views that have jumped into this step so far. When a
view is added to the step, it remembers the current token value the step
is on. When the step receives end-of-stream, it rewinds to the minimal token.
Rewinding is done by closing the current reader and creating a new one. Each
time the token is advanced, all the views that meet the new token value for
the second time (i.e. have completed a full scan round) are marked as built
and are removed from the step. When no views are left on the step, it finishes.
The above machinery can break when rewinding the end-of-stream reader.
The trick is that a running step silently assumes that if the reader
once produced some token (and there can be a view that remembered this
token as its starting one), then after rewinding the reader would
generate the same token or greater. With tablets, however, that's not
the case. When a node is decommissioned, tablets are cleaned and all
sstables are removed. Rewinding a reader after that results in an empty reader
that produces no tokens from then on. Consequently, any build steps that
had captured tokens prior to the cleanup would get stuck forever.
The fix is to check whether the mutation consumer stepped at least one step
forward after the rewind, and if not -- complete all the attached views.
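A toy model of that check (all names are illustrative):
```
#include <optional>
#include <vector>

struct build_step_model {
    std::optional<long> first_token_after_rewind;
    std::vector<int> attached_views;

    // Called for every token the rewound reader produces.
    void on_token(long token) {
        if (!first_token_after_rewind) {
            first_token_after_rewind = token;
        }
    }

    // Called when the rewound reader reports end-of-stream.
    void on_end_of_stream() {
        if (!first_token_after_rewind) {
            // The reader made no forward progress at all (e.g. tablet cleanup removed
            // the sstables), so no attached view can ever meet its starting token again:
            // mark them all as built so the step can finish.
            attached_views.clear();
        }
    }
};
```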
fixes: #17293
A similar thing should happen if the base table is truncated with views
being built from it. Testing it steps on a compaction assertion elsewhere
and needs more research.
refs: #17543
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#17548
This one-line patch fixes a failure in the dtest
lwt_schema_modification_test.py::TestLWTSchemaModification
::test_table_alter_delete,
where an update sometimes failed due to an internal server error, and the
log had the mysterious warning message:
"std::logic_error (Empty materialized view updated)"
We've also seen this log-message in the past in another user's log, and
never understood what it meant.
It turns out that the error message was generated (and warning printed)
while building view updates for a base-table mutation, and noticing that
the base mutation contains an *empty* row - a row with no cells or
tombstone or anything whatsoever. This case was deemed (8 years ago,
in d5a61a8c48) unexpected and nonsensical,
and we threw an exception. But this case actually *can* happen - here is
how it happened in test_table_alter_delete - which is a test involving
a strange combination of materialized views, LWT and schema changes:
1. A table has a materialized view, and also a regular column "int_col".
2. A background thread repeatedly drops and re-creates this column
int_col.
3. Another thread deletes rows with LWT ("IF EXISTS").
4. These LWT operations each read the existing row, and because of
the repeated drop-and-recreate of the "int_col" column, sometimes this
read notices that one node has a value for int_col and the other
doesn't, and creates a read-repair mutation setting int_col (the
difference between the two reads is just in this column).
5. The node missing "int_col" receives this mutation which sets only
int_col. It upgrade()s this mutation to its most recent schema,
which doesn't have int_col, so it removes this column from the
mutation row - and is left with a completely empty mutation row.
This completely empty row is not useful, but upgrade() doesn't
remove it.
6. The view-update generation code sees this empty base-mutation row
and fails it with this std::logic_error.
7. The node which sent the read-repair mutation sees that the read
repair failed, so it fails the read and therefore fails the LWT
delete operation.
It is this LWT operation which failed in the test, and caused
the whole test to fail.
The fix is trivial: an empty base-table row mutation should simply be
*ignored* when generating view updates - it shouldn't cause any error.
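A sketch of the fix's shape (the loop and names are illustrative; the real change lives in the view-update builder):
```
for (const auto& row : base_mutation_rows) {
    if (row.empty()) {
        // A row with no cells, tombstone or marker can legitimately appear, e.g. after
        // upgrade() strips a dropped column from a read-repair mutation. It cannot
        // affect any view, so skip it instead of throwing std::logic_error.
        continue;
    }
    generate_view_updates_for(row);
}
```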
Before this patch, test_table_alter_delete used to fail in roughly
20% of the runs on my laptop. After this patch, I ran it 100 times
without a single failure.
Fixes #15228
Fixes #17549
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#17607
Our interval template started life as `range`, and supported
wrapping to follow Cassandra's convention of wrapping around the
maximum token.
We later recognized that an interval type should usually be non-wrapping
and split it into wrapping_range and nonwrapping_range, with `range`
aliasing wrapping_range to preserve compatibility.
Even later, we realized the name was already taken by C++ ranges and
so renamed it to `interval`. Given that intervals are usually non-wrapping,
the default `interval` type is non-wrapping.
We can now simplify it further, recognizing that everyone assumes
that an interval is non-wrapping and so doesn't need the
nonwrapping_interval designation. We just rename nonwrapping_interval
to `interval` and remove the type alias.
range.hh was deprecated in bd794629f9 (2020) since its names
conflict with the C++ library concept of an iterator range. The name
::range also mapped to the dangerous wrapping_interval rather than
nonwrapping_interval.
Complete the deprecation by removing range.hh and replacing all the
aliases by the names they point to from the interval library. Note
this now exposes uses of wrapping intervals as they are now explicit.
The unit tests are renamed and range.hh is deleted.
Closes scylladb/scylladb#17428