The materialized-views flow control carefully calculates the number of
microseconds to delay a client to slow it down to the desired rate -
but then a typo (std::min instead of std::max) causes this delay to
be zeroed, in effect completely nullifying the flow control
algorithm.
Before this fix, experiments suggested that view flow control was
not having any effect and the view backlog was not bounded at all.
After this fix, we can see the flow control having its desired effect,
and the view backlog converging.
Fixes #4143.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190226161452.498-1-nyh@scylladb.com>
Part of the code is already implemented (counters and hinted-handoff).
Part of the code will probably never be (triggers). And the rest is
the code that estimates the number of rows per range to determine
query parallelism, but we implemented exponential-growth algorithms instead.
Message-Id: <20190214112226.GE19055@scylladb.com>
Do not precalculate too many ranges in advance: it requires a large
allocation and usually means that a consumer of the interface is going
to do too much work in parallel.
Fixes: #3767
The get_restricted_ranges() function takes the query-provided key ranges
and splits them at vnode boundaries. It iterates over all the ranges and
calculates all the vnodes, but its users are usually interested in only
one vnode, since most likely it will be enough to populate a page; if it
is not enough, they will ask for more. This patch replaces the function
with a new interface that generates vnode ranges on demand instead of
precalculating all of them.
Replace stdx::optional and stdx::string_view with the C++ std
counterparts.
Some instances of boost::variant were also replaced with std::variant,
namely those that called seastar::visit.
Scylla now requires GCC 8 to compile.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20190108111141.5369-1-duarte@scylladb.com>
While we keep ordinary hints in a directory parallel to the data directory,
we decided to keep the materialized view hints in a subdirectory of the data
directory, named "view_pending_updates". But during boot, we expect all
subdirectories of data/ to be keyspace names, and when we notice this one,
we print a warning:
WARN: database - Skipping undefined keyspace: view_pending_updates
This spurious warning annoyed users. Worse, we could have bigger
problems if a user actually tried to create a keyspace with that name.
So in this patch, we move the view hints to a separate top-level directory,
which defaults to /var/lib/scylla/view_hints, but as usual can be configured.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20190107142257.16342-1-nyh@scylladb.com>
This header, which is easily replaced with a forward declaration,
introduces a dependency on database.hh everywhere. Remove it and scatter
includes of database.hh in source files that really need it.
When delaying a base write, there is no need to hold on to the
mutation if all replicas have already replied.
We introduce mutation_holder::release_mutation(), which frees the
mutations that are no longer needed during the rest of the delay.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
As the amount of pending view updates increases we know that there’s a
mismatch between the rate at which the base receives writes and the
rate at which the view retires them. We react by applying backpressure
to decrease the rate of incoming base writes, allowing the slow view
replicas to catch up. We want to delay the client’s next writes to a
base replica. We use the base’s backlog of view updates to derive
this delay.
If we achieve CL and the backlogs of all replicas involved were last
seen to be empty, then we wouldn't delay the client's reply. However,
it could be that one of the replicas is actually overloaded and won't
reply to many such new requests. We'll eventually start applying
backpressure to the client via the background write queue, but in
the meantime we may be dropping view updates. To mitigate this, we rely
on the backlog being gossiped periodically.
Fixes #2538
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Change the inter-node protocol so we can propagate the view update
backlog from a base replica to the coordinator through the
mutation_done and mutation_failed verbs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
In subsequent patches, replicas will reply to the coordinator with
their view update backlog. Before introducing changes to the
messaging_service, prepare the storage_proxy to receive and store
those backlogs.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
The local view update backlog is the maximum of the relative
memory backlog size and the relative hints backlog size.
We leverage the db::view::node_update_backlog class so we can send the
maximum backlog across the node's shards.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
View updates are sent with a timeout of 5 minutes, unrelated to
any user-defined value and meant as a protection mechanism. During
normal operation we don’t benefit from timing out view writes and
offloading them to the hinted-handoff queue, since they are an
internal, non-real time workload that we already spent resources on.
This value should be increased further, but that change depends on
Refs #2538
Refs #3826
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Embedding the expire timer for a write response in the
abstract_write_response_handler simplifies the code as it allows
removing the rh_entry type.
It will also make the timeout easily accessible inside the handler,
for future patches.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181213111818.39983-1-duarte@scylladb.com>
rh_entry address is captured inside timeout's callback lambda, so the
structure should not be moved after it is created. Change the code to
create rh_entry in-place instead of moving it into the map.
Fixes #3972.
Message-Id: <20181206164043.GN25283@scylladb.com>
The current code assumes that a request failed if all replicas replied
with failure, but this is not true for CL=ANY requests. Take that into
account.
Fixes #3565
Currently, if a write request reaches CL and all replicas have replied,
but some replied with failures, the request waits for the timeout before
being retired. Detect this case and retire the request immediately instead.
Fixes #3566
* seastar d59fcef...b924495 (2):
> build: Fix protobuf generation rules
> Merge "Restructure files" from Jesse
Includes fixup patch from Jesse:
"
Update Seastar `#include`s to reflect restructure
All Seastar header files are now prefixed with "seastar" and the
configure script reflects the new locations of files.
Signed-off-by: Jesse Haber-Kucharsky <jhaberku@scylladb.com>
Message-Id: <5d22d964a7735696fb6bb7606ed88f35dde31413.1542731639.git.jhaberku@scylladb.com>
"
Every use of a tracing::global_trace_state_ptr object where a
tracing::tracing_state_ptr is expected, or a call to
tracing::global_trace_state_ptr::get(), creates a new tracing session
(span) object.
This should never happen unless query handling moves to a different shard.
Fixes #3862
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <20181018003500.10030-1-vladz@scylladb.com>
When messaging_service is started we may immediately receive a mutation
from another node (e.g., in the MV update context). If hinted handoff is not
ready to store hints at that point, some MV updates may fail.
We are going to resolve this by start()ing hints::managers before we
start messaging_service and blocking hints replaying until all relevant
objects are initialized.
Refs #3828
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"
This series changes hinted handoff to work with `frozen_mutation`s
instead of naked `mutation`s. Instead of unfreezing a mutation from
the commitlog entry and then freezing it again for sending, now we'll
just keep the read, frozen mutation.
Tests: unit(release)
"
* 'hh-manager-cleanup/v1' of https://github.com/duarten/scylla:
db/hints/manager: Use frozen_mutation instead of mutation
db/hints/manager: Use database::find_schema()
db/commitlog/commitlog_entry: Allow moving the contained mutation
service/storage_proxy: send_to_endpoint overload accepting frozen_mutation
service/storage_proxy: Build a shared_mutation from a frozen_mutation
service/storage_proxy: Lift frozen_mutation_and_schema
service/storage_proxy: Allow non-const ranges in mutate_prepare()
write_stats is referenced from the write handler, which is already
available in send_to_live_endpoints. There is no need to pass it down.
Message-Id: <20181009133017.GA14449@scylladb.com>
It is useful to have this counter to investigate the reason for read
repairs. A non-zero value means that writes were lost after CL was
reached, so read repair is expected.
Message-Id: <20181009120900.GF22665@scylladb.com>
Add an overload to send_to_endpoint() which accepts a frozen_mutation.
The motivation is to allow better accounting of pending view updates,
but this change also allows some callers to avoid unfreezing already
frozen mutations.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
So we don't attempt to send mutations to unreachable endpoints and
instead store a hint for them, we now check the endpoint status and
populate dead_endpoints accordingly in
storage_proxy::send_to_endpoint().
Fixes #3820
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20181007100640.2182-1-duarte@scylladb.com>
The foreground reads metric is derived from the number of live read
executors minus the number of background reads. Background reads are
counted down when their resolver times out. However, a read executor
may still be around for a while, resulting in such reads being
accounted as foreground.
Usually, the gap in which this happens is short, because executor
reference holders time out quickly as well. That's not always the case,
though. For instance, a local read executor doesn't time out quickly
when the target shard has an overloaded CPU, and it takes a while
before the request goes through all the queues, even if IO is not
involved. Observed in #3628.
Fixes #3734.
Another problem is that all reads which received CL responses are
accounted as background until all replicas respond, but if such a read
needs reconciliation, it's still practically a foreground read and
should be accounted as such. Found during code review.
Fixes #3745.
This patch fixes both issues by rearranging accounting to track
foreground reads instead of background reads, and considering all
reads as foreground until the resulting promise is resolved.
Message-Id: <1535999620-25784-1-git-send-email-tgrabiec@scylladb.com>
Right now, storage_proxy's mutate_stage violates isolation by running
in a plain execution_stage without a scheduling_group. This means do_mutate()
will run under the main scheduling_group, at least until we reach the database
apply execution stage, which is correct.
Fix by moving to an inheriting execution stage; this works because the
messaging service will tell RPC to set the correct execution stage for us. We
could explicitly specify statement_scheduling_group, but inheriting the
scheduling group allows us to have multiple statement scheduling groups later.
After ac27d1c93b, if a read executor has just enough targets to
achieve the request's CL and a connection to one of them is dropped
during execution, a ReadFailed error is returned immediately and the
client does not get a chance to issue a speculative read (retry). This
patch changes the code to not return the ReadFailed error immediately,
but to wait for the timeout instead, giving the client a chance to
issue a speculative read in case the read executor does not have
additional targets to send speculative reads to by itself.
Fixes #3699.
Message-Id: <20180819131646.GK2326@scylladb.com>
std::random_device() uses the relatively slow /dev/urandom, and we rarely if
ever intend to use it directly - we normally want to use it to seed a faster
random_engine (a pseudo-random number generator).
In many places in the code, we first created a random_device variable, and then
using it created a random_engine variable. However, this practice created the
risk of a programmer accidentally using the random_device object, instead of the
random_engine object, because both have the same API; This hurts performance.
This risk materialized in just two places in the code, utils/uuid.cc and
gms/gossiper.cc. A patch for uuid.cc was sent previously by Pawel and is
not included in this patch; the fix for gossiper.{cc,hh} is included here.
To avoid risking the same mistake in the future, this patch switches across the
code to an idiom where the random_device object is not *named*, so cannot be
accidentally used. We use the following idiom:
std::default_random_engine _engine{std::random_device{}()};
Here std::random_device{}() creates the random device (/dev/urandom) and pulls
a random integer from it. It then uses this seed to create the random_engine
(the pseudo-random number generator). The std::random_device{} object is
temporary and unnamed, and cannot be unintentionally used directly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20180726154958.4405-1-nyh@scylladb.com>
Count operations which were started on one shard and
were performed on another, due to non-shard-aware driver
and/or RPC.
Message-Id: <20180723155118.8545-1-avi@scylladb.com>