input_stream performs a type erasure on seastar::simple_input_stream and
fragmented_input_stream. The main goal is to keep the overhead for the
cases when simple_input_stream is used minimum.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
fragmented_input_stream is an input stream usable by IDL-generated
deserializers which can read from fragmented buffers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The histogram implementation uses sampling to estimate the mean and sum.
This patch adds a method that returns an estimated sum based on the mean
and the total number of events measured.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1467547341-30438-2-git-send-email-amnon@scylladb.com>
"
While periodic mode is a all-bets-off crap-shoot as far as knowing if
data actually reached disk or not, batch mode is supposed to be
somewhat more reliable/deterministic.
Thus, if we get an exception writing/flushing the current buffer,
we should propagate exceptions to all execution paths involved
in this buffer.
Flush queue can now (optionally) propagate exceptions to all clients, and
commit log uses this to ensure that commit log writers in batch mode
all generate exceptions on disk errors.
Also includes some rudimentary tests for flush queue mechanisms.
Note: other main user, sstable flushing, is not affected, as default
mode is still to keep exceptions to individual worker continuations,
not waiters."
Calls like later() and with_gate() may allocate memory, although that is not
very common. This can create a problem in the sense that it will potentially
recurse and bring us back to the allocator during free - which is the very thing
we are trying to avoid with the call to later().
This patch wraps the relevant calls in the reclaimer lock. This do mean that the
allocation may fail if we are under severe pressure - which includes having
exhausted all reserved space - but at least we won't recurse back to the
allocator.
To make sure we do this as early as possible, we just fold both release_requests
and do_release_requests into a single function
Thanks Tomek for the suggestion.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <980245ccc17960cf4fcbbfedb29d1878a98d85d8.1470254846.git.glauber@scylladb.com>
Issue 1510 describes a scenario in which, under load, we allocate memory within
release_requests() leading to a reentry into an invalid state in our
blocked requests' shared_promise.
This is not easy to trigger since not all allocations will actually get to the
point in which they need a new segment, let alone have that happening during
another allocator call.
Having those kinds of reentry is something we have always sought to avoid with
release_requests(): this is the reason why most of the actual routine is
deferred after a call to later().
However, that is a trick we cannot use for updating the state of the blocked
requests' shared_promise: we can't guarantee when is that going to run, and we
always need a valid shared_promise, in a valid state, waiting for new requests
to hook into.
The solution employed by this patch is to make sure that no allocation
operations whatsoever happen during the initial part of release_requests on
behalf of the shared promise. Allocation is now deferred to first use, which
relieves release_requests() from all allocation duties. All it needs to do is
free the old object and signal to the its user that an allocation is needed (by
storing {} into the shared_promise).
Fixes#1510
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <49771e51426f972ddbd4f3eeea3cdeef9cc3b3c6.1470238168.git.glauber@scylladb.com>
Re-worked to use shared_promise<> as signal mechanism
(because we have that now), which also makes it less painful
to implement exceptions propagating not only from "func" to
"post", but also from given func->post chain entry to any
waiters.
v2:
* Remove leading "_" in template types
Useful for triggerring core dump on allocation failure inside LSA,
which makes it easier to debug allocation failures. They normally
don't cause aborts, just fail the current operation, which makes it
hard to figure out what was the cause of allocation failure.
Message-Id: <1470233631-18508-1-git-send-email-tgrabiec@scylladb.com>
We use ::abs(), which has an int parameter, on long arguments, resulting
in incorrect results.
Switch to std::abs() instead, which has the correct overloads.
Fixes#1494.
Message-Id: <1469347802-28933-1-git-send-email-avi@scylladb.com>
gcc 6 complains that deleting a managed_bytes::external isn't defined
because the size isn't known. I'm not sure it's correct, but there's no
way to tell because flexible arrays aren't standardized.
Fix by using an array of zero size.
Message-Id: <1466715187-4125-1-git-send-email-avi@scylladb.com>
From Paweł:
This series introduces streaming_mutations which allow mutations to be
streamed between the producers and the consumers as a series of
mutation_fragments. Because of that the mutation streaming interface
works well with partitions larger than available memory provided that
actual producer and consumer implementations can support this as well.
mutation_fragments are the basic objects that are emitted by
streamed_mutations they can represent a static row, a clustering row,
the beginning and the end of a range tombstone. They are ordered by their
clustering keys (with static rows being always the first emitted mutation
fragment). The beginning of range tombstone is emitted before any
clustering row affected by that tombstone and the end of range tombstone
is emitted after the last clustering row affected by it. Range tombstones
are disjoint.
In this series all producers are converted to fully support the new
interface, that includes cache, memtables and sstables. Mutation queries
and data queries are the only consumers converted so far.
To minimize the per-mutation_fragment overhead streamed_mutations use
batching. The actual producer implementation fills a buffer until
it is full (currently, buffer size is 16, the limit should, however,
be changed to depend on the actual size in memory of the stored elements)
or end of stream is reached.
In order to guarantee isolation of writes reads from cache and memtable
use MVCC. When a reader is created it takes a snapshot of the particular
cache or memtable entry. The snapshot is immutable and if there happen
to be any incoming writes while the read is active a new version of
partition is created. When the snapshot is destroyed partition versions
are merged together as much as possible.
Performance results with perf_simple_query (median of results with
duration 15):
before after diff
write 618652.70 618047.58 -0.10%
read 661712.44 608070.49 -8.11%
From Glauber:
This is my new take at the "Move throttler to the LSA" series, except
this one don't actually move anything anywhere: I am leaving all
memtable conversion out, and instead I am sending just the LSA bits +
LSA active reclaim. This should help us see where we are going, and
then we can discuss all memtable changes in a series on its own,
logically separated (and hopefully already integrated with virtual
dirty).
[tgrabiec: trivial merge conflicts in logalloc.cc]
We now keep the regions sorted by size, and the children region groups as well.
Internally, the LSA has all information it needs to make size-based reclaim
decisions. However, we don't do reclaim internally, but rather warn our user
that a pressure situation is mounted.
The user of a region_group doesn't need to evict the largest region in case of
pressure and is free to do whatever it chooses - including nothing. But more
likely than not, taking into account which region is the largest makes sense.
This patch puts together this last missing piece of the puzzle, and exports the
information we have internally to the user.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Region is implemented using the pimpl pattern (region_impl), and all its
relevant data is present in a private structure instead of the region itself.
That private structure is the one that the other parts of the LSA will refer to,
the region_group being the prime example. To allow classes such as the
region_group the externally export a particular region, we will introduce a
backpointer region_impl -> region.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We are currently just allowing the region_group to specify a throttle_threshold,
that triggers throttling when a certain amount of memory is reached. We would
like to notify the callers that such condition is reached, so that the callers
can do something to alleviate it - like triggering flushes of their structures.
The approach we are taking here is to pass a reclaimer instance. Any user of a
region_group can specialize its methods start_reclaiming and stop_reclaiming
that will be called when the region_group becomes under pressure or ceases to
be, respectively.
Now that we have such facility, it makes more sense to move the
throttle_threshold here than having it separately.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
When we decide to evict from a specific region_group due to excessive memory
usage, we must also consider looking at each of their children (subgroups). It
could very well be that most of memory is used by one of the subgroups, and
we'll have to evict from there.
We also want to make sure we are evicting from the biggest region of all, and
not the biggest region in the biggest region_group. To understand why this is
important, consider the case in which the regions are memtables associated with
dirty region groups. It could be that a very big memtable was recently flushed,
and a fairly small one took its place. That region group is still quite large
because the memtable hasn't finished flushing yet, but that doesn't mean we
should evict from it.
To allow us to efficiently pick which region is the largest, each root of each
subtree will keep track of its maximal score, defined as the maximum between our
largest region total_space and the maximum maximal score of subtrees.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Currently, the regions in a region group are organized in a simple vector.
We can do better by using a binomial heap, as we do for segments, and then
updating when there is change. Internally to the LSA, we are in good position
to always know when change happens, so that's really the best way to do it.
The end game here, is to easily call for the reclaim of the largest offending
region (potentially asynchronously). Because of that, we aren't really interested
in the region occupancy, but in the region reclaimable occuppancy instead: that's
simply equal to the occupancy if the region is reclaimable, and 0 otherwise. Doing
that effectively lists all non reclaimable regions in the end of the heap, in no
particular order.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The database code uses a throttling function to make sure that memory
used for the dirty region never is over the limit. We track that with
a region group, so it makes sense to move this as generic functionality
into LSA.
This patch implements the LSA-side functionality and a later patch will
convert the current memtable throttler to use it.
Unlike the current throttling mechanism, we'll not use a timer-based
mechanism here. Aside from being more generic and friendlier towards
other users, this is a good change for current memtable by itself.
The constants - 10ms and 1MB chosen by the current throttler are arbitrary, and we
would be better off without them. Let's discuss the merits of each separately:
1) 10ms timer: If we are throttling, we expect somebody to flush the memtables
for memory to be released. Since we are in position to know exactly when a memtable
was written, thus releasing memory, we can just call unthrottle at that point, instead
of using a timer.
2) 1MB release threshold: we do that because we have no idea how much memory a request
will use, so we put the cut somehow. However, because of 1) we don't call unthrottle
through a timer anymore, and do it directly instead. This means that we can just execute
the request and see how much memory it has used, with no need to guess. So we'll call
unthrottle at the end of every request that was previously throttled.
Writing the code this way also has the advantage that we need one less continuation in
the common case of the database not being throttled.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
The main user of this list is MVCC implementation in partition_version.cc.
The reason why boost::intrusive::list<> cannot be used is that tere is no
single owner of the list who could keep boost::intrusive::list<> object
alive. In the MVCC case there is at least one partition_entry object and
possibly multiple partition_snapshot objects which lifetime is independent
and the list must remain in a valid state as long as at least one of them
is alive.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Allocations will still be allowed if made directly, but callers will have the
choice (in an upcoming patch) to proceed only if memory is below this threshold.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Tomek correctly points out that since we are now using "this" in lambda
captures, we should make the region_group not movable. We currently define a
move constructor, but there are no users. So we should just remove them.
copy constructor is already deleted, and so are the copy and move assignment
operators. So by removing the move constructor, we should be fine.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Reclaiming many segments was observed to cause up to multi-ms
latency. With the new setting, the latency of reclamation cycle with
full segments (worst case mode) is below 1ms.
I saw no decrease in throughput compared to the step of 16 segments in
neither of these modes:
- full segments, reclaim by random evicition
- sparse segments (3% occupancy), reclaim by compaction and no eviction
Fixes#1274.
There are various call-sites that explicitly check for EEXIST and
ENOENT:
$ git grep "std::error_code(E"
database.cc: if (e.code() != std::error_code(EEXIST, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
database.cc: if (e.code() != std::error_code(ENOENT, std::system_category())) {
sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) {
sstables/sstables.cc: if (e.code() == std::error_code(ENOENT, std::system_category())) {
Commit 961e80a ("Be more conservative when deciding when to shut down
due to disk errors") turned these errors into a storage_io_exception
that is not expected by the callers, which causes 'nodetool snapshot'
functionality to break, for example.
Whitelist the two error codes to revert back to the old behavior of
io_check().
Message-Id: <1465454446-17954-1-git-send-email-penberg@scylladb.com>
Make storage_io_exception exception error message less cryptic by
actually including the human-readable error message from
std::system_error...
Before:
nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage io error errno: 2
After:
nodetool: Scylla API server HTTP POST to URL '/storage_service/snapshots' failed: Storage I/O error: 2: No such file or directory
We can improve this further by including the name of the file that the I/O
error happened on.
Message-Id: <1465452061-15474-1-git-send-email-penberg@scylladb.com>
The rate_moving_average is used by timed_rate_moving_average to return
its internal values.
If there are no timed event, the mean_rate is not propertly initilized.
To solve that the mean_rate is now initilized to 0 in the structure
definition.
Refs #1306
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1465231006-7081-1-git-send-email-amnon@scylladb.com>
Currently we only shut down on EIO. Expand this to shut down on any
system_error.
This may cause us to shut down prematurely due to a transient error,
but this is better than not shutting down due to a permanent error
(such as ENOSPC or EPERM). We may whitelist certain errors in the future
to improve the behavior.
Fixes#1311.
Message-Id: <1465136956-1352-1-git-send-email-avi@scylladb.com>
This patch adds a few data structure for derived and accumulative
statistics that are similiar to the yammer implementation used by the
JMX.
It also adds a plus operator to histogram which cleans the histogram
usage.
moving_average - An exponentially-weighted moving average. calculate an event rate
on a given interval.
rate_moving_average and timed_rate_moving_average - Calculate 1m, 5m and
15m ewma an all time avrage and a counter.
rate_moving_average_and_histogram and
timed_rate_moving_average_and_histogram - Combines a histogram with a
rate_moving_average. It also expose a histogram API so it will be an
easy task to replace a histogram with a
timed_rate_moving_average_and_histogram.
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
"Conversion/implementation of "authorizer" code from origin, handling
permissions management for users/resources.
Default implementation keeps mapping of <user.resource>->{permissions}
in a table, contents of which is cached for slightly quicker checks.
Adds access control to all (existing) cql statements.
Adds access management support to the CQL impl. (GRANT/REVOKE/LIST)
Verified manually and with dtest auth_test.py. Note that several of these
still fail due to (unrelated) unimplemented features, like index, types
etc.
Fixes#1138"