"The atomic sstable deletion provides exception safety at the cost of
quadratic behavior in the number of sstables awaiting deletion. This
causes high cpu utilization during startup.
Change the code to avoid quadratic complexity, and add some unit tests.
See #1812."
(cherry picked from commit 985d2f6d4a)
Commit 7dcd70124a "tests/sstables: add test for fast forwarding reader"
added a test for skipping parts of an sstable. Unfortunately, it did not
include the sstables it was trying to read.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Originally, cache tests checked how many times a mutation reader was
created from the underlying mutation source to determine whether the
continuity flag is working correctly.
This is not going to work with fast forwarding mutation readers, so the
test is switched to count the number of mutations (+ end of stream markers)
returned from underlying mutation readers, which is much less fragile.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
That overload was used only by a unit test and violated the guarantee that
the partition range lives until the mutation reader is done.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Make split_after() more generic by allowing split_point to be anywhere,
not just within the input range. If the split_point is before, the entire
range is returned; and if it is after, stdx::nullopt is returned.
"before" and "after" are not well defined for wrap-around ranges, but we
are phasing them out and soon there will not be any
wrapping_range::split_after() users.
This is a prerequisite for converting partition_range and friends to
nonwrapping_range.
Message-Id: <1475765099-10657-1-git-send-email-avi@scylladb.com>
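The generalized split_after() behavior described above can be sketched as
follows. This is a minimal illustration, not the Scylla implementation: it
assumes a hypothetical closed integer interval (int_range) instead of the
real range template, and std::optional instead of stdx::optional.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-in for a non-wrapping range: closed interval [start, end].
struct int_range {
    int start;
    int end;
};

// Returns the part of r that lies strictly after split_point.
// - split_point before the range: the entire range is returned
// - split_point at or after the end: nothing is left, so nullopt
// - split_point inside the range: the remainder (split_point, end]
std::optional<int_range> split_after(const int_range& r, int split_point) {
    if (split_point < r.start) {
        return r;
    }
    if (split_point >= r.end) {
        return std::nullopt;
    }
    return int_range{split_point + 1, r.end};
}
```

With r = [10, 20], a split point of 5 yields the whole range, 15 yields
[16, 20], and 25 yields no range at all.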
Remove inclusions from header files (primary offender is fb_utilities.hh)
and introduce new messaging_service_fwd.hh to reduce rebuilds when the
messaging service changes.
Message-Id: <1475584615-22836-1-git-send-email-avi@scylladb.com>
"Description:
============
Scylla currently suffers from a brick wall behavior of the request throttler.
Requests pile up until we reach the dirty memory limit, at which point we stop
serving them until we have freed enough memory to allow for more requests.
The problem is that freeing dirty memory means writing an SSTable to completion.
That can take a long time, even if we are blessed with great disks. Those long
waiting times can and will translate into timeouts. That is bad behavior.
What this patch does is introduce one form of virtual dirty memory accounting.
Instead of allowing 100 % of the dirty memory to be filled up until we stop
accepting requests, we will do that when we reach 50 % of memory. However,
instead of releasing requests only when an SSTable is fully written, we start
releasing them when some memory was written.
The practical effect of that is that once we reach 50 % occupancy in our dirty
memory region, we will bring the system from CPU speed to disk speed, and will
start accepting requests only at the rate we are able to write memory back.
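The accounting idea above can be sketched roughly as follows. This is an
illustration only: the class and method names are hypothetical and do not
correspond to the region group code in this series. It assumes "virtual
dirty" is the real dirty memory minus the bytes already written by an
in-progress flush.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical accounting object; names are illustrative, not Scylla's.
class virtual_dirty_accounting {
    std::size_t _limit;           // total dirty memory budget
    std::size_t _real_dirty = 0;  // memory actually held by memtables
    std::size_t _flushed = 0;     // bytes already written to disk, not yet freed
public:
    explicit virtual_dirty_accounting(std::size_t limit) : _limit(limit) {}

    std::size_t real_dirty() const { return _real_dirty; }
    std::size_t virtual_dirty() const { return _real_dirty - _flushed; }

    // Stop accepting requests once virtual dirty reaches 50 % of the limit.
    bool should_throttle() const { return virtual_dirty() >= _limit / 2; }

    // A request added `bytes` to a memtable.
    void on_write(std::size_t bytes) { _real_dirty += bytes; }

    // The flush wrote `bytes` of memtable data; release that much virtually.
    void on_flush_progress(std::size_t bytes) { _flushed += bytes; }

    // The SSTable is sealed and a fully flushed memtable of `bytes` is freed.
    void on_sstable_sealed(std::size_t bytes) {
        _real_dirty -= bytes;
        _flushed -= bytes;
    }
};
```

In this sketch requests are released as soon as on_flush_progress() brings
virtual dirty memory back under half of the limit, even though the real
memory is only returned in on_sstable_sealed().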
Results
=======
With this patchset running a load big enough to easily saturate the disk,
(commitlog disabled to highlight the effects of the memtable writer), I am able
to run scylla for many minutes, with timeouts occurring only when I run out of
disk space, whereas without this patch a swarm of timeouts would start merely 2
seconds after the load started - and would never get stable.
In V2, I have sent a set of graphs illustrating the performance of this solution.
This version does not have any significant differences on that front.
For details, please refer to
https://groups.google.com/d/msg/scylladb-dev/iCvD-3Z-QqY/EM8KUh_MAQAJ
Accuracy of the accounting:
---------------------------
It is important for us to be as accurate as possible when accounting freed
memory, since every byte we mark as freed may allow one or more requests to be
executed. I have measured the accuracy of this approach (ignoring padding,
object size for the mutation fragments) to be 99.83 % of used memory in the
test workload I have run (large, 65k mutations). Memtables under this circumstance
tend to have a very high occupancy ratio because throttle breeds idle, and idle
breeds compact-on-idle.
Known Issues:
-------------
A lot of time can elapse between destroying the flush_reader and actually
releasing memory. The release of memory only happens when the SSTable is fully
sealed, and we have to flush the files, as well as finish writing all SSTable
components at this point. This happened in practice with a buggy kernel that
would result in flushes taking a long time.
After that is fixed, this is just a theoretical problem and in practice it
shouldn't matter given the time we expect those operations to take."
* 'virtual-dirty-v6' of github.com:glommer/scylla:
database: allow virtual dirty memory management
streamed_mutation: make _buffer private
add accounting of memory read to partition_snapshot_reader
move partition_snapshot_reader code to header file
LSA: allow a group to query its own region group
memtables: split scanning reader in two
sstables: use special reader for writing a memtable
LSA: export information about object memory footprint
LSA: export information about size of the throttle queue
database: export virtual dirty bytes region group
We allocate objects of a certain size, but we use a bit more memory to hold
them. To get a clearer picture of how much memory an object will cost us, we
need help from the allocator. This patch exports an interface that allows users
to query a specific allocator to get that information.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This patch set ensures we can correctly handle queries
where the minimum token is specified."
* 'min-token/v3' of github.com:duarten/scylla:
cql_query_test: Add test case for min/max token bounds
token_restriction: Deal with minimum tokens
partitioner: Parse token from bytes
When max sstable size is increased, higher levels are suffering from
starvation because we decide to compact a given level if the following
calculation results in a number greater than 1.001:
level_size(L) / max_size_for_level_l(L)
Fixes #1720.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
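The per-level check quoted above can be written down as a small sketch. The
function names are hypothetical, and the maximum size for a level is taken
as a parameter because the message does not say how it is derived.

```cpp
#include <cassert>
#include <cstdint>

// Score of a level: how full it is relative to its maximum allowed size.
double level_score(std::uint64_t level_size, std::uint64_t max_size_for_level) {
    return static_cast<double>(level_size) / static_cast<double>(max_size_for_level);
}

// A level is considered for compaction when its score exceeds 1.001.
bool level_needs_compaction(std::uint64_t level_size, std::uint64_t max_size_for_level) {
    return level_score(level_size, max_size_for_level) > 1.001;
}
```

For example, a level holding 1100 units against a maximum of 1000 has a
score of 1.1 and qualifies, while a level at or below its maximum does not.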
Uniform token range distribution across sstables in a level > 1 was broken,
because we were only choosing the sstable with the lowest first key when compacting
a level > 0. This resulted in a performance problem because L1->L2 may have a
huge overlap over time, for example.
Last compacted key will now be stored for each level to ensure sort of
"round robin" selection of sstables for compactions at level >= 1.
That's also done by C*, and they were once affected by it as described in
https://issues.apache.org/jira/browse/CASSANDRA-6284.
Fixes #1719.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
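The "round robin" selection driven by the last compacted key can be
sketched as below. This is an assumption-laden illustration: plain strings
stand in for decorated keys, sstable_info and pick_next_candidate are
hypothetical names, and the level is assumed to be non-empty and sorted by
first key.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical summary of an sstable: only its key bounds matter here.
struct sstable_info {
    std::string first_key;
    std::string last_key;
};

// Picks the first sstable that starts after the last compacted key of the
// level, wrapping around to the first sstable when none is found.
std::size_t pick_next_candidate(const std::vector<sstable_info>& level,
                                const std::string& last_compacted_key) {
    for (std::size_t i = 0; i < level.size(); ++i) {
        if (level[i].first_key > last_compacted_key) {
            return i;
        }
    }
    return 0;  // every sstable was already visited: start a new round
}
```

After each compaction the caller would store the last key of the chosen
sstable as the new last compacted key for that level, so successive
compactions walk across the token range instead of always picking the
sstable with the lowest first key.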
Leveled strategy heavily uses the first and last decorated keys of an
sstable to get overlapping sstables in a given level. By storing the
first and last decorated keys in the sstable object, it's expected
that performance of leveled strategy (not compaction) will be
improved.
We will set first and last keys in sstable when either loading
or sealing it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <0abca819454ab4c088541bb49714f1f6a7dc4f42.1473959677.git.raphaelsc@scylladb.com>
* seastar 0303e0c...e534401 (6):
> Merge "enable rpc to work on non contiguous memory for receive" from Gleb
> install-dependencies.sh: install python3 for Ubuntu/Debian, which requires for configure.py
> fix tcp stuck when output_stream write more than 212992 bytes once.
> scripts/posix_net_conf.sh: supress 'ls: cannot access /sys/class/net/<NIC>/device/msi_irqs/' error message
> scripts/posix_net_conf.sh: fix 'command not found' error when specifies --cpu-mask
> native_network_stack: Fix use after free/missing wait in dhcp
Includes: "Remove utils::fragmented_input_stream and utils::input_stream in favor of seastar version" from Gleb.
That will be needed for optimization that will store decorated keys
in the sstable object, and also for a subsequent work that will
detect wrong metadata (min/max column names) by looking at columns
in the schema. As schema is stored in sstable, there's no longer
a need to store ks and cf names in it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This reverts commit 1726b1d0cc.
Reverting this patch turns our SSTable access counter into a miss counter only.
The estimated histogram always starts its first bucket at 1, so by marking cache
accesses we will be wrongly feeding "1" into the buckets.
Notice that this is not yet ideal: nodetool is supposed to show a histogram of
all reads, and by doing this we are changing its meaning slightly. Workloads
that serve mostly from cache will be distorted towards their misses.
The real solution is to use a different histogram, but we will need to enforce
a newer version of nodetool for that: the current issue is that nodetool expects
an EstimatedHistogram in a specific format on the other side.
Conflicts:
row_cache.hh
Message-Id: <a599fa9e949766e7c9697450ae34fc28e881e90a.1472742276.git.glauber@scylladb.com>
Signed-off-by: Glauber Costa <glauber@scylladb.com>
This patch makes the optional trace_state_ptr arguments introduced in
previous patches mandatory where possible. Functions which are called
internally don't have a trace context, so for those we keep the
argument's default value for convenience.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
If we have a cache hit, we still need to update our sstable histogram - noting
that we have touched 0 SSTables.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"clustering_key_filtering_context is no longer needed.
partition_slice can be used instead so this series removes
clustering_key_filtering_context and passes partition_slice down where
it's needed. Then a static get_ranges method is used to obtain
clustering key ranges for a given partition.
Fixes #1614."
Remove clustering_key_filter_factory and clustering_key_filtering_context.
Use partition_slice directly with a static get_ranges method.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Reversed iterators are adaptors for 'normal' iterators. These underlying
iterators point to different objects than the reversed iterators
themselves.
The consequence of this is that removing an element pointed to by a
reversed iterator may invalidate reversed iterators which point to a
completely different object.
This is what happens in trim_rows for reversed queries. Erasing a row
can invalidate end iterator and the loop would fail to stop.
The solution is to introduce the
reversal_traits::erase_dispose_and_update_end() function, which erases and
disposes of the object pointed to by a given iterator but also takes a
reference to an end iterator and updates it if necessary to make sure
that it stays valid.
Fixes #1609.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1472080609-11642-1-git-send-email-pdziepak@scylladb.com>
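The invalidation problem and the shape of the fix can be illustrated with
standard containers. This is only a sketch: it uses std::list instead of
the intrusive containers in mutation_partition, has no disposer, and the
name erase_and_update_end is a simplified stand-in for
reversal_traits::erase_dispose_and_update_end().

```cpp
#include <cassert>
#include <iterator>
#include <list>

// Erases the element that the reversed iterator `it` refers to and returns
// the reversed iterator following it. A reversed iterator refers to the
// element just before its underlying iterator, so the element erased here
// may be exactly the one that end.base() points to; in that case `end` is
// rebuilt from the iterator returned by erase() so that it stays valid.
template <typename Container>
typename Container::reverse_iterator erase_and_update_end(
        Container& c,
        typename Container::reverse_iterator it,
        typename Container::reverse_iterator& end) {
    auto underlying = std::next(it).base();  // points at the element *it
    const bool end_invalidated = (underlying == end.base());
    auto following = c.erase(underlying);
    typename Container::reverse_iterator result(following);
    if (end_invalidated) {
        end = result;
    }
    return result;
}
```

A loop of the form `while (it != end) { it = erase_and_update_end(rows, it, end); }`
then terminates even when the last erased element is the one underlying
the end iterator.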
This patch makes append() and write() limit the maximum size of a single
allocation to bytes_ostream::max_chunk_size.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
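The effect of capping the size of a single allocation can be sketched with
a simplified chunked buffer. This is not the bytes_ostream code: the class
chunked_buffer and its members are hypothetical, std::string is used as the
chunk storage, and max_chunk_size is assumed to be greater than zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical chunked output buffer: no chunk ever exceeds max_chunk_size.
class chunked_buffer {
    std::vector<std::string> _chunks;
    std::size_t _max_chunk_size;
public:
    explicit chunked_buffer(std::size_t max_chunk_size)
        : _max_chunk_size(max_chunk_size) {}

    // Appends data, filling the last chunk first and opening new chunks of
    // at most max_chunk_size bytes instead of one large allocation.
    void write(const char* data, std::size_t size) {
        while (size > 0) {
            if (_chunks.empty() || _chunks.back().size() == _max_chunk_size) {
                _chunks.emplace_back();
                _chunks.back().reserve(std::min(size, _max_chunk_size));
            }
            std::string& last = _chunks.back();
            const std::size_t n = std::min(size, _max_chunk_size - last.size());
            last.append(data, n);
            data += n;
            size -= n;
        }
    }

    std::size_t chunk_count() const { return _chunks.size(); }

    std::size_t size() const {
        std::size_t total = 0;
        for (const auto& c : _chunks) {
            total += c.size();
        }
        return total;
    }

    std::size_t largest_chunk() const {
        std::size_t largest = 0;
        for (const auto& c : _chunks) {
            largest = std::max(largest, c.size());
        }
        return largest;
    }
};
```

Writing 10 bytes into a buffer with a 4-byte limit produces chunks of 4, 4
and 2 bytes rather than a single 10-byte allocation.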
Once unlink_leftmost_without_rebalance() has been called on a bi::set no
other method can be used. This includes clear_and_dispose() used by the
mutation_partition destructor.
We like unlink_leftmost_without_rebalance() because it is efficient, so
the solution is to manually finish destroying clustering row and range
tombstone sets in the reader destructor using that function.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch introduces the nonwrapping_range class. This class is
intended to be used by code that requires non wrapping ranges.
Internally, it uses a wrapping_range. Users are responsible for
ensuring the bounds are correct when creating a nonwrapping_range.
The path proposed here is to incrementally replace usages of
wrapping_range/range by nonwrapping_range, pushing usages of wrapping
ranges as far toward the edges as possible.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"
While periodic mode is an all-bets-off crap-shoot as far as knowing if
data actually reached disk or not, batch mode is supposed to be
somewhat more reliable/deterministic.
Thus, if we get an exception writing/flushing the current buffer,
we should propagate exceptions to all execution paths involved
in this buffer.
Flush queue can now (optionally) propagate exceptions to all clients, and
commit log uses this to ensure that commit log writers in batch mode
all generate exceptions on disk errors.
Also includes some rudimentary tests for flush queue mechanisms.
Note: other main user, sstable flushing, is not affected, as default
mode is still to keep exceptions to individual worker continuations,
not waiters."
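The propagation behavior described above can be sketched without seastar
futures. This is a simplified, callback-based illustration: batch_flush and
its members are hypothetical names, and the real flush queue works with
futures and continuations rather than std::function callbacks.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical batch of writers that all depend on one buffer flush.
class batch_flush {
    std::vector<std::function<void(std::exception_ptr)>> _waiters;
public:
    // Registers a writer that wants to know how the flush ended.
    void add_waiter(std::function<void(std::exception_ptr)> waiter) {
        _waiters.push_back(std::move(waiter));
    }

    // Runs the flush once. If it throws, every waiter receives the same
    // exception; on success every waiter receives a null exception_ptr.
    void run(const std::function<void()>& flush) {
        std::exception_ptr error;
        try {
            flush();
        } catch (...) {
            error = std::current_exception();
        }
        auto waiters = std::move(_waiters);
        _waiters.clear();
        for (auto& waiter : waiters) {
            waiter(error);
        }
    }
};
```

The point of the sketch is that a disk error during the shared flush is
seen by every writer whose data was in the buffer, not only by the one that
happened to trigger the flush.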
In this unit test, we create, using Scylla C++ code, the same large
partition with 13520 CQL rows as we previously imported from Cassandra
for the large partition test. We then verify that the sstable index file
we just wrote is byte-for-byte identical to the one previously created by
Cassandra. They should indeed be identical, because the data file has the
same layout (even if timestamps are different) and our default promoted-
index block size is the same (64K) so the sample of columns should be
identical.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Currently, the main sstable data parsing entry point data_consume_rows()
takes a contiguous range of bytes to read from disk and parse. This range
is supposed to be an entire partition or contiguous group of partitions,
and is self-contained (can be parsed without extra information about the
identity of these partitions).
For the promoted index feature (which we will add in a following patch)
we will want the range to span only a part of a partition, and will need
the caller to provide some information not available to the parser (such
as the partition's key). In the future, we will also want to support a
vector of byte ranges, instead of just one.
So in preparation for this, this patch simply replaces the start/end pair
by a new class disk_read_range, which can be easily extended in later
patches. No new functionality is introduced in this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a test that takes an sstable with one partition of 13,520
clustering rows (spanning 700 KB in the data file), and attempts to read
various slices of CQL rows, checking that we got back the expected number
of rows.
The sstable included here was generated by Cassandra, and includes a
promoted index. Promoted index reading is not supported yet (we will
add it in the next patch), so for now the code will always read the
entire partition from disk. Still, the clustering-key filtering is
already functional, and will drop some of the rows as requested,
so this test will pass.
Later, when we add promoted index support, we should check that this test
still passes - promoted index will make the reads in this test more
efficient (which the test cannot verify), but the important thing to check
is that it doesn't break any of these tests.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>