Commit 7dcd70124a "tests/sstables: add
test for fast forwarding reader" added a test for skipping parts of
sstable. Unfortunately, it did not include the sstables it was trying to
read.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
The current incarnation of commitlog establishes a maximum amount of
writes that can be in-flight, and blocks new requests after that limit
is reached.
That is obviously something we must do, but the current approach to it
is problematic for two main reasons:
1) It forces the requests that trigger a write to wait on the current
write to finish. That is excessive; ideally we would wait for one
particular write to finish, not necessarily the current one. That
is made worse by the fact that when a write is followed by a flush
(happens when we move to a new segment), then we must wait for
*all* writes in that segment to finish.
1) it casts concurrency in terms of writes instead of memory, which
makes the aforementioned problem a lot worse: if we have very big
buffers in flight and we must wait for them to finish, that can
take a long time, often in the order of seconds, causing timeouts.
The approach taken by this patch is to replace the _write_semaphore
with a request_controller. This data structure will account the amount
of memory used by the buffers and set a limit on it. New allocations
will be held until we go below that limit, and will be released
as soon as this happens.
This guarantees that the latencies introduced by this mechanism are
spread out a lot better among requests and will keep higher percentile
latencies in check.
To test this, I have ran a workload that times out frequently. That
workload use 10 threads to write 100 partitions (to isolate from the
effects of the memtable introduced latencies) in a loop and each
partition is 2MB in size.
After 10 minutes running this load, we are left with the following
percentiles:
latency mean : 51.9 [WRITE:51.9]
latency median : 9.8 [WRITE:9.8]
latency 95th percentile : 125.6 [WRITE:125.6]
latency 99th percentile : 1184.0 [WRITE:1184.0]
latency 99.9th percentile : 1991.2 [WRITE:1991.2]
latency max : 2338.2 [WRITE:2338.2]
After this patch:
latency mean : 54.9 [WRITE:54.9]
latency median : 43.5 [WRITE:43.5]
latency 95th percentile : 126.9 [WRITE:126.9]
latency 99th percentile : 253.9 [WRITE:253.9]
latency 99.9th percentile : 364.6 [WRITE:364.6]
latency max : 471.4 [WRITE:471.4]
Signed-off-by: Glauber Costa <glauber@scylladb.com>
In a subsequent patch, I'll use this code in a different place. To
prepare for that, we move it out as a method. It also fits a lot better
inside the segment manager, so move it there.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Goal is to calculate a size that is lesser or equal than the
segment-dependent size.
This was originally written by Tomasz, and featured in his submission
"commitlog: Handle overload more gracefully"
Extracted here so it sits clearly in a different patch.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
It is mostly an optimization, and while it makes sense in this context,
it won't soon as we'll stop waiting for the current cycle specifically
to finish.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We'll do that so we can, in following patches, use static members from
the segment. Those are not defined at this point.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
We track the amount of pending allocations but we don't really export
it. It will be crucial when we stop tracking pending writes.
This patch exports it through a method instead of the totals structure,
so we can easily change it. Current code probing pending_allocations
(the api code) is also converted to use the public method instead of the
totals struct.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
"This patchset enables mutation readers to be fast forwarded to a different
partition range. The main reason for introducing such feature are range
queries served from cache. If the cache is partially populated in the
requested range the reader will end up with multiple subranges that have
to be read from the sstables. Originally, each of these subranges would
require a new reader to be created, but with fast forwarding we can have
just one sstable reader. This is better since there is a chance that buffers
kept by the reader may be still useful after fast forwarding it.
In this series there are also patches that clean up cache readers in order
to make integration with fast forwarding easier. Namely, continuity flag is
changed to store information about range before the entry which significantly
simplifies the logic.
Fixes #1299."
* 'pdziepak/fast-forward-mutation-readers/v5' of github.com:cloudius-systems/seastar-dev: (24 commits)
sstables: keep separate stream history for single and range reads
sstables: drop sstable::{lower, upper}_bound()
row_cache: rework cache to use fast forwarding reader
row_cache: put cache entry flags in a struct
row_cache: add do_find_or_create_entry() to reduce code duplication
mutation_reader: forward fast_forward_to() calls
tests/row_cache: add fast_forward_to() to throttled reader
tests/row_cache: count mutations read from _underlying
memtable: add support for fast_forward_to()
drop key readers
tests/mutation_reader: test fast forwarding combined reader
database: enable fast forwarding of range_sstable_reader
combined_mutation_reader: implement fast_forward_to()
mutation_reader: make combinded_reader public
tests/sstables: add test for fast forwarding reader
tests: add more helpers to mutation reader assertions
sstables: enable fast forwarding for range readers
mutation_reader: introduce fast_forward_to()
sstables: implement mutation_reader::impl::fast_forward_to()
sstables: introduce index_reader
...
Single partition and partition range reads are expected to behave
considerably different so it is worth to have them use separate file
stream history. This also makes reads use different history for each
sstable which is also a good thing.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This uncomfortably large patch overhauls cache range reader so that it
can take advantage of fast forwarding mutation readers.
A significant change in the cache itself is that the continuity flag now
is used to determine whether cache is contiguous between the previous
entry and the current one. This allows for a significant simplification
of the cache code and easier integration with reader fast forwarding.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Flags are easier to manage if they are in a single structure.
Especially, default initialization and move contstructors are simpler
and less error prone.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Originally, cache tests checked how many times a mutation reader was
created from the underlying mutation source to determine whether
continuity flag is working correctly.
This is not going to work with fast forwarding mutation readers so the
test is switched to count number of mutations (+ end of stream markers)
returned from underlying mutaiton readers which is much less fragile.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Fast forwarding of memtable readers is needed only for unit tests which
often use memtables as underlying data source for cache and the cache
readers.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
When fast forwarding a reader that combines sstable reader we must also
remember that the set of sstables for the new range may be different
than for the previous one. The reader introduced in this patch makes
sure that we read from correct sstables.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
We want to be able to fast forward sstable readers. However, just
implementing fast_forward_to() for combined_reader is not enough as the
sstables we are reading from may need to change.
Following patches are going to introduce a combined sstable reader that
derives from combined_reader. To make that possible we first need to
make combined_reader public.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch introduces the interface for fast forwarding mutation
readers. The main user of this feature is going to be cache which, while
serving range query, may need to read multiple small ranges from the
sstables to populate itself with the missing entries.
Fast forwarding is an alternative to recreating a reader with different
range. Its main advantage is fact that it avoids dropping data that has
already been read.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
This patch allows sstable readers to be fast forwarded without making it
necessary to recreate the reader (and dropping all buffers in the
process). It is built on top of index_reader and ability of
data_consume_context to be fast forwarded.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
index_reader is a helper that implements index lookups. Its goal is to
avoid dropping read buffers if they still may be needed (for example to
get end bound of the range or after fast forwarding the reader).
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
That overload was used only by unit test and violated guarantee that
partition range lives until mutation reader is done.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
is_compactible() will pass on very small regions. full_compaction() is
only used in tests to force objects to be moved due to compaction, so
we want all reclaimable regions to be compacted.
The move constructor of partition_version was not invoking move
constructor of anchorless_list_base_hook. As a result, when
partition_version objects were moved, e.g. during LSA compaction, they
were unlinked from their lists.
This can make readers return invalid data, because not all versions
will be reachable.
It also casues leaks of the versions which are not directly attached
to memtable entry. This will trigger assertion failure in LSA region
destructor. This assetion triggers with row cache disabled. With cache
enabled (default) all segments are merged into the cache region, which
currently is not destroyed on shutdown, so this problem would go
unnoticed. With cache disabled, memtable region is destroyed after
memtable is flushed and after all readers stop using that memtable.
Fixes#1753.
Message-Id: <1476778472-5711-1-git-send-email-tgrabiec@scylladb.com>
This patch uses cf_properties instead to add the missing attributes to
the create_view_statement class.
Fixes#1766
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch adds the VIEWS element to the cause enum so we can
mark failures due to incomplete support of materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts the definition of the default compressor into the
compression_parameters class, so that the table and view creation
statements don't have to explicitly deal with it.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts the cf_properties class, which contains common
attributes of tables and materialized views.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Since we are exiting Scylla process in engine().at_exit() using
::_exit(0), even verify_seastar_io_scheduler() throwing an exception,
scylla always exit with 0.
Systemd misunderstands scylla-server.service was shutdown successfully
because of this, so we need to pass correct exit code to ::_exit() here.
Fixes#1674
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1475065607-15486-1-git-send-email-syuu@scylladb.com>
* seastar 207bf3d...ccd8649 (3):
> Merge "Augment semaphore with non-blocking operations" from Glauber
> Merge "More dynamic fstream patches" from Paweł
> Merge "fstream: add dynamic adjustments based on stream history" from Paweł
A 1MB response will require 2000 allocations with the current 512-byte
chunk size. Increase it exponentially to reduce allocation count for
larger responses (still respecting the upper limit).
Message-Id: <1476369152-1245-1-git-send-email-avi@scylladb.com>
Memory accounting code was attaching partition_snapshot to
partition_entry in order to calculate the size of partition_version
object. However, it is only allowed if partition_entry doesn't have
any snapshot attached already. In this case it always has one, created
by the flushing reader.
Change the accounting code to reuse existing partition_snapshot reference.
Fixes#1746
Message-Id: <1476449160-9252-1-git-send-email-tgrabiec@scylladb.com>
LSA tries to allocate zones as large as possible (while still leaving
enough free space for the standard allocator). It uses the amount of
free memory in order to guess how much it can get, but that obviously
doesn't account for fragmentation and the allocation attempt may fail.
This patch changes the LSA code so that it doesn't throw in case zone
couldn't be created but just returns a null pointer which should be
more performant if the LSA memory cannot grow any more.
Fixes#1394.
Signed-off-by: Paweł Dziepak <pdziepak@scylladb.com>
Message-Id: <1476435031-5601-1-git-send-email-pdziepak@scylladb.com>
The expected behaviour in the scylla_setup script is that a question
will be followed by the answer.
For example, after asking if the scylla should be run as a service the
relevant actions will be taken before the following question.
This patch address two such mis-orders:
1. the scylla-housekeeping depends on the scylla-server, but the
setup should first setup the scylla-server service and only then ask
(and install if needed) the scylla-housekeeping.
2. The node_exporter should be placed after the io_setup is done.
Fixes#1739
Signed-off-by: Amnon Heiman <amnon@scylladb.com>
Message-Id: <1476370098-25617-1-git-send-email-amnon@scylladb.com>
Change abstract_replication_strategy::create_replication_strategy() to
throw exceptions::configuration_error if replication strategy class
lookup to make sure the error is converted to the correct CQL response.
Fixes#1755
Message-Id: <1476361262-28723-1-git-send-email-penberg@scylladb.com>