"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such case, the
sstable_mutation_reader will setup the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.
Fixes#3304.
"
* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Test reads with no clustering ranges and no static columns
tests: simple_schema: Allow creating schema with no static column
clustering_ranges_walker: Stop after static row in case no clustering ranges
(cherry picked from commit 054854839a)
In gossiper::handle_major_state_change() we set the endpoint_state for
a particular endpoint and replicate the changes to other cores.
This is totally unsynchronized with the execution of
gossiper::evict_from_membership(), which can happen concurrently, and
can remove the very same endpoint from the map (in all cores).
Replicating the changes to other cores in handle_major_state_change()
can interleave with replicating the changes to other cores in
evict_from_membership(), and result in an undefined final state.
Another issue happened in debug mode dtests, where a fiber executes
handle_major_state_change(), calls into the subscribers, of which
storage_service is one, and ultimately lands on
storage_service::update_peer_info(), which iterates over the
endpoint's application state with deferring points in between (to
update a system table). gossiper::evict_from_membership() was executed
concurrently by another fiber, which freed the state the first one is
iterating over.
Fixes#3299.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180318123211.3366-1-duarte@scylladb.com>
(cherry picked from commit 810db425a5)
Currently Ubuntu 18.04 uses distribution provided g++ and boost, but it's easier
to maintain Scylla package to build with same version toolchain/libraries, so
switch to them.
Signed-off-by: Takuya ASADA <syuu@scylladb.com>
Message-Id: <1521075576-12064-1-git-send-email-syuu@scylladb.com>
(cherry picked from commit 945e6ec4f6)
* 'debian-ubuntu-build-fixes-v2' of https://github.com/syuu1228/scylla:
dist/debian: build only scylla, iotune
dist/debian: switch to boost-1.65
dist/debian: switch to gcc-7.3
(cherry picked from commit bb4b1f0e91)
If there are a lot of ranges, e.g., num_tokens=2048, 10 ranges per
stream plan will cause tons of stream plan to be created to stream data,
each having very few data. This cause each stream plan has low transfer
bandwidth, so that the total time to complete the streaming increases.
It makes more sense to send a percentage of the total ranges per stream
plan than a fixed ranges.
Here is an example to stream a keyspace with 513 ranges in
total, 10 ranges v.s. 10% ranges:
Before:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 51
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 107 seconds
After:
[shard 0] range_streamer - Bootstrap with 127.0.0.1 for
keyspace=system_traces, 510 out of 513 ranges: ranges = 10
[shard 0] range_streamer - Bootstrap with ks for keyspace=127.0.0.1
succeeded, took 22 seconds
Message-Id: <a890b84fbac0f3c3cc4021e30dbf4cdf135b93ea.1520992228.git.asias@scylladb.com>
This reverts commit f792c78c96.
With the "Use range_streamer everywhere" (7217b7ab36) series,
all the user of streaming now do streaming with relative small ranges
and can retry streaming at higher level.
Reduce the time-to-recover from 5 hours to 10 minutes per stream session.
Even if the 10 minutes idle detection might cause higher false positive,
it is fine, since we can retry the "small" stream session anyway. In the
long term, we should replace the whole idle detection logic with
whenever the stream initiator goes away, the stream slave goes away.
Message-Id: <75f308baf25a520d42d884c7ef36f1aecb8a64b0.1520992219.git.asias@scylladb.com>
"
Terms
-----
querier: A class encapsulating all the logic and state needed to fill a
page. This Includes the reader, the compact_mutation object and all
associated state.
Preamble
--------
Currently for paged-queries we throw away all readers, compactors and
all associated state that contributed to filling the page and on the
next page we create them from scratch again. Thus on each page we throw
away a considerable amount of work, only to redo it again on the next
page. This has been one of the major contributors to latencies as from
the point of view of a replica each page is as much work as a fresh
query.
Solution
--------
The solution presented in this patch-series is to save queriers after
filling a page and reuse them on the next pages, thus doing the
considerable amount of work involved with creating the them only once.
On each page the coordinator will generate a UUID that identifies this
page. This UUID is used as the key, under which the contributing
queriers will be saved in the cache. On the next page the UUID from the
previous page will be used to lookup saved queriers, and the one from
the current one to saved them afterwards (if the query isn't finished).
These UUIDs (reader_recall_uuid and reader_save_uuid) are attached to
the page-state. Also attached to the page state is the list of replicas
hit on the last page. On the next page this list will be consulted to
hit the same replicas again, thus reusing the queriers saved on them.
Cached queriers will be evicted after a certain period of time to avoid
unecessary resource consumption by abandoned reads.
Cached queriers may also be evicted when the shard faces
resource-pressure, to free up resources.
Splitting up the work
---------------------
This series only fixes the singular-mutation query path, that is queries
that either fetch a single partition, or severeal single partitions (IN
queries). The fix for the scanning query path will be done in a
follow-up series, however much of the infrastructure needed for the
general querier reuse is already introduced by this series.
Ref #1865
Tests: unit-tests(debug, release), dtests(paging_test, paging_additional_test)
Benchmarking summary (read-from-disk)
-------------------------------------
1) Latency
BEFORE
latency mean : 58.0
latency median : 57.4
latency 95th percentile : 68.8
latency 99th percentile : 79.9
latency 99.9th percentile : 93.6
latency max : 93.6
AFTER
latency mean : 41.3
latency median : 40.5
latency 95th percentile : 50.8
latency 99th percentile : 68.9
latency 99.9th percentile : 89.2
latency max : 89.2
2) Throughput (single partition query)
sum(scylla_cql_reads):
BEFORE: 173'567
AFTER: 427'774
+246%
3) Throughput (IN query, 2 partitions)
sum(scylla_cql_reads):
BEFORE: 85'637
AFTER: 127'431
+148%
"
* '1865/singular-mutations/v8.2' of https://github.com/denesb/scylla: (23 commits)
Add unit test for resource based cache eviction
Add unit tests for querier_cache
Add counters to monitor querier-cache efficiency
Memory based cache eviction
Add buffer_size() to flat_mutation_reader
Resource-based cache eviction
Time-based cache eviction
Save and restore queriers in mutation_query() and data_query()
Add the querier_cache_context helper
Add querier_cache
Add querier
Add are_limits_reached() compact_mutation_state
Add start_new_page() to compact_mutation_state
Save last key of the page and method to query it
Make compact_mutation reusable
Add the CompactedFragmentsConsumer
Use the last_replicas stored in the page_state
query_singular(): return the used replicas
Consider preferred replicas when choosing endpoints for query_singular()
Add preferred and last replicas to the signature of query()
...
Specifically for the reader-permit based eviction. This test lives in a
separate executable as it uses with_cql_test_env() and thus needs a
main() of it's own.
"
This patchset is a part of a bigger effort for bringing our
microbenchmarking tests from the source tree to be used for regression
testing purposes with CI.
Now, it is possible to export results of tests run into JSON format that
can be stored in ElasticSearch and compared among runs to detect
performance degradation should it happen.
Example of JSON output (formatted for readability):
{
"results" :
{
"parameters" :
{
"read" : "64",
"read,skip,test_run_count" : "64,256,1",
"skip" : "256",
"test_run_count" : 1
},
"stats" :
{
"(KiB)" : 126960,
"aio" : 993,
"blocked" : 208,
"c blk" : 1,
"c hit" : 0,
"c miss" : 1,
"cpu" : 99.779365539550781,
"dropped" : 0,
"frag/s" : 311939.61559016741,
"frags" : 200000,
"idx blk" : 0,
"idx hit" : 0,
"idx miss" : 0,
"time (s)" : 0.641149729
}
},
"test_group_properties" :
{
"message" : "Testing scanning large partition with skips.\nReads whole range interleaving reads with skips according to read-skip pattern",
"name" : "large-partition-skips",
"needs_cache" : false,
"partition_type" : "large"
},
"versions" :
{
"scylla-server" :
{
"commit_id" : "4acfa17f4",
"date" : "20180306",
"run_date_time" : "2018-16-06 12:16:41",
"version" : "666.development"
}
}
}
"
* 'issues/2947/v6' of https://github.com/argenet/scylla:
Add support for JSON output format for perf_fast_forward results.
Wrap output for customization. Move all output handling to a single managing class.
Add the following counters:
(1) querier_cache_lookups
(2) querier_cache_misses
(3) querier_cache_drops
(4) querier_cache_time_based_evictions
(5) querier_cache_resource_based_evictions
(6) querier_cache_memory_based_evictions
(6) querier_cache_population
(1) counts the total number of querier cache lookups. Not all
page-fetches will result in a querier lookup. For example the first page
of a query will not do a lookup as there was no previous page to reuse
the querier from. The second, and all subsequent pages however should
attempt to reuse the querier from the previous page.
(2) counts the subset of (1) where the read have missed the querier
cache (failed to find a matching saved querier).
(3) counts the subset of (1) where the querier was recalled and dropped
immediately. This can happen for example if the querier was at the wrong
position.
(4) counts the cached queriers that were evicted due to their TTL
expiring.
(5) counts the cached queriers that were evicted due to reader-resource
(those limited by reader-concurrency limits) shortage.
(6) counts the cached queriers that were evicted due to reaching the
cache's memory limits (currently set to 4% of the shards' memory).
(7) is the current number of entries in the cache
Note:
* The count of cache hits can be derived from these counters as
(1) - (2).
* cache_drop (3) also implies a cache hit (see above). This means that
the number of actually reused queriers is:
(1) - (2) - (3)
To bound the memory consumption of the querier-cache the total memory
consumption of the cached queriers is limited to 4% of the shard's total
memory.
When inserting a new querier it is first checked whether it's insertion
would cause the limit to be crossed. If this is the case existing
entries are evicted until the memory consumption is sufficiently reduced
so that after inserting the querier it stays below the limit.
Cached queriers are evicted in LRU order as the oldest queriers are the
most likely to be evicted based on their TTL anyway.
To calculate the memory consumption of the cached queriers
flat_mutation_reader::buffer_size() is used. While this is not very
precise as it doesn't include object sizes and member containers it
gives a good picture of the memory consumption of the queriers.
Memory based cache eviction overlaps with resource-based cache eviction
but only to some degree as that only accounts the memory consumption of
sstable readers.
buffer_size() exposes the collective size of the external memory
consumed by the mutattion-fragments in the flat reader's buffer. This
provides a basis to build basic memory accounting on. Altought this is
not the entire memory consumption of any given reader it is the most
volatile component and usually by far the largest one too.
Readers serving user-reads need to obtain a permit to start reading.
There exists a restriction on how much active readers can be admitted
based on their count and their memory onsumption.
Since the saved readers of cached queriers are techically active (they
hold a permit) they can block new readers from obtaining a permit.
New readers have a higher priority because a cached reader might be
abandoned or used later at best so in the face of memory pressure we
evict cached readers to free up permits for new readers.
Cached queriers are evicted in LRU order as the oldest queriers are the
most likely to be evicted based on their TTL anyway.
Cached queriers should not sit in the cache indefinitely otherwise
abandoned reads would cause excess and unncessary resource-usage. Attach
an expiry timer to each cache-entry which evicts it after the TTL
passes.
Use the querier_cache (represented by the passed-in
querier_cache_context) object to lookup saved queriers at the start of
the page and save them at the end of it if it is likely that there will
be more page requests.
querier_cache_context is supposed to make propagating the cache and the
key down the layers. It comes bundled with some of the required
parameters (the lookup and save state) and aso hides all of the
boiler-plate of dealing with the cache (checking whether the key is
non-empty, etc.). It also makes it possible to not use the cache and
hide this from the lower layers.
This is the cache where suspended queriers are going to be saved between
pages. This is not a general purpose cache. It caters to the specific
needs of the querier recall mechanism. More specifically:
(1) Cache entries are of single-use, they are inserted once and the first
lookup removes them. Multiple items may be stored under a single key.
Identifying the correct one happens based on additional information like
the query range. Lookup knows to drop queriers when they cannot be used
to serve the next page.
(2) Cache entries are evicted after a certain time to avoid the
depletion of resources due to abandoned reads.
(3) Cache entries are evicted when facing reader-permit shortage, until
either enough permits are freed up or all entries are evicted.
(4) A memory limiter is set up which keeps the total memory consumption
of the cache under a limit (4% of memory) by evicting the oldest entries
when inserting a new one would cause the total memory consumption to go
above the limit.
(5) It updates the relevant counters of the db_stats.
This patch only implements (1), the other features will be implemented
in their own patches.
The querier encapsulates all objects needed to serve queries, except
result builders. It is designed to be suspendable, savable and
resumable. It contains all logic needed to suspend, resume and determine
whether the querier can be resumed or not.
It is the foundation upon which the "reader-reuse" mechanism is built.
are_limits_reached() allows querying whether the compactor reached
the page's limits. This is needed to determine whether there will be
more pages and thus whether the compact_mutation_state has to be kept
around.
start_new_page() resets the limits to the current page's ones and
sets the _empty_partition flag so that the partition header (if the last
page finished inside a partition) will be reemitted.
Make a copy of the current decorated-key in consume_end_of_stream() so
that it persists while the compaction state is suspended.
Also add current_partition() to allow client code to query the partition
the compaction is positioned in. This is needed to determine whether
the start position of the next page matches that of the
compact_mutation_state.
Currently compact_mutation is used as a use-once-then-throw-away object.
After it satisfies its consumer it's destroyed together with the
consumer. This conflicts with the effort to save and reuse readers and
associated infrastructure between pages of a query.
To resolve this conflict compact_mutation is split into two classes:
(1) compact_mutation_state
(2) compact_mutation
compact_mutation_state encapsulates all the compaction logic and state,
while compact_mutation continues to provide the same API using
compact_mutation_state behind the scenes.
compact_mutation_state doesn't store the consumer, instead its
consume_* methods are templated on the consumer and take it as an
argument. This allows compact_mutation_state to be independent of the
consumer's type.
Additionally compact_mutation can now be constructed from a shared
pointer to compact_mutation_state. This allows client code to
pre-construct a compaction state and retain it after the
compact_mutation object is destroyed.
These changes allow the state of a compaction to be saved and restored
later while code that is only interested in storing the saved state
can stay independent of the consumer's type.
This patch only contains the splitting of compact_mutation into
compact_mutation and compact_mutation_state. The next patches will add
the missing functionality that is needed to make compact_mutation_state
truly reusable across pages.
Pass the last_replicas from the page_state as the preferred_replicas
for query() and save the returned last_replicas as the last_replicas
field of the next page_state. The circle is now complete. The first page
of any query will pass an empty list as the preferred replicas (having
no previous paging_state) so the replicas will be selected according to
the load-balancing strategy. Any subsequent page will use the last
replicas from the last page as the preferred ones for the current one.
Thus if all goes well all pages of a query will hit the same replicas.
This patch implements the last_replicas returning part of the query()
signature changes for singular queries. It allows for client code to
save the last returned replicas and pass it to query() on the next page
as the preferred-replicas parameter, thus faciliate the read requests
for the next page hitting the same replicas.
Propagate the preferred_replicas to db::filter_for_query() and consider
them when selecting the endpoints. The algoritm for selecting the
endpoints is as follows:
* Compute the intersection of the endpoint candidates and the
preferred endpoints.
* If this yields a set of endpoints that already satisfies the CL
requirements use this set.
* Otherwise select the remaining endpoints according to the
load-balancing strategy, just like before.
preferred_replicas are added to the parameters and last_replicas are
added to the return type. The preferred replicas will be used as a hint
for the selection of the replicas to send the read requests to. The last
replicas (returned) are the replicas actually selected for the read.
This will allow queries to consistently hit the same replicas for each
page thus reusing readers created on these replicas.
For convenience a query() overload is provided that doesn't take or
return the preferred and last replicas.
This patch only adds the parameters and propagates them down to
query_singular() and query_partition_key_range(). The code to actually
use these preferred-replicas will be added in later patches.
This reason for separating this is to reduce noise and improve
reviewability for those functional changes later.
Helps paged queries consistently hit the same replicas for each
subsequent page. Replicas that already served a page will keep the
readers used for filling it around in a cache. Subsequent page request
hitting the same replicas can reuse these readers to fill the pages
avoiding the work of creating these readers from scratch on every page.
In a mixed cluster older coordinators will ignore this value.
The value of last_replicas may change between pages as nodes may become
available/unavailable or the coordinator may decide to send the read
requests to different replicas at its discretion.
Replicas are identified by an opaque uuid which should only make sense
to the storage-proxy.
This patch adds the parameter to read_command which is needed for
caching of readers during multiple pages of a paged queries, which
we will introduce in the next patches.
The query_uuid is a UUID of a previously saved reader, which
the replica is now asked to recall and resume (if this saved reader is
no longer in the cache, it is fine, a new reader will be started).
Additionally a helper flag is_first_page is added so that the replica
can avoid doing any cache lookups (and incrementing miss counters) for
the first page.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds to the "paging_state", the opaque cookie that clients are
supposed to provide when asking for the next page on a paged query, a
unique id field. This new field will be used to tell that a new request
for a page really continues the previous page, and doesn't just by chance
start at the same position the previous page stopped.
We need to support setups with mixed versions - a client may get a paging
state from a coordinator running a new version of Scylla and send it to
a different coordinator running an old version - or vice versa. So the new
uuid field is set up to have a default uuid of UUID() (a recognizable
invalid uuid 0), so new versions receiving no uuid from an old version will
set this invalid uuid, and old versions receiving a uuid from a new version
will simply ignore it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Refactor some includes to reduce dependencies on messaging_service.hh,
which can change quite a lot as it includes many unrelated items itself.
Tests: build
* tag 'includes/messaging_service.hh/v1' of https://github.com/avikivity/scylla:
tests: reduce dependencies in test_services.hh
migration_manager: remove dependency on messaging_service.hh in header
messaging_service: move msg_addr into its own header file
Convert storage_service_for_test to a pimpl implementation to
reduce dependencies. Tests that depended on those includes were
fixed to include their dependencies directly.
Try to avoid recompilations by reducing inclusions of system_keyspace.hh
in other header files.
Tests: unit (release)
* tag 'system_keyspace.hh/v1' of https://github.com/avikivity/scylla:
storage_service: remove system_keyspace.hh include
locator: de-inline reconnectable_snitch_helper
locator: de-inline production_snitch_base
cql3: remove #include of system_keyspace.hh
De-inlining allows us to remove some dependencies, and those functions
are too complex to inline anyway.
A few always-throwing functions get the [[noreturn]] attribute to
avoid damaging code generation.