"
This fixes an abort in an sstable reader when querying a partition with no
clustering ranges (happens on counter table mutation with no live rows) which
also doesn't have any static columns. In such case, the
sstable_mutation_reader will setup the data_consume_context such that it only
covers the static row of the partition, knowing that there is no need to read
any clustered rows. See partition.cc::advance_to_upper_bound(). Later when
the reader is done with the range for the static row, it will try to skip to
the first clustering range (missing in this case). If clustering_ranges_walker
tells us to skip to after_all_clustering_rows(), we will hit an assert inside
continuous_data_consumer::fast_forward_to() due to attempt to skip past the
original data file range. If clustering_ranges_walker returns
before_all_clustering_rows() instead, all is fine because we're still at the
same data file position.
Fixes#3304.
"
* 'tgrabiec/fix-counter-read-no-static-columns' of github.com:scylladb/seastar-dev:
tests: mutation_source_test: Test reads with no clustering ranges and no static columns
tests: simple_schema: Allow creating schema with no static column
clustering_ranges_walker: Stop after static row in case no clustering ranges
(cherry picked from commit 054854839a)
"
Terms
-----
querier: A class encapsulating all the logic and state needed to fill a
page. This Includes the reader, the compact_mutation object and all
associated state.
Preamble
--------
Currently for paged-queries we throw away all readers, compactors and
all associated state that contributed to filling the page and on the
next page we create them from scratch again. Thus on each page we throw
away a considerable amount of work, only to redo it again on the next
page. This has been one of the major contributors to latencies as from
the point of view of a replica each page is as much work as a fresh
query.
Solution
--------
The solution presented in this patch-series is to save queriers after
filling a page and reuse them on the next pages, thus doing the
considerable amount of work involved with creating the them only once.
On each page the coordinator will generate a UUID that identifies this
page. This UUID is used as the key, under which the contributing
queriers will be saved in the cache. On the next page the UUID from the
previous page will be used to lookup saved queriers, and the one from
the current one to saved them afterwards (if the query isn't finished).
These UUIDs (reader_recall_uuid and reader_save_uuid) are attached to
the page-state. Also attached to the page state is the list of replicas
hit on the last page. On the next page this list will be consulted to
hit the same replicas again, thus reusing the queriers saved on them.
Cached queriers will be evicted after a certain period of time to avoid
unecessary resource consumption by abandoned reads.
Cached queriers may also be evicted when the shard faces
resource-pressure, to free up resources.
Splitting up the work
---------------------
This series only fixes the singular-mutation query path, that is queries
that either fetch a single partition, or severeal single partitions (IN
queries). The fix for the scanning query path will be done in a
follow-up series, however much of the infrastructure needed for the
general querier reuse is already introduced by this series.
Ref #1865
Tests: unit-tests(debug, release), dtests(paging_test, paging_additional_test)
Benchmarking summary (read-from-disk)
-------------------------------------
1) Latency
BEFORE
latency mean : 58.0
latency median : 57.4
latency 95th percentile : 68.8
latency 99th percentile : 79.9
latency 99.9th percentile : 93.6
latency max : 93.6
AFTER
latency mean : 41.3
latency median : 40.5
latency 95th percentile : 50.8
latency 99th percentile : 68.9
latency 99.9th percentile : 89.2
latency max : 89.2
2) Throughput (single partition query)
sum(scylla_cql_reads):
BEFORE: 173'567
AFTER: 427'774
+246%
3) Throughput (IN query, 2 partitions)
sum(scylla_cql_reads):
BEFORE: 85'637
AFTER: 127'431
+148%
"
* '1865/singular-mutations/v8.2' of https://github.com/denesb/scylla: (23 commits)
Add unit test for resource based cache eviction
Add unit tests for querier_cache
Add counters to monitor querier-cache efficiency
Memory based cache eviction
Add buffer_size() to flat_mutation_reader
Resource-based cache eviction
Time-based cache eviction
Save and restore queriers in mutation_query() and data_query()
Add the querier_cache_context helper
Add querier_cache
Add querier
Add are_limits_reached() compact_mutation_state
Add start_new_page() to compact_mutation_state
Save last key of the page and method to query it
Make compact_mutation reusable
Add the CompactedFragmentsConsumer
Use the last_replicas stored in the page_state
query_singular(): return the used replicas
Consider preferred replicas when choosing endpoints for query_singular()
Add preferred and last replicas to the signature of query()
...
Specifically for the reader-permit based eviction. This test lives in a
separate executable as it uses with_cql_test_env() and thus needs a
main() of it's own.
"
This patchset is a part of a bigger effort for bringing our
microbenchmarking tests from the source tree to be used for regression
testing purposes with CI.
Now, it is possible to export results of tests run into JSON format that
can be stored in ElasticSearch and compared among runs to detect
performance degradation should it happen.
Example of JSON output (formatted for readability):
{
"results" :
{
"parameters" :
{
"read" : "64",
"read,skip,test_run_count" : "64,256,1",
"skip" : "256",
"test_run_count" : 1
},
"stats" :
{
"(KiB)" : 126960,
"aio" : 993,
"blocked" : 208,
"c blk" : 1,
"c hit" : 0,
"c miss" : 1,
"cpu" : 99.779365539550781,
"dropped" : 0,
"frag/s" : 311939.61559016741,
"frags" : 200000,
"idx blk" : 0,
"idx hit" : 0,
"idx miss" : 0,
"time (s)" : 0.641149729
}
},
"test_group_properties" :
{
"message" : "Testing scanning large partition with skips.\nReads whole range interleaving reads with skips according to read-skip pattern",
"name" : "large-partition-skips",
"needs_cache" : false,
"partition_type" : "large"
},
"versions" :
{
"scylla-server" :
{
"commit_id" : "4acfa17f4",
"date" : "20180306",
"run_date_time" : "2018-16-06 12:16:41",
"version" : "666.development"
}
}
}
"
* 'issues/2947/v6' of https://github.com/argenet/scylla:
Add support for JSON output format for perf_fast_forward results.
Wrap output for customization. Move all output handling to a single managing class.
Convert storage_service_for_test to a pimpl implementation to
reduce dependencies. Tests that depended on those includes were
fixed to include their dependencies directly.
De-inlining allows us to remove some dependencies, and those functions
are too complex to inline anyway.
A few always-throwing functions get the [[noreturn]] attribute to
avoid damaging code generation.
unsigned type was incorrectly used for keeping track of min and max
timestamp, so a negative number would be treated as a very high
number that would *incorrectly* end up as max timestamp in sstable
metadata.
Fixes#3000.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20180308162217.18963-1-raphaelsc@scylladb.com>
Partitions corresponding to keys have 40k rows. With row-level
eviction touching them inside the loop became a serious performance
issue, because touch() now needs to walk over all rows.
The JSON output is arranged in a way that makes it easier to upload
results to ElasticSearch.
All the tests results are placed under the perf_forward_data_output/ directory
For test groups, we create separate subdirectories where we save results
from runs of tests in those groups.
For each test run, we store results in a separate file named:
<dash-separated-param-list>.<run-number>.json
where
<dash-separated-param-list> is a dash-separated list of parameters of the current
test, e.g., 1-64 (for read-skip pattern).
<run-number> is the number of run of this test with the specified
parameters. This is needed as the same list of parameters can be
used more than once (for instance, when cache is enabled).
Those numbers start with 1, i.e., 1, 2, 3.
So, the path to a resulting JSON file may look like:
perf_fast_forward_output/large-partition-skips/64-4096.1.json
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Instead of passing the output parameters to std::cout straight away, use
helper wrappers. This will allow us to add more formats for gathered
tests results.
Introduce helper writer classes hierarchy that can be extended to
support different output formats (JSON, XML, etc).
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Instead of evicting whole partitions, evicts whole rows.
As part of this, invalidation of partition entries was changed to not
evict from snapshots right away, but unlink them and let them be
evicted by the reclaimer.
This change is a preparation for introducing row-level eviction, such that entries
can be evicted from older versions without having to touch other versions.
Currently continuity flags on entries are interpreted relative to the
combined view merged from all entries. For example:
v2: <key=2, cont=1>
v1: <key=1, cont=1>
In v2, the flag on entry key=2 marks the range (1, 2) as
continuous. This is problematic because if the old version is evicted, continuity
will change in an incorrect way:
v2: <key=2, cont=1>
Here, the range (-inf, 1) would be marked as continuous, which is not true.
To solve this problem, we change the rules for continuity
interpretation in MVCC. Each version will have its own continuity,
fully specified in that version, independent of continuity of other
versions. Continuity of the snapshot will be a union of continuous
ranges in each version.
It is assumed that continuous intervals in different versions are non-
overlapping, except for points corresponding to complete rows, in
which case a later version may overlap with an older version
(overwrite). We make use of this assumption to make calculation of the
union of intervals on merging easier. I make use of the above
assumption in mutation_partition::apply_monotonically().
MVCC population of incomplete entries already almost maintains the
non-overlapping invariant, because population intervals correspond to
intervals which are incomplete in the old snapshot. The only change
needed is to ensure that both population bounds will have entries in
the latest version. Population from memtables doesn't mark any
intervals as continuous, so also conforms. The only change needed
there is to not inherit continuity flags from the old snapshot,
effectively making the new version internally discontinuous except for
row points.
The example from the beginning will become:
v2: <key=1, cont=0> <key=2, cont=1>
v1: <key=1, cont=1>
When marking a range as continuous with some rows present only in
older versions, we need to insert entries in the latest version, so
that we can mark the range as continuous. The easiest solution is to
copy the entry from the old version. Another option would be to add
support for incomplete rows and insert such instead. This way we would
avoid duplicating row contents. This optimization is deferred.
Simply copying mutations which are not fully continuous may violate
MVCC invariants, like the one about non-overlapping continuity which
will be added later. Use apply_to_incomplete() instead.
This unfortunately reduces strenght of the test, since the continuity
of the entry is now completely determined by the first version. We should
use populate() instead, but it doesn't exist yet. It could be extracted
from cache_streamed_mutation, but that's not an easy change.
This is alleviated by adding a similar test to row_cache_test_g, in a
later patch.
This patch fixes an issue with test_propagation(), where the test
assumed that after the future returned from wait_for_pending(0)
resolved, the continuations set for the post operation had already
run, which is not true.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180305131908.7667-1-duarte@scylladb.com>
reader_wrapper's _timeout defaults to now(), which means to time
out immediately rather than no timeout.
Fix by switching to a time_point, defaulting to no_timeout, and
provide a compatible constructor (with a duration parameter) for
callers that do want a duration-based timeout.
Tests: mutation_reader_test (debug, release)
Message-Id: <20180305111739.31972-1-avi@scylladb.com>
The message in question is printed with printf() which is bad by itself.
And most importantly this test uses a single .property file so this message
doesn't add any interesting information to begin with. Therefore it makes
more sense to drop it than to fix it.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
Message-Id: <1519661059-13325-1-git-send-email-vladz@scylladb.com>
Since quoted names are allowed for role names, we add a more descriptive
error message when a quoted name is (erroneously) used for a user name.
This behavior is consistent with Apache Cassandra.
This patch changes the syntax for CQL statements related to roles to
favor a form like
CREATE ROLE sam WITH PASSWORD = 'shire' AND LOGIN = false;
instead of
CREATE ROLE sam WITH PASSWORD 'shire' NOLOGIN;
This new syntax has the benefit of not imposing any ordering constraints
on the modifiers for roles and being consistent with other parts of the
CQL grammar. It is also consistent with syntax in Apache Cassandra.
The old USER-based statements (CREATE USER and ALTER USER) still have
the old forms for backwards compatibility.
A previous change modified the USER-related statements to allow for the
OPTIONS option. However, this was a mistake; only the PASSWORD option
should have been allowed. This patch also corrects this mistake.
These are quick-running tests for verifying the accepted forms of CQL
statements (and fragments) related to access-control: users, roles, and
permissions.
Establishing the allowed forms of statements is helpful for reference,
but also makes syntax changes (like those expected in later patches)
clearer and more safe.
"
Adds extension points to schema/sstables to enable hooking in
stuff, like, say, something that modifies how sstable disk io
works. (Cough, cough, *encryption*)
Extensions are processed as property keywords in CQL. To add
an extension, a "module" must register it into the extensions
object on boot time. To avoid globals (and yet don't),
extensions are reachable from config (and thus from db).
Table/view tables already contain an extension element, so
we utilize this to persist config.
schema_tables tables/views from mutations now require a "context"
object (currently only extensions, but abstracted for easier
further changes.
Because of how schemas currently operate, there is a super
lame workaround to allow "schema_registry" access to config
and by extension extensions. DB, upon instansiation, calls
a thread local global "init" in schema_registry and registers
the config. It, in turn, can then call table_from_mutations
as required.
Includes the (modified) patch to encapsulate compression
into objects, mainly because it is nice to encapsulate, and
isolate a little.
"
* 'calle/extensions-v5' of github.com:scylladb/seastar-dev:
extensions: Small unit test
sstables: Process extensions on file open
sstables::types: Add optional extensions attribute to scylla metadata
sstables::disk_types: Add hash and comparator(sstring) to disk_string
schema_tables: Load/save extensions table
cql: Add schema extensions processing to properties
schema_tables: Require context object in schema load path
schema_tables: Add opaque context object
config_file_impl: Remove ostream operators
main/init: Formalize configurables + add extensions to init call
db::config: Add extensions as a config sub-object
db::extensions: Configuration object to store various extensions
cql3::statements::property_definitions: Use std::variant instead of any
sstables: Add extension type for wrapping file io
schema: Add opaque type to represent extensions
sstables::compress/compress: Make compression a virtual object
"This series adds the GoogleCloudSnitch.
Fixes#1619"
* 'google-cloud-snitch-v4' of https://github.com/vladzcloudius/scylla:
config: uncomment/add the supported snitches description
tests: added gce_snitch_test
locator::gce_snitch: implementation of the GoogleCloudSnitch
locator::snitch_base: properly log the failure during the snitch startup
The test inserts some values with a TTL of 1 second and then
reads them back expecting them not to be expired yet. That may not
always be the case if the machine is slow and we are running in the
debug mode. Increasising the TTLs by x100 should help avoid these
false positives.
Message-Id: <20180219133816.17452-1-pdziepak@scylladb.com>
Operations on a append_challenged_posix_file_impl schedule asynchronous
operations when they are executed, which capture the file object. To
synchronize with them and prevent use-after-free, we need to call
close() before destroying the file.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217170556.27330-1-duarte@scylladb.com>
This test relied on task execution order to work correctly. Namely, it
relied on parent regions being reclaimed before child regions
(reclaiming is an asynchronous process started by a call to
start_reclaiming()). This order is necessary because child regions
don't know about parent regions when calculating the biggest region
that should be reclaimed.
We fix this by forcing the reclaim order.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180217121655.26057-1-duarte@scylladb.com>