Before this patch only ranges between returned row fragments were
marked as continuous. In the extreme case, there could be no such
fragments, in which case next read would miss as well. To avoid this,
mark whole query range as continuous by inserting dummy entries when
necessary.
Refs #2579.
"When reading a single row it is possible that the read will be satisfied
by just reading from one of the data source candidates. To exploit this
an optimization is employed which sorts data source candidates by their
timestamp and reads mutations from the most recent to the oldest. When
all needed cells are present and their earliest timestamp is still
later than the latest one of the remaining data source the read can be
terminated early.
However this optimization also has the possibility to backfire as the
data sources are read sequentially, so if all of them has to be read
eventually then we will end up worse then without it.
Thus the optimization can be disabled up-front or enabled to only run
until its efficiency degrades below a certain threshold.
Also counters are added to column-families to make it possible to
observe how well it performs.
Benchmarking
Benchmarking was done with disabled cache and at a constant op rate of
4k (1/3 of the max op rate on my box), against 3 sstables containing the
same 10000 rows.
1) Optimization turned off (all sstables read paralelly)
latency mean : 1.3 [simple:1.3]
latency median : 1.0 [simple:1.0]
latency 95th percentile : 2.4 [simple:2.4]
latency 99th percentile : 2.9 [simple:2.9]
latency 99.9th percentile : 8.0 [simple:8.0]
latency max : 13.5 [simple:13.5]
2) Optimization turned on, best case (1 of 3 sstables read)
latency mean : 0.6 [simple:0.6]
latency median : 0.6 [simple:0.6]
latency 95th percentile : 1.0 [simple:1.0]
latency 99th percentile : 1.2 [simple:1.2]
latency 99.9th percentile : 4.4 [simple:4.4]
latency max : 13.4 [simple:13.4]
3) Optimization turned on, best case, IN query (1 of 3 sstables read)
latency mean : 0.7 [simple_in:0.7]
latency median : 0.6 [simple_in:0.6]
latency 95th percentile : 1.1 [simple_in:1.1]
latency 99th percentile : 1.4 [simple_in:1.4]
latency 99.9th percentile : 5.4 [simple_in:5.4]
latency max : 16.8 [simple_in:16.8]
4) Optimization turned on, worst case (3 of 3 sstables read sequentally)
latency mean : 2.8 [simple:2.8]
latency median : 2.3 [simple:2.3]
latency 95th percentile : 5.4 [simple:5.4]
latency 99th percentile : 6.5 [simple:6.5]
latency 99.9th percentile : 13.5 [simple:13.5]
latency max : 19.2 [simple:19.2]
5) Optimization turned on, mid case (2 of 3 sstables read sequentally)
latency mean : 1.4 [simple:1.4]
latency median : 1.1 [simple:1.1]
latency 95th percentile : 2.7 [simple:2.7]
latency 99th percentile : 3.2 [simple:3.2]
latency 99.9th percentile : 7.7 [simple:7.7]
latency max : 15.1 [simple:15.1]"
Ref #324
* 'bdenes/optimize_single_row_read_v6' of github.com:denesb/scylla:
Add unit tests for single_key_sstable_reader
Add counters for the single-key reader optimization
Add single_key_parallel_scan_threshold option
single_key_sstable_reader: optimize single-row queries
single_key_sstable_reader: move reading code into it's own method
Add selects_only_full_rows() and selects_only_full_rows_with_atomic_columns()
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.
Refs #2885
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
"This changeset is the first step to flatten mutation_reader.
Then it introduces new mutation_fragment types for partition header and end of partition.
Using those a new flat_mutation_reader is defined.
Finally it introduces converters between new flat_mutation_reader and
old mutation_reader."
* 'haaawk/flattened_mutation_reader_v12' of github.com:scylladb/seastar-dev:
Add tests for flat_mutation_reader
Introduce conversion from flat_mutation_reader to mutation_reader
Introduce conversion from mutation_reader to flat_mutation_reader
Introduce flat_mutation_reader
Extract FlattenedConsumer concept using GCC6_CONCEPT
Introduce partition_end mutation_fragment
Introduce a position for end of partition
Introduce partition_start mutation_fragment
Introduce FragmentConsumer
Introduce a position for partition start
streamed_mutation: Extract concepts using GCC6_CONCEPT macro
Those tests run mutation source test for all sources
using conversion to and from flat_mutation_reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
For every finished compaction, we were calculating shards for all
existing tables. With ignore_msb set to 0, it's probably not a big
deal, but if ignore_msb is like 12 and LCS is used (meaning thousands
of tables possibly), the operation may stall the reactor for a
considerable amount of time. That's fixed by caching shards.
Fixes#2875.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20171011053424.22308-1-raphaelsc@scylladb.com>
This type of mutation_fragment will be used in new mutation_reader
to signal the beginning of the next partition.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"This series implements CAST AS functions in scylla.
It allows to use expressions of the form CAST(x AS type) in select statements.
Primary motivation for this functions came from aggregate functions, because
function avg(.) gives rounded results for interger columns. Now it is possible
to convert such column to float/double and obtain floating point results:
SELECT ... avg(cast(x as double)), ...
Fixes #2280."
* 'danfiala/2280-patch-series-v2' of https://github.com/hagrid-the-developer/scylla:
tests: Add test for CAST AS functions.
cql3: Add support for CAST AS functions to ANTLR grammar.
cql3/selectable: Add selectable::with_cast for CAST AS functions.
cql3/functions: Add support for CAST AS functions.
types:: Add support for CAST AS functions.
types: Moved code that implements conversion of types' values to string.
"Currently restricting_mutation_reader restricts mutation_readears on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amound of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
all consumed.
Fixes #2692."
* 'bdenes/restricting_mutation_reader-v5' of https://github.com/denesb/scylla:
Update reader restriction related metrics
Add restricted_reader_test unit test
restricted_mutation_reader: restrict based-on memory consumption
mutation_reader.hh: Move restricted_reader related code
"
The original motivation for the "utils: introduce a loading_shared_values" series was a hinted handoff work where
I needed an on-demand asynchronously loading key-value container (a replica address to a commitlog instance map).
It turned out that we already have the classes that do almost what I needed:
- utils::loading_cache
- sstables::shared_index_lists
Therefore it made sense to find a common ground, unify this functionality and reuse the code both in the classes above and in the
new hinted handoff code.
This series introduces the utils::loading_shared_values that generalizes the sstables::shared_index_lists
API on top of bi::unordered_set with the rehashing logic from the utils::loading_cache triggered by an addition
of an entry to the set (PATCH1).
Then it reworks the sstables::shared_index_lists and utils::loading_cache on top of the new class (PATCH2 and PATCH3).
PATCH4 optimizes the loading_cache for the long timer period use case.
But then we have discovered that we have another "customer" for the loading_cache. Apparently our prepared statements cache
had a birth flaw - it was unlimited in size - unless the corresponding keyspace and/or table are modified/dropped the entries
are never evicted. We clearly need to limit its size and it would also make sense to evict the cache entries that haven't been
used long enough.
This seems like a perfect match for a utils::loading_cache except for prepared statements don't need to be reloaded after
they are created.
Patches starting from PATCH5 are dealing with adding the utils::loading_cache the missing functionality (like making the "reloading"
conditional and adding the synchronous methods like find(key)) and then transitioning the CQL and Thrift prepared statements
caches to utils::loading_cache.
This also fixes #2474."
* 'evict_unused_prepared-v5' of https://github.com/vladzcloudius/scylla:
tests: loading_cache_test: initial commit
cql3::query_processor: implement CQL and Thrift prepared statements caches using cql3::prepared_statements_cache
cql3: prepared statements cache on top of loading_cache
utils::loading_cache: make the size limitation more strict
utils::loading_cache: added static_asserts for checking the callbacks signatures
utils::loading_cache: add a bunch of standard synchronous methods
utils::loading_cache: add the ability to create a cache that would not reload the values
utils::loading_cache: add the ability to work with not-copy-constructable values
utils::loading_cache: add EntrySize template parameter
utils::loading_cache: rework on top of utils::loading_shared_values
sstables::shared_index_list: use utils::loading_shared_values
utils: introduce loading_shared_values
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
"Currently restricting_mutation_reader restricts mutation_readears on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amound of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
all consumed."
Fixes#2692.
* 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla:
Update reader restriction related metrics
Add restricted_reader_test unit test
restricted_mutation_reader: restrict based-on memory consumption
mutation_reader.hh: Move restricted_reader related code
The reason to do that is because compaction can deadlock if refresh
disables write which waits for compaction, and compaction in turn
waits for dirty memory[1] that would be released by memtable write.
Dirty memory manager for non-system cfs was being used for system cfs,
which was useful for exposing this problem.
[1]: when updating compaction history.
Fixes#2769.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>
"Fixes the problem of concurrent populations of clustering row ranges
leading to some readers skipping over some of the rows.
Spotted during code review.
Fixes #2834."
* tag 'tgrabiec/fix-cache-reader-skipping-rows-v2' of github.com:scylladb/seastar-dev:
tests: mvcc: Add test for partition_snapshot_row_cursor
tests: row_cache: Add test for concurrent population
tests: row_cache: Make populate_range() accept partition_range
tests: Add simple_schema::make_ckey_range()
cache_streamed_mutation: Add missing _next_row.maybe_refresh() call
mvcc: partition_snapshot_row_cursor: Fix cursor skipping over rows added after its position
mvcc: partition_snapshot_row_cursor: Rename up_to_date() to iterators_valid()
mvcc: Keep track of all iterators in partition_snapshot_row_cursor
mvcc: Make partition_snapshot_row_cursor printable
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
When incremental_reader_selector is used for compaction, it will
first call incremental selector of partitioned sstable set with
minimum token that will result in first interval being skipped,
which means not everything being compacted. The interval is
skipped because iterator is incorrectly advanced when token
lies before it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918021446.15920-1-raphaelsc@scylladb.com>
Forward-declare untyped_result_set and untyped_result_set_row, and remove
the include from query_processor.hh.
Message-Id: <20170916170859.27612-3-avi@scylladb.com>
log_histogram is not really a histogram, it is a heap-like container.
Rename to log_heap in case we do want a log_histogram one day.
Message-Id: <20170916172137.30941-1-avi@scylladb.com>
- Transition the prepared statements caches for both CQL and Trhift to the cql3::prepared_statements_cache class.
- Add the corresponding metrics to the query_processor:
- Evictions count.
- Current entries count.
- Current memory footprint.
Fixes#2474
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"When there are at least 2 nodes upgraded to 2.0, and the two exchanged schema
for some reason, reads or writes which involve both 1.7 and 2.0 nodes may
start to fail with the following error logged:
storage_proxy - Exception when communicating with 127.0.0.3: Failed to load schema version 58fc9b89-74ab-37ca-8640-8b38a1204f8d
The situation should heal after whole cluster is upgraded.
Table schema versions are calculated by 2.0 nodes differently than 1.7 nodes
due to change in the schema tables format. Mismatch is meant to be avoided by
having 2.0 nodes calculate the old digest on schema migration during upgrade,
and use that version until next time the table is altered. It is thus not
allowed to alter tables during the rolling upgrade.
Two 2.0 nodes may exchange schema, if they detect through gossip that their
schema versions don't match. They may not match temporarily during boot, until
the upgraded node completes the bootstrap and propagates its new schema
through gossip. One source of such temporary mismatch is construction of new
tracing tables, which didn't exist on 1.7. Such schema pull will result in a
schema merge, which cause all tables to be altered and their schema version to
be recalculated. The new schema will not match the one used by 1.7 nodes,
causing reads and writes to fail, because schema requesting won't work during
rolling upgrade from 1.7 to 2.0.
The main fix employed here is to hold schema pulls, even among 2.0 nodes,
until rolling upgrade is complete."
* 'tgrabiec/fix-schema-mismatch' of github.com:scylladb/seastar-dev:
tests: schema_change_test: Add test_merging_does_not_alter_tables_which_didnt_change test case
tests: cql_test_env: Enable all features in tests
schema_tables: Make make_scylla_tables_mutation() visible
migration_manager: Disable pulls during rolling upgrade from 1.7
storage_service: Introduce SCHEMA_TABLES_V3 feature
schema_tables: Don't alter tables which differ only in version
schema_mutations: Use mutation_opt instead of stdx::optional<mutation>
evict() doesn't guarantee that the whole partition is discontinuous.
In particular, partition tombstone cannot be marked as discontinuous.
The parts which are still continuous must be updated.
Broken after c78047fa5b.
Message-Id: <1505375684-28574-1-git-send-email-tgrabiec@scylladb.com>