Before flat mutation readers sstable::read_row() returned a
future<streamed_mutation>. That required a helper reader that would wait
for the streamed_mutations from all relevant sstables to be created and
then construct a mutation merger.
With flat mutation readers sstable::read_row_flat() returns a
flat_mutation_reader (no futures) so that the code can be simplified by
collecting all the relevant readers and creating a combined reader
without suspension points.
The unfortunate disadvantage of the flat_mutation_reader-based approach
is the fact that combined reader now needlessly compares the partition
keys even though we know that we read only a single partition, but
optimising that is out of scope of this patch.
All users of the filtering reader need only the decorated key of a
partition, but currently the predicate is given a reference to
streamed_mutations which are obsolete now.
We have had an issue recently where failed SSTable writes left the
generated SSTables dangling in a potentially invalid state. If the write
had, for instance, started and generated tmp TOCs but not finished,
those files would be left for dead.
We had fixed this in commit b7e1575ad4,
but streaming memtables still have the same isse.
Note that we can't fix this in the common function
write_memtable_to_sstable because different flushers have different
retry policies.
Fixes#3062
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Message-Id: <20171213011741.8156-1-glauber@scylladb.com>
"Didn't affect any release. Regression introduced in 301358e.
Fixes#3041"
* 'resharding_fix_v4' of github.com:raphaelsc/scylla:
tests: add sstable resharding test to test.py
tests: fix sstable resharding test
sstables: Fix resharding by not filtering out mutation that belongs to other shard
db: introduce make_range_sstable_reader
rename make_range_sstable_reader to make_local_shard_sstable_reader
db: extract sstable reader creation from incremental_reader_selector
db: reuse make_range_sstable_reader in make_sstable_reader
introduce reader variant that will allow its caller to read a range
in a given table without any filter applied.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Tomek says:
"I think that the least surprising behavior for a function named like this
is to read the sstables unfiltered (it just reads them), and the filtering
should be indicated specially in the name or by accepting a parameter."
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For now only the interface is converted, behind the scenes the previous
implementation remains, it's output is simply converted by
flat_mutation_reader_from_mutation_reader. The implementation will be
converted in the following patches.
The wrapper is no longer needed because
read_range_rows returns ::mutation_reader instead of
sstables::mutation_reader and the reader returned from
it keeps the pointer to shared_sstable that was used to
create the reader.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Since we started accounting virtual dirty memory we no longer have a cap
on real dirty memory. In most situations that is not needed, since real
dirty will just be at most twice as much as virtual dirty (current
flushing memtable plus new memtable).
However, due to things like cache updates and component flushing we can
end up having a lot of memtables that are virtually freed but not yet
fully released, leading real dirty memory to explode using all the box'
memory.
This patch adds a cap on real dirty memory as well. Because of the
hierarchical nature of region_group, if the parent blocks due to memory
depletion, so will the child (virtual dirty region group).
A next step is to add a controller that will increase the priority of
the tasks involving in releasing real dirty memory if we get dangerously
close to the threshold.
Signed-off-by: Glauber Costa <glauber@scylladb.com>
Add two counters, one to determine how many of the reads fall into the
optimization, and a second one to determine it's effectiveness.
The first one is single_key_reader_optimization_hit_rate. It contains
the rate of reads that the optimization applies to out of all the reads
that go into the single_key_sstable_reader.
The second one, single_key_reader_optimization_extra_read_proportion is
a histogram of the efficiency of the optimization. It contains the
proportion of extra data-sources read. It's a number between 0 and 1,
where 0 is the best case (only one data-source was read) and 1 is the
worst case (all data-sources were read eventually). This is the same
number that is used for the threshold option (see previous patch).
Each of the histogram's buckets cover a chunk of 0.1 from the [0, 1]
range.
Note that single_key_parallel_scan_threshold effectively provides an
upper bound for the proportion as the optimization is turned off as soon
as it goes above that number.
The counters are disabled if single_key_parallel_scan_threshold is set
to 0 disabling the optimization entirely.
This option regulates when exactly the single-key optimization is
considered ineffective and turned off.
The threshold is the proportion of the extra data source candidates that
can be read before the optimization is considered ineffective and
disabled. The proportion is calculated as follows:
(read_data_sources - 1) / (total_data_sources - 1)
We substract 1 from the read_data_sources and total_data_sources to
effectively measure the rate of *extra* data sources we read. This
makes sure that the proportion is meaningful even if e.g. we have only
have a total of 2 data-sources and we read only 1 (best case).
Whenever this number goes above the threshold the optimization is
disabled. The threshold is number between 0 and 1, 0 forces the
optimization off and 1 forces it on. Increase the treshold to favor
throughput over latency for single-row reads, decrease the treshold to
improve latency at the expense of throughput.
If the threshold is > 0 (it's not force disabled) and the optimization
is disabled due to a read crossing the threshold, we will issue
"probing" reads (every 100th read) to determine if the optimization is
worth re-enabling. Probing reads are allowed to run through the
optimization path and if they go below the threshold the optimization is
re-enabled.
For single-row queries that only query atomic cells one can put a lower
bound on the timestamps which may affect the query results and thus rule
out entire data sources. This allows the query to read only those
sstables that actually contribute to the result.
To do this we incrementally move through the sstables overlapping with
the query range, checking after each read mutation whether we already
have a value for all required cells and whether the lower-bound of their
timestamps is higher than the upper-bound of the timestamps of all the
remaining data-sources. When this condition is met we terminate the
read.
query::full_slice doesn't select any regular or static columns, which
is at odds with the expectations of its users. This patch replaces it
with the schema::full_slice() version.
Refs #2885
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <1507732800-9448-2-git-send-email-duarte@scylladb.com>
gossiper::get_endpoint_state_for_endpoint() returns a copy of
endpoint_state, which we've seen can be very expensive.
This patch adds a similar function which returns a pointer instead,
and changes the call sites where using the pointer-returning variant
is deemed safe (the pointer neither escapes the function, nor crosses
any defer point).
Fixes#764
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
"This patch series adds backing materialized view for secondary indices.
When a new index is created with the 'CREATE INDEX' statement, a backing
materialized view is created automatically.
For example, assuming the following table:
CREATE TABLE ks1.users (
userid uuid,
email text,
PRIMARY KEY (userid)
);
When the following index is created:
CREATE INDEX user_email ON ks1.users (email);
The following materialized view is also created:
cqlsh> DESCRIBE ks1.users;
<snip>
CREATE MATERIALIZED VIEW ks1.user_email_index AS
SELECT email, userid
FROM ks1.users
WHERE email IS NOT NULL
PRIMARY KEY (email, userid)
WITH CLUSTERING ORDER BY (userid ASC)
AND bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = ''
AND compaction = {'class': 'SizeTieredCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1.0
AND dclocal_read_repair_chance = 0.1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND read_repair_chance = 0.0
AND speculative_retry = '99.0PERCENTILE';
CQL queries will use the backing materialized view as part of queries on
indexed columns to fetch the primary keys."
* 'penberg/cql-2i-backing-view/v3' of github.com:scylladb/seastar-dev:
schema_tables: Create backing view for indices
database: Kill obsolete secondary index manager stub
cql3: Wire up secondary index manager
cql3/restrictions: Add term_slice::is_supported_by() function
index: Add secondary_index_manager::create_view_for_index()
index: Add target_parser::parse() helper
cql3/statements: Add index_target::from_sstring() helper
index: Add secondary_index_manager::get_dependent_indices()
index: Add secondary_index_manager::reload()
index: Add secondary_index_manager::list_indexes()
index: Add index class
index: Pass column_family to secondary_index_manager constructor
database: Make secondary index manager per-column family
This patch wires calls to secondary index manager reload() in
merge_tables_and_views() and changes make_update_indices_mutations() to
also create mutations for the backing materialized view. After this
patch, "CREATE INDEX" CQL statement also creates a materialized view.
Update description of existing reader count metrics, add memory
consumption metrics. Use labels to distinguish between system, user and
streaming reads related metrics.
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.
"Currently restricting_mutation_reader restricts mutation_readears on a
count basis. This is inaccurate on multiple levels. The reader might be
a combined_mutation_reader, which might be composed of multiple
individual readers, whose number might change during the lifetime of the
reader. The memory consumption of the readers can vary and may change
during the lifetime of the reader as well.
To remedy this, make the restriction memory-consumption based. The
restricting semaphore is now configured with the amound of memory
(bytes) that its readers are allowed to consume in total. New readers
consume 128k units up-front to account for read-ahead buffers, and then
consume additional units for any buffer (returned
from input_stream<>::read()) they keep around.
Like before, readers already allowed to read will not be blocked,
instead new readers will be blocked on their first read if all the units
all consumed."
Fixes#2692.
* 'bdenes/restricting_mutation_reader-v4' of https://github.com/denesb/scylla:
Update reader restriction related metrics
Add restricted_reader_test unit test
restricted_mutation_reader: restrict based-on memory consumption
mutation_reader.hh: Move restricted_reader related code
Dirty memory manager for non-system column families was being used
when applying mutations to system cfs.
That previously lead to deadlock when updating history. Basically,
write disable waits on compaction, and compaction waits on a write
that would release dirty memory for updating compaction history.
Only using the correct dirty manager wouldn't solve this problem
if write is disabled for system cf, but the problem is completely
solved in addition to previous change which updates history
outside the sstable lock.
Refs #2769.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-3-raphaelsc@scylladb.com>
The reason to do that is because compaction can deadlock if refresh
disables write which waits for compaction, and compaction in turn
waits for dirty memory[1] that would be released by memtable write.
Dirty memory manager for non-system cfs was being used for system cfs,
which was useful for exposing this problem.
[1]: when updating compaction history.
Fixes#2769.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170918215238.9810-2-raphaelsc@scylladb.com>
Restrict readers based on their memory consumption, instead of the count
of the top-level readers. To do this an interposer is installed at the
input_stream level which tracks buffers emmited by the stream. This way
we can have an accurate picture of the readers' actual memory
consumption.
New readers will consume 16k units from the semaphore up-front. This is
to account their own memory-consumption, apart from the buffers they
will allocate. Creating the reader will be deferred to when there are
enough resources to create it. As before only new readers will be
blocked on an exhausted semaphore, existing readers can continue to
work.