"This series introduces partial support for range deletions. This allows
deletion operations such as
delete from cf where p=1 and c > 0 and c <= 3.
This series only adds support for single-column range restrictions.
We enforce that both range bounds be specified, because we can't represent
infinite bounds in the current sstable format. Such bounds are represented
as a prefix with no components, with the bound_kind informing whether they
are a bottom or top bound.
We're currently unable to serialize an infinite bound in such a way that it
would be correctly interpreted by Cassandra 2.2.x. A serialized bound is a
composite with a (<length><value><EOC>)+ format. While we could technically
represent the bottom bound, the top bound, if written as a single component
with 0 bytes in size and some EOC, would always sort before other values.
The same would happen if represented as an empty (no components) composite,
because in Cassandra 2.2.x those always have EOC = NONE.
This limitation should stay in place until we can properly represent range
tombstones in the storage format."
* 'range-deletions/v2' of https://github.com/duarten/scylla:
mutation: Set cell using clustering_key_prefix
mutation_partition: Harmonize apply_delete overloads
prefix_compound_view_wrapper: Add is_full and is_empty functions
tests/cql_query_test: Add range deletion tests
cql3: Partially support ranged deletions
single_column_primary_key_restrictions: Implement has_bound()
modification_statement: Use statement_restrictions for where clause
statement_restrictions: Expose primary key restrictions
to_string: Add missing include
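The (<length><value><EOC>)+ layout mentioned in the cover letter can be sketched as follows. This is a minimal illustration, not the code from the series: the 2-byte big-endian length, the eoc enum and both function names are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical end-of-component markers; names are illustrative only.
enum class eoc : int8_t { start = -1, none = 0, end = 1 };

// Serializes one component as <length><value><EOC>, assuming a 2-byte
// big-endian length prefix.
std::vector<uint8_t> serialize_component(const std::string& value, eoc marker) {
    assert(value.size() <= 0xFFFF);
    std::vector<uint8_t> out;
    out.push_back(static_cast<uint8_t>(value.size() >> 8));
    out.push_back(static_cast<uint8_t>(value.size() & 0xFF));
    out.insert(out.end(), value.begin(), value.end());
    out.push_back(static_cast<uint8_t>(marker));
    return out;
}

// Bytewise comparison of component values. A value that is a prefix of
// another sorts first, so an empty value sorts before any non-empty one.
int compare_values(const std::string& a, const std::string& b) {
    int c = a.compare(b);
    return c < 0 ? -1 : (c > 0 ? 1 : 0);
}
```

Under these assumptions, a zero-length component compares lower than every non-empty value whatever its EOC byte is, which is why it cannot stand in for a top bound.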
Change the clustering key argument in mutation::set_cell from
exploded_clustering_prefix to clustering_key_prefix, which allows for
some overall code simplification and fewer copies. This mostly affects
the cql3 layer.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
After 'compaction: make major compaction go through compaction manager',
the test fails because the task is preempted in debug mode before it reaches
the instruction that increases the stat.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170501183255.6191-1-raphaelsc@scylladb.com>
Currently, fully expired sstable[1] is unconditionally chosen for compaction
by DTCS, but that may lead to a compaction loop under certain conditions.
Let's consider that an almost expired sstable is compacted, and it's not
deleted yet, and that the new sstable becomes expired before its ancestor is
deleted.
Because this new sstable is expired, it will be chosen by DTCS, but it will
not be purged because 'compacted undeleted' sstables are taken into account
when calculating the max purgeable timestamp, which prevents expired data from
being purged. The problem is that this sequence of events can keep happening
forever, as reported in issue #2260.
NOTE: This problem was easier to reproduce before the improvement to compaction
of expired cells, because a fully expired sstable was being converted into an
sstable full of tombstones, which is also considered fully expired.
Fixes #2260.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428233554.13744-1-raphaelsc@scylladb.com>
"The logic responsible for converting counter updates to counter shards was
not covered by unit tests and didn't transform counter cells inside static
rows.
This series fixes the problem and makes sure that the tests cover both
static rows and transformation logic."
* tag 'pdziepak/static-counter-updates/v1' of github.com:cloudius-systems/seastar-dev:
tests/counter: test transform_counter_updates_to_shards
tests/counter: test static columns
counters: transform static rows from updates to shards
"Fixes #2326."
* 'tgrabiec/fix-range-tombstones-missing-when-slicing' of github.com:cloudius-systems/seastar-dev:
tests: mutation_source_test: Cover single-ranged queries in test_streamed_mutation_slicing_returns_only_relevant_tombstones()
tests: mutation_source_test: Add test for slicing of clustered rows
tests: mutation_reader_assertions: Log expectations
tests: mutation_reader_assertions: Add produces_eos_or_empty_mutation()
tests: sstables: Use read_row() for single-key reads
tests: sstables: Test more configurations of sstable writer in test_sstable_conforms_to_mutation_source()
sstables: Improve logging
sstables: index_reader: Fix advance_to() to include relevant range tombstones
So that as_mutation_reader() will create the same kind of reader which
database::make_sstable_reader() does.
Before this change, all readers were range readers.
Test different versions of the format, and different promoted index
block sizes. The size of 1 is especially important, it will put each
fragment in a separate block, exposing various issues with promoted
index handling.
This patch replaces the current row tombstone representation by a
row_tombstone.
The intent of the patch is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be.
We need to distinguish shadowable from non-shadowable row tombstones
to support scenarios such as the following, when inserting into a table
with a materialized view:
1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1
2. delete from base using timestamp 2 where p = 3
3. insert into base (p, v1) values (3, 1) using timestamp 3
These should yield a view row where v2 is definitely null, but with
the current implementation, v2 will pop back with its value v2=3@TS=1,
even though it's dead in the base row. This is because the row
tombstone inserted in step 2 is a shadowable one.
This patch only addresses the memory representation of such
row_tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
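The difference between the two kinds of row tombstone in the scenario above can be sketched with a timestamp-only model. This is an illustration under assumptions, not the actual row_tombstone class: row_tombstone_sketch, effective() and cell_is_live() are hypothetical names, and deletion times are ignored.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

using timestamp_t = int64_t;
constexpr timestamp_t missing_timestamp = std::numeric_limits<timestamp_t>::min();

// Hypothetical model: a row tombstone with a regular and a shadowable part.
struct row_tombstone_sketch {
    timestamp_t regular = missing_timestamp;
    timestamp_t shadowable = missing_timestamp;

    // A shadowable tombstone stops applying once the row marker is newer
    // than it; a regular tombstone always applies.
    timestamp_t effective(timestamp_t row_marker) const {
        if (row_marker > shadowable) {
            return regular;
        }
        return std::max(regular, shadowable);
    }
};

// A cell is live if it is newer than the tombstone that still applies.
bool cell_is_live(timestamp_t cell_ts, const row_tombstone_sketch& t, timestamp_t row_marker) {
    return cell_ts > t.effective(row_marker);
}
```

With the timestamps from the example (v2 written at 1, deletion at 2, re-insert at 3), a shadowable tombstone at 2 lets the cell written at 1 become live again, while a regular tombstone at 2 keeps it dead.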
Some gcc versions incorrectly complain:
tests/log_histogram_test.cc:87:22: error: ‘opts1’ is not a valid template argument for type ‘const log_histogram_options&’ because object ‘opts1’ has not external linkage
size_t hist_key<node<opts1>>(const node<opts1>& n) { return n.v; }
Apparently this is a bug in gcc:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52036
Fixes #2307.
Message-Id: <1493108791-11247-1-git-send-email-tgrabiec@scylladb.com>
"This series fixes some more errors found by clang, with the aim of enabling
clang/zapcc as a supported compiler. A single issue remains, but it's
probably in std::experimental::optional::swap(), not in our code."
* tag 'clang/2/v1' of https://github.com/avikivity/scylla:
sstable_test: avoid passing negative non-type template arguments to unsigned parameters
UUID: add more comparison operators
sstable_datafile_test: avoid string_view user-defined literal conversion operator
mutation_source_test: avoid template function without template keyword
cql_query_test: define static variable
cql_query_test: add braces for single-item collection initializers
storage_service: don't use typeid(temporary)
logalloc: remove unused max_occupancy_for_compaction
storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic
storage_proxy: drop unused member access from return value
storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare
read_repair_decision: fix operator<<(std::ostream&, ...)
Every lsa-allocated object is prefixed by a header that contains information
needed to free or migrate it. This includes its size (for freeing) and
an 8-byte migrator (for migrating). Together with some flags, the overhead
is 14 bytes (16 bytes if the default alignment is used).
This patch reduces the header size to 1 byte (8 bytes if the default alignment
is used). It uses the following techniques:
- ULEB128-like encoding (actually more like ULEB64) so a live object's header
can typically be stored using 1 byte
- indirection, so that migrators can be encoded in a small index pointing
to a migrator table, rather than using an 8-byte pointer; this exploits
the fact that only a small number of types are stored in LSA
- moving the responsibility for determining an object's size to its
migrator, rather than storing it in the header; this exploits the fact
that the migrator stores type information, and object size is in fact
information about the type
The patch improves the results of memory_footprint_test as follows:
Before:
- in cache: 976
- in memtable: 947
After:
mutation footprint:
- in cache: 880
- in memtable: 858
A reduction of about 10%. Further reductions are possible by reducing the
alignment of lsa objects.
logalloc_test was adjusted to free more objects, since with the lower
footprint, rounding errors (to full segments) are different and caused
false errors to be detected.
Missing: adjustments to scylla-gdb.py; will be done after we agree on the
new descriptor's format.
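For reference, the variable-length idea behind the new header can be sketched with plain ULEB128. This is the standard encoding, not necessarily the exact layout the patch uses (the message itself calls it only ULEB128-like); the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Standard ULEB128: 7 payload bits per byte, high bit set on every byte
// except the last. Values below 128 therefore take a single byte.
std::vector<uint8_t> uleb128_encode(uint64_t value) {
    std::vector<uint8_t> out;
    do {
        uint8_t byte = value & 0x7F;
        value >>= 7;
        if (value != 0) {
            byte |= 0x80;  // more bytes follow
        }
        out.push_back(byte);
    } while (value != 0);
    return out;
}

// Decodes one ULEB128 value from the start of `bytes`.
uint64_t uleb128_decode(const std::vector<uint8_t>& bytes) {
    uint64_t value = 0;
    unsigned shift = 0;
    for (uint8_t byte : bytes) {
        assert(shift < 64);
        value |= static_cast<uint64_t>(byte & 0x7F) << shift;
        shift += 7;
        if ((byte & 0x80) == 0) {
            break;  // last byte of this value
        }
    }
    return value;
}
```

A small migrator index fits in one byte this way, which matches the 1-byte typical header size quoted above.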
This patch fixes a failure of virtual_reader_test, where both the test
itself and the cql_test_env initialize the messaging_service to listen
on the same address and port, triggering an assert in
posix_ap_server_socket_impl::accept().
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170423104240.21275-1-duarte@scylladb.com>
An sstable with a large token span may find its way into a high level due to
resharding, which means the strategy invariant is broken. The invariant is
restored by compacting the first set of overlapping sstables, meaning that the
restoration is done incrementally for multiple overlapping sets.
The invariant is restored by regular compaction after resharding puts new
unshared sstables into their original level, where level > 0.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"Currently eviction is performed until occupancy of the whole region
drops below the 85% threshold. This may take a while if region had
high occupancy and is large. We could improve the situation by only
evicting until occupancy of the sparsest segment drops below the
threshold, as is done by this change.
I tested this using a c-s read workload in which the condition
triggers in the cache region, with 1G per shard:
lsa-timing - Reclamation cycle took 12.934 us.
lsa-timing - Reclamation cycle took 47.771 us.
lsa-timing - Reclamation cycle took 125.946 us.
lsa-timing - Reclamation cycle took 144356 us.
lsa-timing - Reclamation cycle took 655.765 us.
lsa-timing - Reclamation cycle took 693.418 us.
lsa-timing - Reclamation cycle took 509.869 us.
lsa-timing - Reclamation cycle took 1139.15 us.
The 144ms pause is when large eviction is necessary.
Statistics for reclamation pauses for a read workload over
larger-than-memory data set:
Before:
avg = 865.796362
stdev = 10253.498038
min = 93.891000
max = 264078.000000
sum = 574022.988000
samples = 663
After:
avg = 513.685650
stdev = 275.270157
min = 212.286000
max = 1089.670000
sum = 340573.586000
samples = 663
Refs #1634."
* tag 'tgrabiec/lsa-reduce-reclaim-latency-v3' of github.com:cloudius-systems/seastar-dev:
lsa: Reduce reclamation latency
tests: Add test for log_histogram
log_histogram: Allow non-power-of-two minimum values
lsa: Use regular compaction threshold in on-idle compaction
tests: row_cache_test: Induce update failure more reliably
lsa: Add getter for region's eviction function
Message-Id: <1484730859-11969-1-git-send-email-tgrabiec@scylladb.com>
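The change in the stopping condition for eviction can be sketched as two predicates. This is a simplified model, not the LSA code: segment_sketch and both function names are hypothetical, and only the 85% threshold comes from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

constexpr double eviction_threshold = 0.85;

// Hypothetical stand-in for an LSA segment.
struct segment_sketch {
    size_t used;
    size_t capacity;
};

double occupancy(const segment_sketch& s) {
    assert(s.capacity > 0);
    return static_cast<double>(s.used) / static_cast<double>(s.capacity);
}

// Old condition: keep evicting while the whole region is at or above the threshold.
bool region_needs_eviction(const std::vector<segment_sketch>& segments) {
    size_t used = 0;
    size_t capacity = 0;
    for (const auto& s : segments) {
        used += s.used;
        capacity += s.capacity;
    }
    return capacity != 0 && static_cast<double>(used) / static_cast<double>(capacity) >= eviction_threshold;
}

// New condition: keep evicting only while even the sparsest segment is at
// or above the threshold.
bool sparsest_segment_needs_eviction(const std::vector<segment_sketch>& segments) {
    if (segments.empty()) {
        return false;
    }
    auto sparsest = std::min_element(segments.begin(), segments.end(),
        [](const segment_sketch& a, const segment_sketch& b) { return occupancy(a) < occupancy(b); });
    return occupancy(*sparsest) >= eviction_threshold;
}
```

For a region with nine full segments and one half-empty segment, the old predicate still asks for eviction (95% overall), while the new one stops because the sparsest segment is already below the threshold.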
After changing the region evictability condition to be less strict, cache
update stopped failing because reclamation was able to compact the dense
region. Induce failure by installing an evictor which refuses to evict
from cache beyond a few elements.
There are instantiations of binary_search() used in sstables.cc, but
defined in partition.cc. The instantiations are explicitly declared in
partition.cc, but the types changed and the declarations became obsolete.
This still worked because partition.cc also instantiated it with the right
type. Once that code is removed, it no longer will, and we would get a
linker error. To avoid such problems, define binary_search() in a header.
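A header-defined template avoids the problem because every translation unit that includes it instantiates the function with whatever types it needs. A minimal sketch follows; the signature, the name binary_search_sketch and the return convention are assumptions, not the actual sstables code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Header-style definition: no explicit instantiations are needed elsewhere.
// `cmp(entry, key)` returns <0, 0 or >0. Returns the index of the match,
// or -(insertion_point) - 1 when the key is absent (assumed convention).
template <typename T, typename Key, typename Compare>
inline long binary_search_sketch(const std::vector<T>& entries, const Key& key, Compare cmp) {
    long low = 0;
    long high = static_cast<long>(entries.size()) - 1;
    while (low <= high) {
        long mid = low + (high - low) / 2;
        int c = cmp(entries[static_cast<size_t>(mid)], key);
        if (c < 0) {
            low = mid + 1;
        } else if (c > 0) {
            high = mid - 1;
        } else {
            return mid;
        }
    }
    return -(low + 1);
}
```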
Streaming generates lots of small sstables with large token ranges,
which triggers O(N^2) space usage in the interval map.
Level 0 sstables will now be stored in a structure that has O(N)
space complexity and which will be included for every read.
Fixes #2287.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
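The idea of keeping level 0 apart can be sketched as below. This is a simplified model under assumptions: sstable_desc and sstable_set_sketch are hypothetical, tokens are plain integers, and a linear scan stands in for the interval map used for the higher levels.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sstable descriptor.
struct sstable_desc {
    int id;
    int level;
    int64_t first_token;
    int64_t last_token;
};

class sstable_set_sketch {
    std::vector<sstable_desc> _level0;   // flat list: O(N) space, used by every read
    std::vector<sstable_desc> _leveled;  // stand-in for the interval map
public:
    void insert(const sstable_desc& s) {
        assert(s.first_token <= s.last_token);
        (s.level == 0 ? _level0 : _leveled).push_back(s);
    }

    // Returns ids of the sstables a read of `token` has to consult.
    std::vector<int> select(int64_t token) const {
        std::vector<int> ids;
        for (const auto& s : _level0) {
            ids.push_back(s.id);  // level 0 is always included, no range check
        }
        for (const auto& s : _leveled) {
            if (s.first_token <= token && token <= s.last_token) {
                ids.push_back(s.id);
            }
        }
        return ids;
    }
};
```

The trade-off is that a read may consult a level 0 sstable whose range does not contain the token, in exchange for not storing wide ranges in the interval map.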
The blocked task detector introduced in
113ed9e963 was seeing
the initialization phase of perf_sstable as a blocked
task.
Transform this part of the code into a futurized loop
to make the blocked task detector happy.
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170413132506.17806-1-benoit@scylladb.com>
When compacting a fully expired sstable, we're not allowing that sstable
to be purged because an expired cell is *unconditionally* converted into a
dead cell. Why not check instead whether the expired cell can be purged,
using gc before and max purgeable timestamp?
Currently, we need two compactions to get rid of a fully expired sstable
whose cells could have been purged all along.
Look at this sstable with an expired cell:
{
"partition" : {
"key" : [ "2" ],
"position" : 0
},
"rows" : [
{
"type" : "row",
"position" : 120,
"liveness_info" : { "tstamp" : "2017-04-09T17:07:12.702597Z",
"ttl" : 20, "expires_at" : "2017-04-09T17:07:32Z", "expired" : true },
"cells" : [
{ "name" : "country", "value" : "1" },
]
Now this sstable's data after the first compaction:
[shard 0] compaction - Compacted 1 sstables to [...]. 120 bytes to 79
(~65% of original) in 229ms = 0.000328997MB/s.
{
...
"rows" : [
{
"type" : "row",
"position" : 79,
"cells" : [
{ "name" : "country", "deletion_info" :
{ "local_delete_time" : "2017-04-09T17:07:12Z" },
"tstamp" : "2017-04-09T17:07:12.702597Z"
},
]
Now another compaction will actually get rid of the data:
compaction - Compacted 1 sstables to []. 79 bytes to 0 (~0% of original)
in 1ms = 0MB/s. ~2 total partitions merged to 0
NOTE:
It's a waste of time to wait for a second compaction because the expired
cell could have been purged in the first compaction, since it satisfied
gc_before and max purgeable timestamp.
Fixes #2249, #2253
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170413001049.9663-1-raphaelsc@scylladb.com>
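The check proposed above can be sketched as a single decision function. This is an illustration under assumptions, not the compaction code: the types and names are hypothetical, times are plain integers, and the deletion time of an expired cell is taken as expiry minus TTL, which is consistent with the local_delete_time shown in the dump above.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of an expiring cell; expiry and ttl are in seconds.
struct expiring_cell {
    int64_t timestamp;
    int64_t expiry;
    int64_t ttl;
};

enum class compaction_outcome { keep_live, convert_to_dead, purge };

// Decides what compaction does with an expiring cell. An expired cell is
// purged right away when it is old enough for both gc_before and the max
// purgeable timestamp; otherwise it becomes a dead cell.
compaction_outcome compact_expiring_cell(const expiring_cell& c, int64_t now,
                                         int64_t gc_before, int64_t max_purgeable_timestamp) {
    assert(c.ttl >= 0);
    if (c.expiry > now) {
        return compaction_outcome::keep_live;
    }
    const int64_t deletion_time = c.expiry - c.ttl;  // assumed convention
    if (deletion_time < gc_before && c.timestamp < max_purgeable_timestamp) {
        return compaction_outcome::purge;
    }
    return compaction_outcome::convert_to_dead;
}
```

With this check the first compaction can already drop the cell, instead of writing a dead cell that only a second compaction removes.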
"sstable_streamed_mutation::fast_forward_to() is changed to use promoted index
(via index_reader) to optimize skipping in large partitions.
In addition to that, sstable mutation_reader is changed to use the index
to skip to the next partition.
Performance impact was evaluated using the newly added tests/perf/perf_fast_forward.
What's beyond this series:
- Using index_reader for single-partition reads as well
- Using index_reader for skipping across ranges in clustering restrictions"
* tag 'tgrabiec/skip-within-partition-using-index-v2' of github.com:cloudius-systems/seastar-dev: (47 commits)
tests: Add performance test for fast forwarding of sstable readers
tests: Allow starting cql_test_env on pre-existing data
config: Allow specifying source when setting value
tests: sstable: Add test for fast forwarding within partition using index
sstables: sstable_streamed_mutation: use index in fast_forward_to()
sstables: Store parsed promoted index in index_entry
sstables: Add trace-level logging for sstable consumption
sstables: Define deletion_time earlier
sstables: Make parsing throw exception on malformed promoted index
tests: Add tests for ordering of position_in_partition relative to composites
position_range: Introduce all_clustered_rows() factory method
position_in_partition: Introduce for_key()/after_key() factory methods
position_in_partition: Add factory methods for positions around all rows
position_in_partition: Introduce for_range_start()/for_range_end()
position_in_partition: Fix friendship declaration
keys: Introduce is_empty() for prefixes
position_in_partition: Make comparable with composites
types: Enhance lexicographical comparators
compound_compat: Accept marker value in serialize_value()
compound_compat: Add trichotomic comparator
...
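The skipping enabled by the promoted index in this series can be sketched as a search over per-block key ranges. This is a simplified model, not index_reader: index_block and offset_for() are hypothetical, and clustering keys are modelled as integers.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical promoted index entry: the key range covered by one block of
// a large partition and the block's offset in the data file.
struct index_block {
    int64_t first_key;
    int64_t last_key;
    uint64_t offset;
};

// Returns the offset of the first block that may contain keys >= target,
// or end_offset when every block ends before the target. Blocks are
// assumed to be sorted and non-overlapping.
uint64_t offset_for(const std::vector<index_block>& blocks, int64_t target, uint64_t end_offset) {
    auto it = std::lower_bound(blocks.begin(), blocks.end(), target,
        [](const index_block& b, int64_t key) { return b.last_key < key; });
    return it == blocks.end() ? end_offset : it->offset;
}
```

A reader that fast-forwards to a key can then seek straight to the returned offset rather than parsing every row before it.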
Quick introduction to level starvation:
high levels may be left uncompacted (thus starved) for a long time if the user
does something that leaves them with little data, such as cleanup or a change
of max sstable size (default 160M). Leveled strategy handles this problem as
follows: consider we're compacting L1 to L2. If L3 is starved, we look for one
of its sstables that is fully contained in the token range of the L1->L2
candidates, so that we won't end up with an overlap in L2.
Now the problem:
the functionality isn't working properly because the range of candidates is
being incorrectly calculated due to an accident when converting the code to
C++. It won't cause an overlap because it's actually being more restrictive
about which sstable from the starved level can be used.
A test case was added to confirm the problem.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170328223753.15398-1-raphaelsc@scylladb.com>
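The intended selection from a starved level can be sketched as follows. This is an illustration under assumptions, not the leveled manifest code: sstable_range and pick_from_starved_level() are hypothetical, tokens are plain integers, and -1 stands for "nothing suitable".

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical sstable descriptor with its token range.
struct sstable_range {
    int id;
    int64_t first_token;
    int64_t last_token;
};

// Computes the token range spanned by all compaction candidates and returns
// the id of the first sstable in the starved level that lies fully inside
// it, or -1 if there is none.
int pick_from_starved_level(const std::vector<sstable_range>& candidates,
                            const std::vector<sstable_range>& starved_level) {
    if (candidates.empty()) {
        return -1;
    }
    int64_t min_token = candidates.front().first_token;
    int64_t max_token = candidates.front().last_token;
    for (const auto& c : candidates) {
        min_token = std::min(min_token, c.first_token);
        max_token = std::max(max_token, c.last_token);
    }
    for (const auto& s : starved_level) {
        if (s.first_token >= min_token && s.last_token <= max_token) {
            return s.id;
        }
    }
    return -1;
}
```

Computing the span too narrowly, as the bug described above effectively does, only makes this check reject more sstables; it cannot introduce an overlap.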