Test different versions of the format, and different promoted index
block sizes. A size of 1 is especially important: it puts each
fragment in a separate block, exposing various issues with promoted
index handling.
After 4742008b70, _read_partial_row is
never set, and we will fail here if the consumer exhausts the
range. That happens when the end bound of the slice aligns
with the end of the index page.
Fix by assuming that if we're out of range in the middle of a partition,
we sliced.
Message-Id: <1493121249-18847-1-git-send-email-tgrabiec@scylladb.com>
"This series introduces the row_tombstone class, which represents a
tombstone applied to a clustering row. It distinguishes itself from a
normal tombstone by the fact that it contains a regular tombstone and
a shadowable one, which can be erased by a row marker.
The intent of the series is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be, leading to incorrect results."
* 'materialized-views/shadowable/v5' of https://github.com/duarten/scylla:
sstables: Read and write shadowable tombstones
mutation_partion: Use row_tombstone
mutation_partion: Introduce row_tombstone
mutation_partition: Introduce shadowable tombstones
idl-compiler: Support optional fields in views
tombstone: Extract out relational operators
row_marker: Mark constructors explicit
This patch replaces the current row tombstone representation by a
row_tombstone.
The intent of the patch is thus to reify the idea of shadowable
tombstones, that up until now we considered all materialized view row
tombstones to be.
We need to distinguish shadowable from non-shadowable row tombstones
to support scenarios such as, when inserting to a table with a
materialized view:
1. insert into base (p, v1, v2) values (3, 1, 3) using timestamp 1
2. delete from base using timestamp 2 where p = 3
3. insert into base (p, v1) values (3, 1) using timestamp 3
These should yield a view row where v2 is definitely null, but with
the current implementation, v2 will pop back with its value v2=3@TS=1,
even though it's dead in the base row. This is because the row
tombstone inserted at 2) is a shadowable one.
This patch only addresses the memory representation of such
row_tombstones.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch introduces the row_tombstone class, which represents a
tombstone made up of a regular tombstone and a shadowable one.
The rules for row_tombstones are as follows:
- The shadowable tombstone is always >= the regular one;
- The regular tombstone works as expected;
- The shadowable tombstone doesn't erase or compact away the regular
row tombstone, nor dead cells;
- The shadowable tombstone can erase live cells, but only provided
they can be recovered (e.g., by including all cells in a MV update,
both updated cells and pre-existing ones);
- The shadowable tombstone can be erased or compacted away by a newer
row marker.
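The rules above can be illustrated with a minimal sketch. It is not the class
added by this patch: tombstones are reduced to bare timestamps, and every name
in it (row_tombstone_sketch, apply_row_marker, deletes_cell) is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical, simplified model: a tombstone is only a timestamp, and
// no_tombstone_ts stands for "no tombstone". The real class carries more state.
constexpr int64_t no_tombstone_ts = std::numeric_limits<int64_t>::min();

class row_tombstone_sketch {
    int64_t _regular = no_tombstone_ts;
    int64_t _shadowable = no_tombstone_ts; // invariant: _shadowable >= _regular
public:
    void apply_regular(int64_t ts) {
        _regular = std::max(_regular, ts);
        // Keep the invariant: the shadowable part never trails the regular one.
        _shadowable = std::max(_shadowable, _regular);
    }
    void apply_shadowable(int64_t ts) {
        _shadowable = std::max(_shadowable, ts);
    }
    // A row marker newer than the shadowable tombstone shadows it, leaving
    // only the regular tombstone in effect.
    void apply_row_marker(int64_t marker_ts) {
        if (marker_ts > _shadowable) {
            _shadowable = _regular;
        }
    }
    int64_t regular() const { return _regular; }
    int64_t effective() const { return _shadowable; }
    bool deletes_cell(int64_t cell_ts) const {
        return effective() != no_tombstone_ts && cell_ts <= effective();
    }
};
```

In this sketch, a shadowable tombstone at timestamp 2 stops covering a cell
written at timestamp 1 once a row marker at timestamp 3 is applied, while a
regular tombstone at timestamp 2 keeps covering it.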
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
A shadowable tombstone is a tombstone that can be replaced by a
smaller one if provided a row_marker with a bigger timestamp than the
shadowable tombstone.
In the context of a row, it is only valid as long as no newer insert
is done (thus setting a live row marker; note that if the row
timestamp set is lower than the tombstone's, then the tombstone
remains in effect as usual). If a row has a shadowable tombstone with
timestamp Ti and that row is updated with a timestamp Tj, such that
Tj > Ti (and that update sets the row marker), then the shadowable
tombstone is shadowed by that update. A concrete consequence is that
if the update has cells with timestamp lower than Ti, then those cells
are preserved (since the deletion is removed), and this is contrary to
a regular, non-shadowable row tombstone where the tombstone is
preserved and such cells are removed.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
This patch extracts out the relational operators in struct tombstone
to a class capable of generating them from a tri-compare function.
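One common way to generate relational operators from a tri-compare function
is a CRTP base class, sketched below. The names are hypothetical and the
helper introduced by this patch may look different; the sketch assumes the
derived type exposes compare() returning a negative, zero or positive int.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: derives all six relational operators from T::compare().
template <typename T>
struct with_relational_operators_sketch {
    friend bool operator<(const T& a, const T& b)  { return a.compare(b) < 0; }
    friend bool operator<=(const T& a, const T& b) { return a.compare(b) <= 0; }
    friend bool operator>(const T& a, const T& b)  { return a.compare(b) > 0; }
    friend bool operator>=(const T& a, const T& b) { return a.compare(b) >= 0; }
    friend bool operator==(const T& a, const T& b) { return a.compare(b) == 0; }
    friend bool operator!=(const T& a, const T& b) { return a.compare(b) != 0; }
};

// Simplified stand-in for struct tombstone: ordered by timestamp first,
// then by deletion time.
struct simple_tombstone_sketch : with_relational_operators_sketch<simple_tombstone_sketch> {
    int64_t timestamp;
    int32_t deletion_time;
    simple_tombstone_sketch(int64_t ts, int32_t dt) : timestamp(ts), deletion_time(dt) {}
    int compare(const simple_tombstone_sketch& o) const {
        if (timestamp != o.timestamp) {
            return timestamp < o.timestamp ? -1 : 1;
        }
        if (deletion_time != o.deletion_time) {
            return deletion_time < o.deletion_time ? -1 : 1;
        }
        return 0;
    }
};
```

The derived type only has to define compare(); the comparison logic then
lives in one place instead of being repeated per operator.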
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Some gcc versions incorrectly complain:
tests/log_histogram_test.cc:87:22: error: ‘opts1’ is not a valid template argument for type ‘const log_histogram_options&’ because object ‘opts1’ has not external linkage
size_t hist_key<node<opts1>>(const node<opts1>& n) { return n.v; }
Apparently this is a bug in gcc:
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=52036
Fixes #2307.
Message-Id: <1493108791-11247-1-git-send-email-tgrabiec@scylladb.com>
"This series fixes some more errors found by clang, with the aim of enabling
clang/zapcc as a supported compiler. A single issue remains, but it's
probably in std::experimental::optional::swap(); not in our code."
* tag 'clang/2/v1' of https://github.com/avikivity/scylla:
sstable_test: avoid passing negative non-type template arguments to unsigned parameters
UUID: add more comparison operators
sstable_datafile_test: avoid string_view user-defined literal conversion operator
mutation_source_test: avoid template function without template keyword
cql_query_test: define static variable
cql_query_test: add braces for single-item collection initializers
storage_service: don't use typeid(temporary)
logalloc: remove unused max_occupancy_for_compaction
storage_proxy: drop overzealous use of __int128_t in recently-modified-no-read-repair logic
storage_proxy: drop unused member access from return value
storage_proxy: fix reference bound to temporary in data_read_resolver::less_compare
read_repair_decision: fix operator<<(std::ostream&, ...)
Fixes the following error in "scylla segment-descs" and a similar one in "scylla lsa-segment":
Traceback (most recent call last):
File "scylla-gdb.py", line 530, in invoke
gdb.write('0x%x: lsa free=%d region=0x%x zone=0x%x\n' % (addr, desc['_free_space'], desc['_region'], desc['_zone']))
TypeError: %x format: an integer is required, not gdb.Value
Message-Id: <1493029465-6482-1-git-send-email-tgrabiec@scylladb.com>
Every lsa-allocated object is prefixed by a header that contains information
needed to free or migrate it. This includes its size (for freeing) and
an 8-byte migrator (for migrating). Together with some flags, the overhead
is 14 bytes (16 bytes if the default alignment is used).
This patch reduces the header size to 1 byte (8 bytes if the default alignment
is used). It uses the following techniques:
- ULEB128-like encoding (actually more like ULEB64) so a live object's header
can typically be stored using 1 byte
- indirection, so that migrators can be encoded in a small index pointing
to a migrator table, rather than using an 8-byte pointer; this exploits
the fact that only a small number of types are stored in LSA
- moving the responsibility for determining an object's size to its
migrator, rather than storing it in the header; this exploits the fact
that the migrator stores type information, and object size is in fact
information about the type
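To show why small values fit in one byte, here is a sketch of generic
unsigned LEB128 encoding. It is not the exact header layout used by this
patch (which also packs flags), and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Writes value as 7-bit groups, least significant group first. The high bit
// of each byte is set when more bytes follow. Returns the number of bytes
// written (at most 10 for a 64-bit value).
inline std::size_t uleb_encode_sketch(uint64_t value, uint8_t* out) {
    std::size_t n = 0;
    do {
        uint8_t byte = value & 0x7f;
        value >>= 7;
        if (value != 0) {
            byte |= 0x80; // continuation bit
        }
        out[n++] = byte;
    } while (value != 0);
    return n;
}

// Decodes a value written by uleb_encode_sketch and reports how many bytes
// were consumed. Assumes well-formed input of at most 10 bytes.
inline uint64_t uleb_decode_sketch(const uint8_t* in, std::size_t* consumed) {
    uint64_t value = 0;
    unsigned shift = 0;
    std::size_t n = 0;
    uint8_t byte;
    do {
        byte = in[n++];
        value |= static_cast<uint64_t>(byte & 0x7f) << shift;
        shift += 7;
    } while (byte & 0x80);
    *consumed = n;
    return value;
}
```

Any value below 128, such as a small migrator index, takes a single byte in
this scheme; larger values grow by one byte per additional 7 bits.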
The patch improves the results of memory_footprint_test as follows:
Before:
- in cache: 976
- in memtable: 947
After:
mutation footprint:
- in cache: 880
- in memtable: 858
A reduction of about 10%. Further reductions are possible by reducing the
alignment of lsa objects.
logalloc_test was adjusted to free more objects, since with the lower
footprint, rounding errors (to full segments) are different and caused
false errors to be detected.
Missing: adjustments to scylla-gdb.py; will be done after we agree on the
new descriptor's format.
We both move names_ to its destination, and call names_.size() in the same
expression; this has unspecified evaluation order, and fails with clang.
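The pattern and one way to avoid it can be sketched as follows. The types and
function names are hypothetical; only names_ comes from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical destination type that takes ownership of the names.
struct metadata_sketch {
    std::vector<std::string> names;
    std::size_t count;
    metadata_sketch(std::vector<std::string> n, std::size_t c)
        : names(std::move(n)), count(c) {}
};

// Problematic form (not compiled here):
//     return metadata_sketch(std::move(names_), names_.size());
// The order in which the two arguments are evaluated is unspecified. If the
// by-value parameter is move-constructed first, size() reads a moved-from
// vector.
inline metadata_sketch make_metadata_sketch(std::vector<std::string> names_) {
    const std::size_t count = names_.size(); // read the size before moving
    return metadata_sketch(std::move(names_), count);
}
```

Reading the size into a local first gives the two operations a defined order
regardless of the compiler.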
With this patch as well as the clang build fixes, Scylla starts and is
able to serve requests (light cassandra-stress load).
Message-Id: <20170423121727.1948-1-avi@scylladb.com>
This patch fixes a failure of virtual_reader_test, where both the test
itself and the cql_test_env initialize the messaging_service to listen
on the same address and port, triggering an assert in
posix_ap_server_socket_impl::accept().
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20170423104240.21275-1-duarte@scylladb.com>
Clang warns that the expression will be evaluated (doh). While the warning
seems dubious, keep it and change the code to call the function outside
typeid(), in case it does help someone one day.
Clang's std::abs() doesn't support __int128_t, so use __int64_t instead. With
this change, it's possible that a read repair 252,700 years after a write
will be interpreted as a recent write and the read repair will incorrectly
be skipped; hopefully by that time __int128_t will be standardized.
Argument-dependent lookup requires that the operator be declared in the
same namespace as the class; move it there.
While at it, de-static it; static only causes bloat.
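A minimal sketch of the lookup rule, using hypothetical namespaces and a
stand-in enum rather than the real read_repair_decision definition:

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

namespace db_sketch {

// Stand-in for the real type; the enumerators are assumptions.
enum class decision_sketch { NONE, GLOBAL, DC_LOCAL };

// Declared in the same namespace as the type, so argument-dependent lookup
// finds it from any namespace. It is inline rather than static, so a header
// does not produce one copy per translation unit.
inline std::ostream& operator<<(std::ostream& os, decision_sketch d) {
    switch (d) {
    case decision_sketch::NONE:     return os << "NONE";
    case decision_sketch::GLOBAL:   return os << "GLOBAL";
    case decision_sketch::DC_LOCAL: return os << "DC_LOCAL";
    }
    return os << "UNKNOWN";
}

} // namespace db_sketch

namespace other_sketch {

struct unrelated {};

// This overload hides any global operator<< during ordinary unqualified
// lookup from inside other_sketch, which is why relying on a global
// declaration is fragile.
inline std::ostream& operator<<(std::ostream& os, const unrelated&) {
    return os << "unrelated";
}

inline std::string describe(db_sketch::decision_sketch d) {
    std::ostringstream ss;
    ss << d; // resolved through ADL into db_sketch
    return ss.str();
}

} // namespace other_sketch
```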
"Currently, a shared sstable is rewritten at all shards it belongs to, and only
after that, it's deleted.
This new algorithm adds the ability to reshard a set of sstables together at a
single shard and produce unshared sstable for all shards involved.
That's important for the leveled compaction strategy issue, in which the number
of sstables grew considerably after resharding. What happened is that every
sstable was split into N, so we could end up with tons of small
sstables. Now, we will reshard a set of adjacent sstables together."
* 'sstable_resharding_revamp_v9' of github.com:raphaelsc/scylla:
tests: add test for new sstable resharding
database: kill column_family::start_rewrite
database: wire up new resharding algorithm
database: implement new sstable resharding algorithm
database: introduce function to replace new sstables by their ancestors
prevent regular compaction from choosing shared sstables
compaction_strategy: implement resharding strategy for compaction strategies
sstables: store more info in foreign_sstable_open_info
sstables: make it possible to get open info from loaded sstable
database: export column family dir
database: inform if column family has shared tables
sstables: add method to export ancestors
lcs: implement get_level_count
compaction_manager: introduce method to check if manager stopped
lcs: restore invariant instead of sending overlapping sst to L0
sstables: extend compaction for new resharding
sstables: allow shard A to correctly create sstable for shard B
compaction: rework compacting_sstable_writer to work with multiple writers
compaction: prepare compacting_sstable_writer to work with writers
sstables: rework compaction to make it easy to extend
NOTE: it's not wired yet.
Currently, a shared sstable is rewritten at all shards it belongs
to and only after that, it's deleted. With this new algorithm, a
shared sstable will be read only once and N unshared sstables
will be created, each of them with 1/N of the data. After it's
done, each owner shard will receive its new unshared sstable
replacing its ancestors.
Another benefit is that resharding will no longer cause the number of
sstables to grow considerably.
A full-sized leveled sstable is usually 160MB, so after resharding,
we could have N files of 160MB/N. Now, leveled strategy will help
resharding. N adjacent sstables of same level will be resharded
together, so we'll end up with N files of N*160MB/N.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When resharding, we're working with sstables from all shards. So let's say
we're done with resharding of sstable A that belongs to shard 0 and 1 and
sstable B that belongs to shard 1 and 2. SStables were generated for
shards 0, 1, and 2. So shards 0, 1, and 2 need to load the new sstables
and remove the ancestors. Shard 1 for example will remove sstables A and
B (ancestors) and add the new one. That is where this new function comes in.
We'll forward new sstables to their target shards using foreign sstable
open info.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
For new resharding, it's important to exclude resharding sstables
from the list of candidates for regular compaction. That doesn't
affect current resharding because it marks the sstables as
compacting. That won't work with new resharding which will work
with sstables from multiple shards.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Strategies other than leveled will reshard one shared sstable at
a time, and the target shard (the shard on which the job runs) for each
job will be chosen in a round-robin fashion.
For leveled strategy, we will reshard together smp::count adjacent
sstables that belong to same level.
The reason is that resharding one sstable at a time
may result in the creation of a file for each shard, meaning that after
resharding we could end up with NO_SSTABLES*NO_SHARDS files.
These resharding strategies will be used for our new resharding
algorithm.
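The two strategies can be sketched as a single planning function with a
batch size: 1 for non-leveled strategies, smp::count for leveled. This is an
illustration only. The names are hypothetical, sstables are reduced to
integer ids assumed to be sorted by token within one level, and choosing the
target shard round-robin for leveled batches is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical job description: which sstables to reshard together, and the
// shard on which the job runs.
struct resharding_job_sketch {
    std::vector<int> sstables;
    unsigned target_shard;
};

// Groups `batch` adjacent sstables per job and assigns target shards in a
// round-robin fashion. Requires shard_count > 0 and batch > 0.
inline std::vector<resharding_job_sketch>
plan_resharding_sketch(const std::vector<int>& sstables, unsigned shard_count, std::size_t batch) {
    std::vector<resharding_job_sketch> jobs;
    unsigned next_shard = 0;
    for (std::size_t i = 0; i < sstables.size(); i += batch) {
        const std::size_t end = std::min(i + batch, sstables.size());
        resharding_job_sketch job;
        job.sstables.assign(sstables.begin() + i, sstables.begin() + end);
        job.target_shard = next_shard;
        next_shard = (next_shard + 1) % shard_count;
        jobs.push_back(std::move(job));
    }
    return jobs;
}
```

With 5 sstables and 2 shards, a batch of 1 yields 5 jobs and a batch of 2
yields 3 jobs, so batching adjacent sstables reduces the number of jobs that
each fan out into one output file per shard.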
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
We need that info for opening a sstable at different shard, unlike
sstable loader which has everything in entry_descriptor, obtained
from components in sstable filename.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
It will be useful for resharding which will need to move a
sstable across shards, and to do that without reloading the
sstable at target shard, we need to be able to get the open
info and move it to the target shard instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's gonna be useful to quickly determine if it's worth resharding
a column family.
Reviewed-by: Nadav Har'El <nyh@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
An sstable with a large token span may find its way into a high level due to
resharding, which means the strategy invariant is broken. The invariant is
restored by compacting the first set of overlapping sstables, meaning that the
restoration is done incrementally for multiple overlapping sets.
Invariant is restored by regular compaction after resharding puts new unshared
sstables into their original level, where level > 0.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Extends compaction for new resharding algorithm. Not wired yet.
New resharding will compact shared sstable(s) and create one
sstable for each owner. It's up to the caller to open these
new unshared sstables at their respective column families.
This new approach will save a lot of bandwidth because we'll
no longer read the entire shared sstable #smp::count times.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>