After change in boot, read_filter is called by distributed loader,
so its update to _filter_file_size is lost. The load variant
which receives foreign components that must do it. We were also
not updating it for newly created sstables.
Fixes#2449.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170606151129.5477-1-raphaelsc@scylladb.com>
(cherry picked from commit 0ca1e5cca3)
"fixes a problem in which memory requirement for loading in-memory
components of sstables is very high due to unlimited parallelism."
* 'mem_requirement_sstable_load_v2_2' of github.com:raphaelsc/scylla:
database: fix indentation of distributed_loader::open_sstable
database: reduce memory requirement to load sstables
sstables: loads components for a sstable in parallel
sstables: enable read ahead for read of in-memory components
sstables: make random_access_reader work with read ahead
(cherry picked from commit ef428d008c)
From http://github.com/avikivity/scylla exponential-sharder/v3.
The sharder, which takes a range of tokens and splits it among shards, is
slow with large shard count and the default
murmur3_partitioner_ignore_msb_bits.
This patchset fixes excessive iteration in sstable sharding metadata writer and
nonsignular range scans.
Without this patchset, sealing a memtable takes > 60 ms on a 48-shard
system. With the patchset, it drops below the latency tracker threshold I
used (5 ms).
Fixes#2392.
(cherry picked from commit 84648f73ef)
Currently, fully expired sstable[1] is unconditionally chosen for compaction
by DTCS, but that may lead to a compaction loop under certain conditions.
Let's consider that an almost expired sstable is compacted, and it's not
deleted yet, and that the new sstable becomes expired before its ancestor is
deleted.
Because this new sstable is expired, it will be chosen by DTCS, but it will
not be purged because 'compacted undeleted' sstables are taken into account
by calculation of max purgeable timestamp and prevents expired data from
being purged. The problem is that this sequence of events can keep happening
forever as reported by issue #2260.
NOTE: This problem was easier to reproduce before improvement on compaction
of expired cells, because fully expired sstable was being converted into a
sstable full of tombstones, which is also considered fully expired.
Fixes#2260.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170428233554.13744-1-raphaelsc@scylladb.com>
(cherry picked from commit 687a4bb0c2)
Range tombstones are serialized to cell names in this place:
_sst.maybe_flush_pi_block(_out, start, {});
Note that the column set is empty. This is correct. A range tombstone
only has a clustering part. The cell name is deserialized by promoted
index reader using mp_row_consumer::column, like this:
mp_row_consumer::column col(schema, std::move(col_name),
api::max_timestamp); return std::move(col.clustering);
The problem is, column constructor assumes that there is always a
component corresponding to a cell name if the table is not dense, and
will pop it from the set of components (the clustering field):
, cell(!schema.is_dense() ? pop_back(clustering) : (*(schema.regular_begin())).name())
promoted index block which starts or ends with a range tombstone will
appear as having incorrect bounds. This may result in an incorrect
value for data file range start to be calculated.
Fixes#2327.
Suppose the promoted index looks like this:
block0: start=1 end=2
block1: start=4 end=5
start and end are cell names of the first and last cell in the block.
If there is a range tombstone covering [2,3], it will be only in
block0, because it is no longer in effect when block1 starts. However,
slicing the index for [3, +inf], which intersects with the tombstone,
will yield block1. That's because the slicing looks for a block with
an end which is greater than or equal to the start of the slice:
if (!found_range_start) {
if (!range_start || cmp(range_start->value(), end_ck) <= 0) {
range_start_pos = ie.position() + offset;
We should take into account that any given block may actually contain
information for anything up to the start of the next block, so instead
of using end_ck, effectively use next block's start_ck (exclusive).
Fixes#2326.
Problem is that column family field of task wasn't being set for resharding,
so column family wasn't being properly removed from compaction manager.
In addition to fixing this issue, we'll also interrupt ongoing compactions
when dropping a column family, exactly like we do with shutdown.
Fixes#2291.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170418125807.7712-1-raphaelsc@scylladb.com>
(cherry picked from commit e78db43b79)
streaming generates lots of small sstables with large token range,
which triggers O(N^2) in space in interval map.
level 0 sstables will now be stored in a structure that has O(N)
in space complexity and which will be included for every read.
Fixes#2287.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20170417185509.6633-1-raphaelsc@scylladb.com>
(cherry picked from commit 11b74050a1)
The skip() implementation for the compressed file input stream incorrectly
handled the case of skipping to the end of file: In that case we just need
to update the file pointer, but not skip anywhere in the compressed disk
file; In particular, we must NOT call locate() to find the relevant on-disk
compressed chunk, because there is none - locate() can only be called on
actual positions of bytes, not on the one-past-end-of-file position.
Fixes#2143
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20170308100057.23316-1-nyh@scylladb.com>
(cherry picked from commit 506e074ba4)
Failing to close a file properly before destroying file's object causes
crashes.
[tgrabiec: fixed typo]
Fixes#2122.
Message-Id: <20170221144858.GG11471@scylladb.com>
(cherry picked from commit 0977f4fdf8)
continuous_data_consumer::fast_forward_to() returns a future which was
later ignored by data_consume_context::fast_forward_to().
With the current implementation, the future in question is always ready
and that's why the problem didn't manifest itself in the form of crashes
or invalid results.
Message-Id: <20170120105746.7300-1-pdziepak@scylladb.com>
Add a boolean to short circuit the read path on empty range
hoping for some speedup.
tested in read write with cs using:
cl=QUORUM duration=1m -mode native cql3 -rate threads=700 -node localhost
Will do some additional benchmark.
Fixes#1056
Signed-off-by: Benoît Canet <benoit@scylladb.com>
Message-Id: <20170118194451.16836-1-benoit@scylladb.com>
close() operation is like a destructor, it cannot fail. It just
reports errors, but close itself succeeds. So we should proceed with
the closing even if it fails.
Message-Id: <1484245886-7269-1-git-send-email-tgrabiec@scylladb.com>
"Intended to reduce memory usage when resharding by sharing sstable
components among shards. File descriptors are also shared from now
on, meaning that a much smaller number of file descriptors will be
used during resharding.
Fixes #1951."
branch 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla
* 'excessive_memory_usage_v4' of github.com:raphaelsc/scylla:
db: avoid excessive memory usage during resharding
checked_file_impl: add support to dup
sstables: group sstable components that can be shared among shards
sstables: rename sstable member
After resharding, sstables may be owned by all shards, which
means that file descriptors and memory usage for metadata will
increase by a factor equal to number of shards. That can easily
lead to OOM.
SSTable components are immutable, so they can be stored in one
shard and shared with others that need it. We use the following
formula to decide which shard will open the sstable and share
it with the others: (generation % smp::count), which is the
inverse of how we calculate generation for new sstables.
So if no resharding is performed, everything is shard-local.
With this approach, resource usage due to loaded sstables will
be evenly distributed among shards.
For this approach to work, we now only populate keyspaces from
shard 0. It's now the sole responsible for iterating through
column family dirs. In addition, most of population functions
are now free and take distributed database object as parameter.
Fixes#1951.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Noone overrides file_writer::write() so there is no reason to inhibit
optimisations and cause compiler to emit indirect calls.
Message-Id: <20170104163618.26251-1-pdziepak@scylladb.com>
We intend to share immutable sstable components among shards to
reduce excessive memory usage when resharding shared sstables.
This change is about grouping those components into a structure,
and using foreign ptr to make sure that the structure will be
deleted by whichever shard created it.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Rename _components to _recognized_components because _components
will be used to name a field with shareable components.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
file output streams take the responsibility of closing the file, they
will close the file as part of closing the stream.
During sstable writing we create sstable object and keep file
references there as well. Sstable object also has responsibility for
closing the files, and does so from sstable::~sstable().
Double close was supposed to be avoided by a construct like this:
writer.close().get();
_file = {};
However if close() failed, which can happen when write-ahead failed,
_file would not be cleared, and both the writer and sstable would
close the file. This will result in a crash in
append_challenged_posix_file_impl::close(), which is not prepared to
be closed twice.
Another problem is that if exception happened before we reached that
construct, we still should close the writer. Currently we don't, so
there's no double close on the file, but that's a bug which needs to
be fixed and once that's fixed double close on _file will be even more
likely.
The fix employed here is to not keep files inside sstable object when
writing. As soon as the writer is constructed, it's the only owner of
the file.
Fixes#1764.
Message-Id: <1482428648-22553-1-git-send-email-tgrabiec@scylladb.com>
* 'tgrabiec/calculate-hash-once-compaction' of github.com:cloudius-systems/seastar-dev:
sstables: Calculate key hash only once during compaction
tests: sstables: Add more test cases to tombstone_purge_test
db: Expose column_family::add_sstable
tests: sstables: Ensure timestamps are increasing
tests: sstables: Simplify tombstone_purge_test
Refs #1943.
* 'tgrabiec/optimize-bloom-filter' of github.com:cloudius-systems/seastar-dev:
db: Compute key hash once in partition_presence_checker
bloom_filter: Allow checking presence using pre-hashed key
db: Use incremental selector in partition_presence_checker
"Function to calculate maximum purgeable timestamp is made 10 times faster when
compacting sstables overlap with 10% of all sstables.
That's possible with an incremental selector that will incrementally select
sstables based on key being compacted.
Currently, we iterate through all non-compacting sstables and consult their
bloom filter to determine max purgeable timestamp, and that will be very
expensive for compactions that are frequently deciding whether or not to purge
tombstones."
* 'filter_overhead_fix_v4' of github.com:raphaelsc/scylla:
compaction: reduce bloom filter overhead with incremental selector
tests: add test for sstable set's incremental selector
sstable_set: introduce incremental selector
compatible_ring_position: add function to return token
The procedure to calculate max purgeable timestamp is optimized
by only visiting sstables that overlap with key being currently
compacted. That's done using incremental sstable selector.
Function to calculate maximum purgeable timestamp is made 10 times
faster when compacting sstables overlap with 10% of all sstables.
Fixes#1322.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Incrementally select sstables from sstable set using token
in ascending order.
For leveled strategy, it returns all sstables that belong
to current interval. For other strategies, it just return
all sstables from the set.
Useful for compaction which needs all sstables that overlap
with key being currently compacted to calculate maximum
purgeable timestamp.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The Cassandra derived sstable tools (and likely Cassandra itself) object to
a new sub-component in the Statistics component; create a new Scylla
component instead to host this data.
Allow declaring discriminated unions (with an enum type as the
discriminant and any sstable serializable type as a value) and sets
of these unions, with the disciminant as the key. Parsers and writers
are auto-generated.
The information (last compacted keys) is lost after node is restarted
or schema is updated, which causes strategy to be rebuilt.
We need it for strategy to guarantee uniform distribution of token
range across sstables, or we could end up with 1 sstable of level L
overlapping with lots of sstables of level L+1, and that results in
a compaction of undesired length.
That information can be generated from scratch by getting last key
of newest sstable in each level > 0.
Fixes#1906.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <35ebd15977d5a8418239febb160c796cdc0e98fa.1480533805.git.raphaelsc@scylladb.com>