With strict mode, it could happen that a sstable alone in level 0 is
selected for offstrategy compaction, which means that we could run
into an infinite reshape process.
This is fixed by respecting the offstrategy threshold. Unit test is
added.
Fixes#8573.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210506181324.49636-1-raphaelsc@scylladb.com>
The timestamp_type is an int64_t. So, it has to be explicitly
initialized before using it.
This missing inicialization prevented the major compactation
from happening when a time window finishes, as described in #8569.
Fixes#8569
Signed-off-by: Lauro Ramos Venancio <lauro.venancio@incognia.com>
Closes#8590
Using integer division lose accuracy by rounding down the result.
Each time we calculate:
```
auto total_size = bucket.size() * old_average_size;
auto new_average_size = (total_size + size) / (bucket.size() + 1);
```
We accumulate the rounding error.
total_size might be too small since old_average_size was previously
rounded down, and then new_average_size is rounded down again.
Rather than trying to compensate for the rounding errors
by e.g. adding size / 2 to the dividend, simply keep the average
as a double precision number.
Note that we multiply old_average_size by options.bucket_{low,high},
that are double precision too so the size comparisons
are already using FP instructions implicitly.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since now it became a reference used to update the bucket's average size
after a new sstable is inserted into it.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Since the sstables are sorted in increasing size order
there is no need to consider all buckets to find a matching one.
Instead, just consider the most recently inserted bucket.
Once we see a sstable size outside the allowed range for this bucket,
create a new bucket and consider this one for the next sstable.
Note, `old_average_size` should be renamed since this change
turns it into a reference and it's assigned with the new average_size.
This patch keeps the old name to reduce the churn. The following
patch will do only the rename.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The checksum sink carries another sink on board and forwards
the put buffers lower, so there's no point in making these
two have different buffer sizes. This is what really happens
now, but this change makes this more explicit and makes the
checksumming code conform to the new output stream API.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change just moves the place from which the output_stream
knows the compression::uncompressed_chunk_length() value.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The error message currently complains about "invalid version" and later
says the reason is that the path is not recognized. This is confusing so
change the error message to start with "invalid path" instead. It is the
path that is invalid not the version after all.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210429092749.52659-1-bdenes@scylladb.com>
The range passed to create_sharding_metadata is supposed to be owned or
at least partially owned by the shard. Log keys, range and split
ranges for debugging if the range does not belong to the shard.
This is helpful for debugging "Failed to generate sharding
metadata for foo.db" issues reported.
Refs #7056Closes#8557
Close both the _index_reader and _context, if they are engaged.
Warn and ignore any erros from close as it may be called either
from the destructor or from f_m_r close.
Call close() for closing in the background if needed when destroyed
and warn about.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We'd like that to simplify the soon-to-be-introduced
sstable_mutation_reader::close error handling path.
close_index_list can be marked noexcept since parallel_for_each is,
with that index_reader::close can be marked noexcept too.
Note that since reader close can not fail
both lower and upper bounds are closed (since
closing lower_bound cannot fail).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
sstable cells are parsed into temporary_buffers, which causes large contiguous allocations for some cells.
This is fixed by storing fragments of the cell value in a fragmented_temporary_buffer instead.
To achieve this, this patch also adds new methods to the fragmented_temporary_buffer(size(), ostream& operator<<()) and adds methods to the underlying parser(primitive_consumer) for parsing byte strings into fragmented buffers.
Fixes#7457Fixes#6376Closes#8182
* github.com:scylladb/scylla:
primitive_consumer: keep fragments of parsed buffer in a small_vector
sstables: add parsing of cell values into fragmented buffers
sstables: add non-contiguous parsing of byte strings to the primitive_consumer
utils: add ostream operator<<() for fragmented_temporary_buffer::view
compound_type: extend serialize_value for all FragmentedView types
"
Memory usage is considerably reduced by making reshape switch to partitioned set,
given that input sstables are disjoint. This will benefit reshape for all
strategies, not only TWCS.
Write amplification is reduced a lot by compacting all input sstables at once,
which is possible given that unbounded memory usage is fixed too.
With both these issues fixed, TWCS reshape will be much more efficient.
tests: mode(dev).
"
* 'twcs_reshape_fixes' of github.com:raphaelsc/scylla:
tests: sstables: Check that TWCS is able to reshape disjoint sstables efficiently
TWCS: Reshape all sstables in a time window at once if they're disjoint
sstables: Extract code to count amount of overlapping into a function
LCS: reshape: Fix overlapping check when determining if a sstable set is disjoint
compaction: Make reshape compaction always use partitioned_sstable_set
compaction: Allow a compaction type to override the sstable_set for input sstables
With repair-based operations, each window will have 256 disjoint
sstables due to data segregation which produces N sstables for each
vnode range, where N = # of existing windows. So each window ends up
with one sstable per vnode range = 256.
Given that reshape now unconditionally uses partitioned set's incremental
selector, all the 256 sstables can be compacted at once as compaction
essentially becomes a copy operation, where only one sstable will be
opened at a time, making its memory usage very efficient.
By compacting all sstables at once, write amplification is a lot
reduced because each byte is now only rewritten once.
Previously, with the initial set of 256 sstables, write amp could be
up to 8, which makes reshape for TWCS very slow.
Refs #8449.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This function will be reused by TWCS reshape when checking if all
sstables in a window are disjoint and can be all compacted together.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Wrong comparison operator is used when checking for overlapping. It
would miss overlapping when last key of a sstable is equal to the first
key of another sstable that comes next in the set, which is sorted by
first key.
Fixes#8531.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Reduce rebuilds and build time by removing unnecessary includes. Along the way,
improve header sanity.
Ref #1.
Test: dev-headers, unit(dev).
Closes#8524
* github.com:scylladb/scylla:
treewide: remove inclusions of storage_proxy.hh from headers
storage_proxy: unnest coordinator_query_result
treewide: make headers self-sufficient
utils: intrusive_btree: add missing #pragma once
Make sure that sstable::unlink will never fail.
It will terminate in the unlikely case toc_filename
throws (e,g, on bad_alloc), otherwise it ignores any other error
and juts warns about it.
Make unlink a coroutine to simplify the implementation
without introducing additional allocations.
Note that remove_by_toc_name and maybe_delete_large_data_entries
are executed asynchronously and concurrently.
Waiting for them to finish is serialized by co_await,
making sure that both are being waited on so not to leave
abandoned futures behind.
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20210420135020.102733-1-bhalevy@scylladb.com>
Reshape compaction potentially works with disjoint sstables, so it will
benefit a lot from using partitioned_sstable_set, which is able to
incrementally open the disjoint sstables. Without it, all sstables are
opened at once, which means unbounded memory usage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
By default, compaction will pick a implementation of sstable_set as
defined by the underlying compaction strategy.
However, reshape compaction potentially works with disjoint sstables
and will benefit a lot from always using partitioned set.
For example, when reshaping a TWCS table, it's better to use the
partitioned set rather than the time window set, as the former will
be much more memory efficient by incrementally selecting sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
When a particular partition exists in at least one sstable, the cache
expects any single-partition query to this partition to return a `partition_start`
fragment, even if the result is empty.
In `time_series_sstable_set::create_single_key_sstable_reader` it could
happen that all sstables containing data for the given query get
filtered out and only sstables without the relevant partition are left,
resulting in a reader which immediately returns end-of-stream (while it
should return a `partition_start` and if not in forwarding mode, a
`partition_end`). This commit fixes that.
We do it by extending the reader queue (used by the clustering reader
merger) with a `dummy_reader` which will be returned by the queue as
the very first reader. This reader only emits a `partition_start` and,
if not in forwarding mode, a `partition_end` fragment.
Fixes#8447.
Closes#8448
Procedure is rewritten using std::partition, making it easier to
maintain and it also fixes a theoretical quadratic behavior because
list is entirely copied when extending it, which isn't harmful
because maintenance set will be rarely populated and there are only
2 sets at most.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210409171412.57729-1-raphaelsc@scylladb.com>
The filter passed to `min_position_reader_queue`, which was used by
`clustering_order_reader_merger`, would incorrectly include sstables as
soon as they passed through the PK (bloom) filter, and would include
sstables which didn't pass the PK filter (if they passed the CK
filter). Fortunately this wouldn't cause incorrect data to be returned,
but it would cause sstables to be opened unnecessarily (these sstables
would immediately return eof), resulting in a performance drop. This commit
fixes the filter and adds a regression test which uses statistics to
check how many times the CK filter was invoked.
Fixes#8432.
Closes#8433
compound set isn't overriding create_single_key_sstable_reader(), so
default implementation is always called. Although default impl will
provide correct behavior, specialized ones which provides better perf,
which currently is only available for TWCS, were being ignored.
compound set impl of single key reader will basically combine single key
readers of all sets managed by it.
Fixes#8415.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210406205009.75020-1-raphaelsc@scylladb.com>
When we want to parse a linearized buffer of bytes, we're copying them
into the first and only element of the _read_bytes vector. Thus
_read_bytes often contains only one element, which makes a small_vector
a better alternative.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
The entire sstable cell value is currently stored in a single
temporary_buffer. Cells may be very large, so to avoid large
contiguous allocations, the buffer is changed to
a fragmented_temporary_buffer.
Fixes#7457Fixes#6376
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Currently, the primitive_consumer parses all values in contiguous buffers.
A string of bytes may be very long, so parsing it in a single buffer
can cause a big allocation. This patch allows parsing into
fragmented_temporary_buffers instead of temporary_buffers.
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
In order to maintain backward compatibility wrt. cluster features,
two boolean variables were kept in sstable writers:
- correctly_serialize_non_compound_range_tombstones
- correctly_serialize_static_compact_in_mc
Since these features are assumed to always be present now,
the above variables are no longer needed and can be purged.
LCS reshape is currently inefficient for repair-based operation, because
the disjoint run of 256 sstables is reshaped into bigger L0 files, which
will be then integrated into the main sstable set.
On reshape completion, LCS has to compact those big L0 files onto higher
levels, until last level is reached, producing bad write amplification.
A much better approach is to instead compact that disjoint run into the
best possible level L, which can be figured out with:
log (base fan_out) of (total_size / max_sstable_size)
This compaction will be essentially a copy operation. It's important
to do it rather than only mutating the level of sstables because we have
to reshape the input run according to LCS parameters like sstable size.
For repair-based bootstrap/replace, the input disjoint run is now efficiently
reshaped into an ideal level L, so there's no compaction backlog once
reshape completes.
This behavior will manifest in the log as this:
LeveledManifest - Reshaping 256 disjoint sstables in level 0 into level 2
For repair-based decommission/removenode though, which reshape wasn't
wired on yet, level L may temporarily hold 2 disjoint runs, which overlap
one another, but LCS itself will incrementally merge them through either
promotion of L-1 into L, or by detecting overlapping in level L and
merging the overlapping sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210329171826.42873-1-raphaelsc@scylladb.com>
"
When a permit is destroyed we check if it still holds on to any
resources in the destructor. Any resources the permit still holds on are
leaked resources, as users should have released these. Currently we just
invoke `on_internal_error_noexcept()` to handle this, which -- depending
on the configuration -- will result in an error message or an assert. In
the former case, the resources will be leaked for good. This mini-series
fixes this, by signaling back these resources to the semaphore. This
helps avoid an eventual complete dry-up of all semaphore resources and a
subsequent complete shutdown of reads.
Tests: unit(release, debug)
"
* 'reader-permit-signal-leaked-resources/v1' of https://github.com/denesb/scylla:
reader_permit: signal leaked resources
test: test_reader_lifecycle_policy: keep semaphores alive until all ops cease
sstables: generate_summary(): extend the lifecycle of the reader concurrency semaphore
Constructors of classes inherited from file_impl copy alignment
values by hands, but miss the overwrite one, thus on a new file
it remains default-initialized.
To fix this and not to forget to properly initalize future fields
from file_impl, use the impl's copy constructor.
tests: unit(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210325104830.31923-1-xemul@scylladb.com>
LCS reshape is basically 'major compacting' level 0 until it contains less than
N sstables.
That produces terrible write amplification, because any given byte will be
compacted (initial # of sstables / max_threshold (32)) times. So if L0 initially
contained 256 ssts, there would be a WA of about 8.
This terrible write amplification can be reduced by performing STCS instead on
L0, which will leave L0 in a good shape without hurting WA as it happens
now.
Fixes#8345.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210322150655.27011-1-raphaelsc@scylladb.com>
Row marker has a cell name which sorts after the row tombstone's start
bound. The old code was writing the marker first, then the row
tombstone, which is incorrect.
This was harmeless to our sstable reader, which recognized both as
belonging to the current clustering row fragment, and collects both
fine.
However, if both atoms trigger creation of promoted index blocks, the
writer will create a promoted index with entries wich violate the cell
name ordering. It's very unlikely to run into in practice, since to
trigger promoted index entries for both atoms, the clustering key
would be so large so that the size of the marker cell exceeds the
desired promoted index block size, which is 64KB by default (but
user-controlled via column_index_size_in_kb option). 64KB is also the
limit on clustering key size accepted by the system.
This was caught by one of our unit tests:
sstable_conforms_to_mutation_source_test
...which runs a battery of mutation reader tests with various
desired promoted index block sizes, including the target size of 1
byte, which triggers an entry for every atom.
The test started to fail for some random seeds after commit ecb6abe
inside the
test_streamed_mutation_forwarding_is_consistent_with_slicing test
case, reporting a mutation mismatch in the following line:
assert_that(*sliced_m).is_equal_to(*fwd_m, slice_with_ranges.row_ranges(*m.schema(), m.key()));
It compares mutations read from the same sstable using different
methods, slicing using clustering key restricitons, and fast
forwarding. The reported mismatch was that fwd_m contained the row
marker, but sliced_m did not. The sstable does contain the marker, so
both reads should return it.
After reverting the commit which introduced dynamic adjustments, the
test passes, but both mutations are missing the marker, both are
wrong!
They are wrong because the promoted index contians entries whose
starting positions violate the ordering, so binary search gets confused
and selects the row tombstone's position, which is emitted after the
marker, thus skipping over the row marker.
The explanation for why the test started to fail after dynamic
adjustements is the following. The promoted index cursor works by
incrementally parsing buffers fed by the file input stream. It first
parses the whole block and then does a binary search within the parsed
array. The entries which cursor touches during binary search depend on
the size of the block read from the file. The commit which enabled
dynamic adjustements causes the block size to be different for
subsequent reads, which allows one of the reads to walk over the
corrupted entries and read the correct data by selecting the entry
corresponding to the row marker.
Fixes#8324
Message-Id: <20210322235812.1042137-1-tgrabiec@scylladb.com>
"
Scylla suffers with aggressive compaction after repair-based operation has initiated. That translates into bad latency and slowness for the operation itself.
This aggressiveness comes from the fact that:
1) new sstables are immediately added to the compaction backlog, so reducing bandwidth available for the operation.
2) new sstables are in bad shape when integrated into the main sstable set, not conforming to the strategy invariant.
To solve this problem, new sstables will be incrementally reshaped, off the compaction strategy, until finally integrated into the main set.
The solution takes advantage there's only one sstable per vnode range, meaning sstables generated by repair-based operations are disjoint.
NOTE: off-strategy for repair-based decommission and removenode will follow this series and require little work as the infrastructure is introduced in this series.
Refs #5226.
"
* 'offstrategy_v7' of github.com:raphaelsc/scylla:
tests: Add unit test for off-strategy sstable compaction
table: Wire up off-strategy compaction on repair-based bootstrap and replace
table: extend add_sstable_and_update_cache() for off-strategy
sstables/compaction_manager: Add function to submit off-strategy work
table: Introduce off-strategy compaction on maintenance sstable set
table: change build_new_sstable_list() to accept other sstable sets
table: change non_staging_sstables() to filter out off-strategy sstables
table: Introduce maintenance sstable set
table: Wire compound sstable set
table: prepare make_reader_excluding_sstables() to work with compound sstable set
table: prepare discard_sstables() to work with compound sstable set
table: extract add_sstable() common code into a function
sstable_set: Introduce compound sstable set
reshape: STCS: preserve token contiguity when reshaping disjoint sstables
enable_if is hard to understand, especially its error messages. Convert
enable_if in sstable code to concepts.
A new concept is introduced, self_describing, for the case of a type
that follows the obj.describe_type() protocol. Otherwise this is quite
straightforward.
Closes#8315
* github.com:scylladb/scylla:
sstables: vector write: convert to concepts
sstables: check_truncated_and_assign: convert to concept
sstables: convert write() to concepts
sstables: convert write_vint() to concepts
sstables: vector parse(): convert to concept
sstables: convert parse() for a self-describing type to concept
sstables: read_vint(): convert enable_if to concepts
sstables: add concept for self-describing type
We have an integral and a non-integral overload, each constrained
with enable_if. We use std::integral to constrain the integral
overload and leave the other unconstrained, as C++ will choose the
more constrained version when applicable.