"
Not emitting partition_end for a partition is incorrect. SStable
writer assumes that it is emitted. If it's not, the sstable will not
be written correctly. The partition index entry for the last partition
will be left partially written, which will result in errors during
reads. Also, statistics and sstable key ranges will not include the
last partition.
It's better to catch this problem at the time of writing, and not
generate bad sstables.
Another way of handling this would be to implicitly generate a
partition_end, but I don't think that we should do this. We cannot
trust the mutation stream when invariants are violated, we don't know
if this was really the last partition which was supposed to be
written. So it's safer to fail the write.
Enabled for both mc and la/ka.
Passing --abort-on-internal-error on the command line will switch to
aborting instead of throwing an exception.
The reason we don't abort by default is that it may bring the whole
cluster down and cause unavailability, while it may not be necessary
to do so. It's safer to fail just the affected operation,
e.g. repair. However, failing the operation with an exception leaves
little information for debugging the root cause. So the idea is that the
user would enable aborts on only one of the nodes in the cluster to
get a core dump and not bring the whole cluster down.
"
* 'catch-unclosed-partition-sstable-write' of https://github.com/tgrabiec/scylla:
sstables: writer: Validate that partition is closed when the input mutation stream ends
config, exceptions: Add helper for handling internal errors
utils: config_file: Introduce named_value::observe()
(cherry picked from commit 95c0804731)
(cherry picked from commit cf4c238b28)
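The validation and the throw-or-abort policy described above can be sketched roughly as follows. This is a minimal illustration, not the code from the series: abort_on_internal_error, on_internal_error() as written here, and partition_writer_sketch are hypothetical stand-ins for the real option, helper and sstable writer.

```cpp
#include <atomic>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the --abort-on-internal-error setting.
inline std::atomic<bool> abort_on_internal_error{false};

// Fail only the current operation by default; abort (to get a core dump)
// when the operator has opted in on this node.
[[noreturn]] inline void on_internal_error(const std::string& msg) {
    if (abort_on_internal_error.load()) {
        std::abort();
    }
    throw std::runtime_error(msg);
}

// Hypothetical writer that only tracks whether a partition is still open.
class partition_writer_sketch {
    bool _partition_open = false;
    unsigned _partitions_written = 0;
public:
    void consume_partition_start() { _partition_open = true; }
    void consume_partition_end() {
        _partition_open = false;
        ++_partitions_written;
    }
    // Called when the input mutation stream ends.
    void consume_end_of_stream() {
        if (_partition_open) {
            // Do not guess a partition_end: the stream cannot be trusted.
            on_internal_error("mutation stream ended without partition_end");
        }
    }
    unsigned partitions_written() const { return _partitions_written; }
};
```

With the flag left at its default, an unclosed partition fails the write with an exception instead of producing a bad sstable.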
allocate_segment() can fail even though we're not out of memory, when
it's invoked inside an allocating section with the cache region
locked. That section may later succeed when retried after memory
reclamation.
We should ignore bad_alloc thrown inside allocating section body and
fail only when the whole section fails.
Fixes #2924
Message-Id: <1550597493-22500-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit dafe22dd83)
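A rough sketch of the retry idea, assuming a simplified model: run_allocating_section() and the reclaim callback are hypothetical names, and the real allocating section is more involved. A bad_alloc from the body is treated as transient until the last attempt has failed.

```cpp
#include <cstddef>
#include <functional>
#include <new>

// Hypothetical retry wrapper. reclaim() stands in for the memory
// reclamation that happens between attempts.
template <typename Body>
auto run_allocating_section(Body&& body,
                            const std::function<void()>& reclaim,
                            std::size_t max_attempts = 3) {
    for (std::size_t attempt = 1; ; ++attempt) {
        try {
            return body();
        } catch (const std::bad_alloc&) {
            if (attempt == max_attempts) {
                throw; // the whole section failed: only now is it an error
            }
            reclaim(); // transient failure inside the body: reclaim and retry
        }
    }
}
```

A body that fails once and succeeds after reclamation therefore completes normally, and only a section that exhausts its attempts reports the failure.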
compact_and_evict gets memory_to_release in bytes while
reclamation step is in segments.
Broken in f092decd90.
It doesn't make much difference with the current default step of 1
segment since we cannot reclaim less than that, so shouldn't cause
problems in practice.
Ref #4445
Message-Id: <1556013920-29676-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 21fbf59fa8)
When we start the LSA reclamation it can be that
segment_pool::_free_segments is 0 under some conditions and
segment_pool::_current_emergency_reserve_goal is set to 1. The
reclamation step is 1 segment, and compact_and_evict_locked() frees 1
segment back into the segment_pool. However,
segment_pool::reclaim_segments() doesn't free anything to the standard
allocator because the condition _free_segments >
_current_emergency_reserve_goal is false. As a result,
tracker::impl::reclaim() returns 0 as the amount of released memory,
tracker::reclaim() returns
memory::reclaiming_result::reclaimed_nothing and the seastar allocator
thinks it's a real OOM and throws std::bad_alloc.
The fix is to change compact_and_evict() to make sure that reserves
are met, by releasing more if they're not met at entry.
This change also allows us to drop the variant of allocate_segment()
which accepts the reclamation step as a means to refill reserves
faster. This is now not needed, because compact_and_evict() will look
at the reserve deficit to increase the amount of memory to reclaim.
Fixes #4445
Message-Id: <1555671713-16530-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit f092decd90)
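The arithmetic behind the fix can be illustrated with a small model. The function names and the segment size below are assumptions made for the example, not the actual LSA code; the point is that the reserve deficit is added to the request, and that the amount is expressed in bytes.

```cpp
#include <cstddef>

// Assumed segment size, for illustration only.
constexpr std::size_t segment_size = 128 * 1024;

// Segments the pool hands back to the standard allocator: only those
// above the emergency reserve goal.
constexpr std::size_t segments_released(std::size_t free_segments,
                                        std::size_t reserve_goal) {
    return free_segments > reserve_goal ? free_segments - reserve_goal : 0;
}

// Bytes a reclamation pass should free: the request plus whatever is
// missing from the reserve at entry.
constexpr std::size_t bytes_to_reclaim(std::size_t requested_bytes,
                                       std::size_t free_segments,
                                       std::size_t reserve_goal) {
    std::size_t deficit =
        reserve_goal > free_segments ? reserve_goal - free_segments : 0;
    return requested_bytes + deficit * segment_size;
}
```

With 0 free segments and a goal of 1, freeing just the 1-segment step leaves segments_released(1, 1) at 0, which is the false OOM described above; including the deficit frees 2 segments and releases 1.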
"
This series fixes a problem in the commitlog cycle() function that
confused in-memory and on-disk size of chunks it wrote to disk. The
former was used to decide how much data needs to be actually written,
and the latter was used to compute the offset of the next chunk. If two
chunk writes happened concurrently, the one positioned earlier in
the file could corrupt the header of the next one.
Fixes #4231.
Tests: unit(dev), dtest(commitlog_test.py:TestCommitLog.test_commitlog_replay_on_startup,test_commitlog_replay_with_alter_table)
"
* tag 'fix-commitlog-cycle/v1' of https://github.com/pdziepak/scylla:
commitlog: write the correct buffer size
utils/fragmented_temporary_buffer_view: add remove suffix
(cherry picked from commit d95dec22d9)
read_exactly(), when given a stream that does not contain the amount of data
requested, will loop endlessly, allocating more and more memory as it does, until
it fails with an exception (at which point it will release the memory).
Fix by returning an empty result, like input_stream::read_exactly() (which it
replaces). Add a test case that fails without the fix.
Affected callers are the native transport, commitlog replay, and internal
deserialization.
Fixes #4233.
Branches: master, branch-3.0
Tests: unit(dev)
Message-Id: <20190216150825.14841-1-avi@scylladb.com>
(cherry picked from commit 03531c2443)
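The intended behaviour on a short stream can be sketched with an in-memory model. chunk_stream is a hypothetical stand-in for an input stream and this read_exactly() is not the patched function; it only shows the loop terminating with an empty result when the data runs out.

```cpp
#include <algorithm>
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical in-memory stream: each element is one chunk, and an empty
// deque means end of stream.
using chunk_stream = std::deque<std::string>;

std::string read_exactly(chunk_stream& in, std::size_t n) {
    std::string out;
    while (out.size() < n) {
        if (in.empty()) {
            return {}; // stream ended early: empty result instead of looping
        }
        std::string& front = in.front();
        std::size_t take = std::min(n - out.size(), front.size());
        out.append(front, 0, take);
        front.erase(0, take);
        if (front.empty()) {
            in.pop_front();
        }
    }
    return out;
}
```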
gen_crc_combine_table is now executed on every build, so it should not
fail on unsupported archs. The generated file will not contain data,
but this is fine since it should not be used.
Another problem is that u32 and u64 aliases were not visible in the #else
branch in crc_combine.cc
Message-Id: <1543864425-5650-1-git-send-email-tgrabiec@scylladb.com>
(cherry picked from commit 9a4c00beb7)
"
zlib's crc32_combine() is not very efficient. It is faster to re-combine
the buffer using crc32(). It's still a substantial amount of work which
could be avoided.
This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.
Performance results using perf_checksum and second buffer of length 64 KiB:
zlib CRC32 combine: 38'851 ns
libdeflate CRC32: 4'797 ns
fast_crc32_combine(): 11 ns
So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.
Tested on i7-5960X CPU @ 3.00GHz
Performance was also evaluated using sstable writer benchmark:
perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
--value-size=10000 --rows 1000000 --datasets small-part
It yielded 9% improvement in median frag/s (129'055 vs 117'977).
Refs #3874
"
* tag 'fast-crc32-combine-v2' of github.com:tgrabiec/scylla:
tests: perf_checksum: Test fast_crc32_combine()
tests: Rename libdeflate_test to checksum_utils_test
tests: libdeflate: Add more tests for checksum_combine()
tests: libdeflate: Check both libdeflate and default checksummers
sstables: Use fast_crc_combine() in the default checksummer
utils/gz: Add fast implementation of crc32_combine()
utils/gz: Add pre-computed polynomials
utils/gz: Import Barett reduction implementation from libdeflate
utils: Extract clmul() from crc.hh
(cherry picked from commit b098b5b987)
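For readers unfamiliar with what combining checksums means, here is a naive sketch of the underlying identity. It is explicitly not the algorithm from this series (which uses carry-less multiplication and pre-computed polynomials); it runs in time linear in the length of the second buffer. The names crc32_bitwise and crc32_combine_naive are made up for the example.

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// One byte of the bitwise CRC-32 register update (reflected polynomial
// 0xEDB88320, the one used by zlib).
inline std::uint32_t crc32_raw_byte(std::uint32_t reg, std::uint8_t byte) {
    reg ^= byte;
    for (int bit = 0; bit < 8; ++bit) {
        reg = (reg >> 1) ^ (0xEDB88320u & (0u - (reg & 1u)));
    }
    return reg;
}

inline std::uint32_t crc32_bitwise(std::string_view data) {
    std::uint32_t reg = 0xFFFFFFFFu;
    for (unsigned char c : data) {
        reg = crc32_raw_byte(reg, c);
    }
    return reg ^ 0xFFFFFFFFu;
}

// crc(A || B) from crc(A), crc(B) and len(B): advance crc(A) over len(B)
// zero bytes, then xor in crc(B).
inline std::uint32_t crc32_combine_naive(std::uint32_t crc1,
                                         std::uint32_t crc2,
                                         std::size_t len2) {
    for (std::size_t i = 0; i < len2; ++i) {
        crc1 = crc32_raw_byte(crc1, 0);
    }
    return crc1 ^ crc2;
}
```

A fast implementation replaces the zero-byte loop with a multiplication by a power of x modulo the CRC polynomial, which is why its cost does not grow linearly with the buffer length.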
"
Tested with perf_fast_forward from:
github.com/tgrabiec/scylla.git perf_fast_forward-for-sst3-opt-write-v1
Using the following command line:
build/release/tests/perf/perf_fast_forward_g --populate --sstable-format=mc \
--data-directory /tmp/perf-mc --rows=10000000 -c1 -m4G \
--datasets small-part
The average reported flush throughput was (stdev for the averages is around 4k):
- for mc before the series: 367848 frag/s
- for lc before the series: 463458 frag/s (= mc.before +25%)
- for mc after the series: 429276 frag/s (= mc.before +16%)
- for lc after the series: 466495 frag/s (= mc.before +26%)
Refs #3874.
"
* tag 'sst3-opt-write-v2' of github.com:tgrabiec/scylla:
sstables: mc: Avoid serialization of promoted index when empty
sstables: mc: Avoid double serialization of rows
tests: sstable 3.x: Do not compare Statistics component
utils: Introduce memory_data_sink
schema: Optimize column count getters
sstables: checksummed_file_data_sink_impl: Bypass output_stream
(cherry picked from commit 4aa5d83590)
Currently, when advance_and_await() fails to allocate the new gate
object, it will throw bad_alloc and leave the phased_barrier object in
an invalid state. Calling advance_and_await() again on it will result
in undefined behavior (typically SIGSEGV) because _gate will be
disengaged.
One place affected by this is table::seal_active_memtable(), which
calls _flush_barrier.advance_and_await(). If this throws, subsequent
flush attempts will SIGSEGV.
This patch rearranges the code so that advance_and_await() has strong
exception guarantees.
Message-Id: <1542645562-20932-1-git-send-email-tgrabiec@scylladb.com>
Fixes #3931.
(cherry picked from commit 57e25fa0f8)
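The rearrangement can be sketched as "allocate first, modify state afterwards". The types below are hypothetical simplifications (no futures, no real gate); the injectable make_gate exists only so that an allocation failure can be simulated.

```cpp
#include <cstdint>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for a gate.
struct gate_sketch {
    unsigned pending = 0;
};

class phased_barrier_sketch {
    std::unique_ptr<gate_sketch> _gate = std::make_unique<gate_sketch>();
    std::uint64_t _phase = 0;
public:
    using gate_factory = std::function<std::unique_ptr<gate_sketch>()>;

    // Returns the gate of the previous phase, which the caller would wait on.
    std::unique_ptr<gate_sketch> advance(const gate_factory& make_gate) {
        auto new_gate = make_gate(); // may throw; no state modified yet
        ++_phase;                    // from here on nothing can throw
        return std::exchange(_gate, std::move(new_gate));
    }

    bool engaged() const { return bool(_gate); }
    std::uint64_t phase() const { return _phase; }
};
```

If make_gate throws, the barrier still holds its old gate and phase, so a later call behaves as if the failed one never happened.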
The linker uses an opt-in system for non-executable stack: if all object files
opt into a non-executable stack, the binary will have a non-executable stack,
which is very desirable for security. The compiler cooperates by opting into
a non-executable stack whenever possible (always for our code).
However, we also have an assembly file (for fast power crc32 computations).
Since it doesn't opt into a non-executable stack, we get a binary with
executable stack, which Gentoo's build system rightly complains about.
Fix by adding the correct incantation to the file.
Fixes #3799.
Reported-by: Alexys Jacob <ultrabug@gmail.com>
Message-Id: <20181002151251.26383-1-avi@scylladb.com>
(cherry picked from commit aaab8a3f46)
Reusable buffers are meant to be used when protocol or third-party
library limitations force us to allocate large contiguous buffers. There
isn't much that can be done about this so there is little point in
warning about that.
Fixes #3788.
Message-Id: <20180928085141.6469-1-pdziepak@scylladb.com>
This ensures that row::external_memory_usage() is invariant to
insertion order of cells.
It should be so, so that accounting of a clustering_row, merged from
multiple MVCC versions by the partition_snapshot_flat_reader on behalf
of a memtable flush, doesn't give a greater result than what is used
by the memtable region. Overaccounting leads to assertion failure in
~flush_memory_accounter.
Fixes #3625 (hopefully).
Message-Id: <1535982513-19922-1-git-send-email-tgrabiec@scylladb.com>
Reloading may hold value in the underlying loading_shared_values while
the corresponding cache values have already been deleted.
This may create weird situations like this:
<populate cache with 10 entries>
cache.remove(key1);
for (auto& e : cache) {
std::cout << e << std::endl;
}
<all 10 entries are printed, including the one for "key1">
In order to avoid such situations, make loading_cache::iterator
a transform_iterator over lru_list::iterator instead of loading_shared_values::iterator,
because lru_list contains entries only for cached items.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The reloading flow may hold items in the underlying loading_shared_values
after they have been removed (e.g. via the remove(key) API), so loading_shared_values.size()
does not represent the correct size of the loading_cache. lru_list.size(), on the other hand, does.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
"This series introduces a few improvements related to a reload flow.
From now on the callback may assume that the "key" parameter value
is kept alive till the end of its execution in the reloading flow.
It may also safely evict as many items from the cache as needed."
Fixes #3606
* 'loading_cache_improve_reload-v1' of https://github.com/vladzcloudius/scylla:
utils::loading_cache: hold a shared_value_ptr to the value when we reload
utils::loading_cache::on_timer(): remove not needed capture of "this"
utils::loading_cache::on_timer(): use chunked_vector for storing elements we want to reload
The flusher picks the memtable list which contains the largest region
according to region_impl::evictable_occupancy().total_space(), which
follows region::occupancy().total_space(). But only the latest
memtable in the list can start flushing. It can happen that the
memtable corresponding to the largest region was already flushed to an
sstable (flush permit released), but not yet fsynced or moved to
cache, so it's still in the memtable list.
The latest memtable in the winning list may be small, or empty, in
which case the soft pressure flusher will not be able to make much
progress. There could be other memtable lists with non-empty
(flushable) latest memtables. This can lead to writes unnecessarily
blocking on dirty.
I observed this for the system memtable group, where it's easy for the
memtables to overshoot small soft pressure limits. The flusher kept
trying to flush empty memtables, while the previous non-empty memtable
was still in the group.
The CPU scheduler makes this worse, because it runs memtable_to_cache
in a separate scheduling group, so it further defers in time the
removal of the flushed memtable from the memtable list.
This patch fixes the problem by making regions corresponding to
memtables which started flushing report evictable_occupancy() as 0, so
that they're picked by the flusher last.
Fixes #3716.
Message-Id: <1535040132-11153-2-git-send-email-tgrabiec@scylladb.com>
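The selection rule can be modelled in a few lines. This is a simplified sketch with hypothetical types (region_sketch, memtable_list_sketch), not the dirty memory manager code: the flusher picks the list that contains the largest region, and a region whose flush has started reports 0.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct region_sketch {
    std::size_t total_space = 0;
    bool flush_started = false;
    // After the fix: regions that started flushing report 0.
    std::size_t evictable_occupancy() const {
        return flush_started ? 0 : total_space;
    }
};

using memtable_list_sketch = std::vector<region_sketch>;

std::size_t largest_evictable(const memtable_list_sketch& list) {
    std::size_t best = 0;
    for (const auto& r : list) {
        best = std::max(best, r.evictable_occupancy());
    }
    return best;
}

// Index of the list the flusher would pick; `lists` must not be empty.
std::size_t pick_list_to_flush(const std::vector<memtable_list_sketch>& lists) {
    auto it = std::max_element(lists.begin(), lists.end(),
        [](const memtable_list_sketch& a, const memtable_list_sketch& b) {
            return largest_evictable(a) < largest_evictable(b);
        });
    return static_cast<std::size_t>(it - lists.begin());
}
```

In this model a list holding a large, already flushed memtable and an empty latest memtable no longer wins over a list whose latest memtable is small but flushable.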
Let the user specify which scheduling group should run the
releaser, since it is running functions on the user's behalf.
Perhaps a cleaner interface is to require the user to call
a long-running function for the releaser, and so we'd just
inherit its scheduling group, but that's a much bigger change.
"
This series changes the native CQL3 protocol layer so that it works with
fragmented buffers instead of a single temporary_buffer per request.
The main part is fragmented_temporary_buffer which represents a
fragmented buffer consisting of multiple temporary_buffers. It provides
helpers for reading fragmented buffer from an input_stream, interpreting
the data in the fragmented buffer, as well as a view that satisfies
the FragmentRange concept.
There are still situations where a fragmented buffer is linearised. That
includes decompressing client requests (this uses reusable buffers in a
similar way to the code that sends compressed responses), CQL statement
restrictions and values that are hard-coded in prepared statements
(hopefully, the values in those cases will be small), value validation
in some cases (blobs are not validated, irrelevant for many fixed-size
small types, but may be a problem for large text cells) as well as
operations on collections.
Tests: unit(release), dtests(cql_prepared_test.py, cql_tests.py, cql_additional_tests.py)
"
* tag 'fragmented-cql3-receive/v1' of https://github.com/pdziepak/scylla: (23 commits)
types: bytes_view: override fragmented validate()
cql3: value_view: switch to fragmented_temporary_buffer::view
types: add validate that accepts fragmented_temporary_buffer::view
cql3 query_options: add linearize()
cql3: query_options: use bytes_ostream for temporaries
cql3: operation: make make_cell accept fragmented_temporary_buffer::view
atomic_cell: accept fragmented_temporary_buffer::view values
cql3: avoid ambiguity in a call to update_parameters::make_cell()
transport: switch to fragmented_temporary_buffer
transport: extract compression buffers from response class
tests/reusable_buffer: test fragmented_temporary_buffer support
utils: reusable_buffer: support fragmented_temporary_buffer
tests: add test for fragmented_temporary_buffer
util fragment_range: add general linearisation functions
utils: add fragmented_temporary_buffer
tests: add basic test for transport requests and responses
tests/random-utils: print seed
tests/random-utils: generate sstrings
cql3: add value_view printer and equality comparison
transport: move response outside of cql_server class
...
The overloaded_functor class template can be used to encompass multiple
lambdas accepting different types into a single callable object that can
be used with any of those types.
One application is visitors for std::variant where different handling is
required for different types.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
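A minimal sketch of the usual C++17 form of this idiom, applied to a std::variant visitor. The definition in the tree may differ in detail; cell_value and describe() are hypothetical names used only for the example.

```cpp
#include <string>
#include <variant>

// Inherit from every lambda and expose all of their call operators.
template <typename... Fs>
struct overloaded_functor : Fs... {
    using Fs::operator()...;
};

// Deduction guide so the lambdas' types are deduced from the initializer.
template <typename... Fs>
overloaded_functor(Fs...) -> overloaded_functor<Fs...>;

// Hypothetical variant type for the example.
using cell_value = std::variant<int, std::string>;

std::string describe(const cell_value& v) {
    return std::visit(overloaded_functor{
        [](int i) { return "int:" + std::to_string(i); },
        [](const std::string& s) { return "string:" + s; },
    }, v);
}
```

Overload resolution picks the lambda matching the alternative currently held by the variant, so each type gets its own handling without a hand-written visitor class.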
reusable_buffer already supports bytes_ostream which is often used for
handling data sent from Scylla. This patch adds support for
fragmented_temporary_buffer which is going to be mainly used for data
received by Scylla.
Seastar output_streams produce temporary_buffer<char>s.
fragmented_temporary_buffer represents a single fragmented buffer that
consists of, possibly multiple, temporary_buffer<char>s.
This allows removing the requirement to hold the key value inside the
_load callback if its value is needed in the asynchronous continuation
inside the callback in the context of a reload.
This also resolves the use-after-free issue when a _load() callback removes
the item for a given key.
See a9b72db34d.1528794135.git.bdenes%40scylladb.com
for a discussion about this.
In addition this patch makes the loading_cache more robust for any existing
and potential situations when cached entries are being removed from inside the
callback. This is achieved by extending the idea implemented by Duarte in the
"utils/loading_cache: Avoid using invalidated iterators" by capturing timestamped_val_ptr
(which is essentially a lw_shared_ptr to an intrusive set entry which holds both the key
and the cached value) instead of a naked pointer.
Tests {debug, release}:
- Unit tests:
- loading_cache_test
- view_build_test
- auth_test
- auth_resource_test
- dtest:
- auth_test.py
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
The list of elements that needs to be reloaded may be rather large.
Use chunked_vector in order to make the allocator's life easier.
Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>
When periodically reloading the values in the loading_cache, we would
iterate over the list of entries and call the load() function for
those which need to be reloaded.
For some concrete caches, load() can remove the entry from the LRU set,
and can be executed inline from the parallel_for_each(). This means we
could potentially keep iterating using an invalidated iterator.
Fix this by using a temporary container to hold those entries to be
reloaded.
Spotted when reading the code.
Also use if constexpr and fix the comment in the function containing
the changes.
Signed-off-by: Duarte Nunes <duarte@scylladb.com>
Message-Id: <20180712124143.13638-1-duarte@scylladb.com>
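The temporary-container approach can be sketched as follows, assuming a simplified synchronous model: entry_sketch and reload_expired() are hypothetical, and keys are copied here for simplicity, whereas the real code deals with cache entries and asynchronous loads.

```cpp
#include <functional>
#include <list>
#include <string>
#include <vector>

struct entry_sketch {
    std::string key;
    bool needs_reload;
};

// load() may erase entries from `lru`, so it must not run while `lru`
// is being iterated.
void reload_expired(std::list<entry_sketch>& lru,
                    const std::function<void(const std::string&)>& load) {
    std::vector<std::string> to_reload; // snapshot taken before any load()
    for (const auto& e : lru) {
        if (e.needs_reload) {
            to_reload.push_back(e.key);
        }
    }
    for (const auto& key : to_reload) {
        load(key); // safe: iterates the snapshot, not the LRU list
    }
}
```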
The continuation attached to _load() needs the key of the loaded entry
to check whether it was disposed during the load. However if _load()
invalidates the entry the continuation's capture line will access
invalid memory while trying to obtain the key.
To avoid this save a copy of the key before calling _load() and pass it
to both _load() and the continuation.
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <b571b73076ca863690f907fbd3fb4ff54e597b28.1531393608.git.bdenes@scylladb.com>
"
If there are a lot of partitions in the index page, index_list may grow large
and require large contiguous blocks of memory, because it's based on
std::vector. That puts pressure on the memory allocator, and if memory is
fragmented, may not be possible to satisfy without a lot of eviction. Switch
to chunked_vector to avoid this.
Refs #3597
"
* 'tgrabiec/avoid-large-alloc-in-index-reader' of github.com:tgrabiec/scylla:
sstables: Switch index_list to chunked_vector to avoid large allocations
utils: chunked_vector: Do not require T to be default-constructible for clear()
utils: chunked_vector: Implement front()
resize(), used by clear(), requires T to be default-constructible in
case the vector is expanded. It's not actually needed for clearing,
and there will be users which use clear() with
non-default-constructible T, so implement clear() without using
resize().
std::function's move constructor is not noexcept, so observer's move
constructor and assignment operator also cannot be. Switch to Seastar's
noncopyable_function which provides better guarantees.
Tests: observer_tests (release)
Message-Id: <20180710073628.30702-1-avi@scylladb.com>
An observable is used to decouple an information producer from a consumer
(in the same way as a callback), while allowing multiple consumers (called
observers) to coexist and to manage their lifetime separately.
Two classes are introduced:
observable: a producer class; when an observable is invoked all observers
receive the information
observer: a consumer class; receives information from an observable
Modelled after boost::signals2, with the following changes
- all signals return void; information is passed from the producer to
the consumer but not back
- thread-unsafe
- modern C++ without preprocessor hacks
- connection lifetime is always managed rather than leaked by default
- renamed to avoid the funky "slot" name
Message-Id: <20180709172726.5079-1-avi@scylladb.com>
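A stripped-down sketch of the producer/consumer relationship, under stated assumptions: observable_sketch is a hypothetical simplification, it uses std::function for brevity, and it assumes the observable outlives all of its observers. The actual classes may differ.

```cpp
#include <functional>
#include <iterator>
#include <list>
#include <utility>

template <typename... Args>
class observable_sketch {
    using callback = std::function<void(Args...)>;
    std::list<callback> _callbacks;
public:
    // RAII handle: the connection ends when the observer is destroyed.
    class observer {
        observable_sketch* _source = nullptr;
        typename std::list<callback>::iterator _pos;
        friend class observable_sketch;
        observer(observable_sketch* s,
                 typename std::list<callback>::iterator p)
            : _source(s), _pos(p) {}
    public:
        observer(observer&& o) noexcept
            : _source(std::exchange(o._source, nullptr)), _pos(o._pos) {}
        observer(const observer&) = delete;
        observer& operator=(const observer&) = delete;
        ~observer() { disconnect(); }
        void disconnect() {
            if (_source) {
                _source->_callbacks.erase(_pos);
                _source = nullptr;
            }
        }
    };

    observer observe(callback cb) {
        _callbacks.push_back(std::move(cb));
        return observer(this, std::prev(_callbacks.end()));
    }

    // Invoking the observable passes the information to every observer.
    void operator()(Args... args) const {
        for (const auto& cb : _callbacks) {
            cb(args...);
        }
    }
};
```

Once an observer goes out of scope, later invocations of the observable no longer reach its callback, which is the managed lifetime mentioned in the list above.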
"
Partition snapshots go away when the last read using the snapshot is done.
Currently we will synchronously attempt to merge partition versions on this event.
If partitions are large, that may stall the reactor for a significant amount of time,
depending on the size of newer versions. Cache update on memtable flush can
create especially large versions.
The solution implemented in this series is to allow merging to be preemptable,
and continue in the background. Background merging is done by the mutation_cleaner
associated with the container (memtable, cache). There is a single merging process
per mutation_cleaner. The merging worker runs in a separate scheduling group,
introduced here, called "mem_compaction".
When the last user of a snapshot goes away the snapshot is slid to the
oldest unreferenced version first so that the version is no longer reachable
from partition_entry::read(). The cleaner will then keep merging preceding
(newer) versions into it, until it merges a version which is referenced. The
merging is preemptable. If the initial merging is preempted, the snapshot is
enqueued into the cleaner, the worker woken up, and merging will continue
asynchronously.
When memtable is merged with cache, its cleaner is merged with cache cleaner,
so any outstanding background merges will be continued by the cache cleaner
without disruption.
This reduces scheduling latency spikes in tests/perf_row_cache_update
for the case of large partition with many rows. For -c1 -m1G I saw
them dropping from >23ms to 1-2ms. System-level benchmark using scylla-bench
shows a similar improvement.
"
* tag 'tgrabiec/merge-snapshots-gradually-v4' of github.com:tgrabiec/scylla:
tests: perf_row_cache_update: Test with an active reader surviving memtable flush
memtable, cache: Run mutation_cleaner worker in its own scheduling group
mutation_cleaner: Make merge() redirect old instance to the new one
mvcc: Use RAII to ensure that partition versions are merged
mvcc: Merge partition version versions gradually in the background
mutation_partition: Make merging preemtable
tests: mvcc: Use the standard maybe_merge_versions() to merge snapshots
Fixes #3546
Both older origin and scylla write "known" compressor names (i.e. those
in origin namespace) unqualified (e.g. LZ4Compressor).
This behaviour was not preserved in the virtualization change. But
probably should be.
Message-Id: <20180627110930.1619-1-calle@scylladb.com>