db::config is a global class; changes in any module can cause changes
in db::config. Therefore, it is a cause of needless recompilation.
Remove some of these dependencies by having consumers of db::config
declare an intermediate config struct that is contains only
configuration of interest to them, and have their caller fill it out
(in the case of auth, it already followed this scheme and the patchset
only moves the translation function).
In addition, some outright pointless inclusions of db/config.hh are
removed.
The result is somewhat shorter compile times, and fewer needless
recompiles.
* https://github.com/avikivity/scylla unconfig-1/v1:
config: remove inclusions of db/config.hh from header files
repair: remove unneeded config.hh inclusion
batchlog_manager: remove dependency on db::config
auth: remove permissions_cache dependency on db::config
auth: remove auth::service dependency on db::config
auth: remove unneeded db/config.hh includes
_ck_blocks_header is a 64-bit variable, so the mask should be 64 bits too.
Otherwise, a shift in the range 32-63 will produce wrong results.
Fix by using a 64-bit mask.
Found by Fedora 29's ubsan.
Fixes#3973.
Message-Id: <20181209120549.21371-1-avi@scylladb.com>
"
Make major compaction aware of compaction strategy, by using an
optimal approach which suits the strategy needs.
Refs #1431.
"
* 'compaction_strategy_aware_major_compaction_v2' of github.com:raphaelsc/scylla:
tests: add test for compaction-strategy-aware major compaction
compaction: implement major compaction heuristic for leveled strategy
compaction: introduce notion of compaction-strategy-aware major compaction
Instead, distribute those inclusions to .cc files that require them. This
reduces rebuilds when config.hh changes, and makes it easier to locate files
that need config disaggregation.
Major compaction for leveled strategy will now create a run of
non-overlapping sstables at the highest level. Until now, a single
sstable would be created at level 0 which was very suboptimal because
all data would need to climb up the levels again, making it a very
expensive I/O process.
Refs #1431.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's only the very first step which introduces the machinery for making
major compaction aware of all strategies. By the time being, default
implementation is used for them all which only suits size tiered.
Refs #1431.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Commit cc6c383249 has fixed an issue with
incorrectly tracking max_local_deletion_time and the check in
validate_max_local_deletion_time was called to work around old files.
This fix relaxes conditions for enforcing defaut max_local_deletion_time
so that they don't apply to SSTables in 'mc' format because the original
problem has been resolved before 'mc' format have been introduced.
This is needed to be able to read correct values from
Cassandra-generated SSTables that don't have a Scylla.db component.
Its presence or absence is used as an indicator of possibly affected
files.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
For tiny index files (< 8 bytes long) it could turn to zero and trigger
an assertion in prepare_summary().
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
Before the reader was just ignoring such columns but this creates a risk of data loss.
Refs #2598
"
* 'haaawk/2598/v3' of github.com:scylladb/seastar-dev:
sstables: Add test_sstable_reader_on_unknown_column
sstables: Exception on sstable's column not present in schema
sstables: store column name in column_translation::column_info
sstables: Make test_dropped_column_handling test dropped columns
Fixes the condition which determines whether a row ttl should be used for a cell
and adds a test that uses each generated mutation to populate mutation source
and then verifies that it can read back the same mutation.
* seastar-dev.git haaawk/sst3/write-read-test/v3:
Fix use_row_ttl condition
Add test_all_data_is_read_back
Previous condition was wrong and was using row ttl too often.
We also have to change test_dead_row_marker to compare
resulting sstable with sstable generated by Origin not
by sstableupgrade.
This is because sstableupgrade transmits information about deleted row
marker automatically to cells in that row.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
"
This is a small step in fixing issue #2347. It is mostly tests and
testing infrastructure, but it does include a fix for a case where we
were missing the filename in the malformed_sstable_exception.
"
* 'espindola/sstable-corruption-v2' of https://github.com/espindola/scylla:
Add a filename to a malformed_sstable_exception.
Try to read the full sst in broken_sst.
Convert tests to SEASTAR_THREAD_TEST_CASE.
Check the exception message.
Move some tests to broken_sstable_test.cc
It is reasonable for parse() to throw when it finds something wrong
with the format. This seems to be the best spot to add the filename
and rethrow.
Also add a testcase to make sure we keep handling this error
gracefully.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
These loops have the structure :
while (true) {
switch (state) {
case state1:
...
break;
case state2:
if (...) { ... break; } else {... continue; }
...
}
break;
}
There a couple things I find a bit odd on that structure:
* The break refers to the switch, the continue to the loop.
* A while (true) loop always hits a break or a continue.
This patch uses early returns to simplify the logic to
while (true) {
switch (state) {
case state1:
...
return
case state2:
if (...) { ... return; }
...
}
}
Now there are no breaks or continues.
Tests: unit (release)
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181126171726.84629-1-espindola@scylladb.com>
"
zlib's crc32_combine() is not very efficient. It is faster to re-combine
the buffer using crc32(). It's still substantial amount of work which
could be avoided.
This patch introduces a fast implementation of crc32_combine() which
uses a different algorithm than zlib. It also utilizes intrinsics for
carry-less multiplication instruction to perform the computation faster.
The details of the algorithm can be found in code comments.
Performance results using perf_checksum and second buffer of length 64 KiB:
zlib CRC32 combine: 38'851 ns
libdeflate CRC32: 4'797 ns
fast_crc32_combine(): 11 ns
So the new implementation is 3500x faster than zlib's, and 417x faster than
re-checksumming the buffer using libdeflate.
Tested on i7-5960X CPU @ 3.00GHz
Performance was also evaluated using sstable writer benchmark:
perf_fast_forward --populate --sstable-format=mc --data-directory /tmp/perf-mc \
--value-size=10000 --rows 1000000 --datasets small-part
It yielded 9% improvement in median frag/s (129'055 vs 117'977).
Refs #3874
"
* tag 'fast-crc32-combine-v2' of github.com:tgrabiec/scylla:
tests: perf_checksum: Test fast_crc32_combine()
tests: Rename libdeflate_test to checksum_utils_test
tests: libdeflate: Add more tests for checksum_combine()
tests: libdeflate: Check both libdeflate and default checksummers
sstables: Use fast_crc_combine() in the default checksummer
utils/gz: Add fast implementation of crc32_combine()
utils/gz: Add pre-computed polynomials
utils/gz: Import Barett reduction implementation from libdeflate
utils: Extract clmul() from crc.hh
Before this patch we were writing offset map enteies in unspecified
order, the one returned by std::unorderd_map. Cassandra writes them
sorted by metadata_type. Use the same order for improved
compatibility.
Fixes#3955.
Message-Id: <1543846649-22861-1-git-send-email-tgrabiec@scylladb.com>
When writing the sstable, create a temporary directory
for creating all components so that each sstable files' will be
assigned a different allocaton groups on xfs.
Files are immediately renamed to their default location after creation.
Temp directory is removed when the sstable is sealed.
Additional work to be introduced in the following patches:
- When populating tables, temp directories need to be looked up and removed.
Fixes#3167
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for per-sstable sub directory.
Also, these functions get most of their parameters from the sst at hand so they might
as well be first class members.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In case if summary is missing and we attempt to re-generate it,
statistics must be already read to provide us with values stored in
serialization header to facilitate clustering prefixes deserialization.
Fixes#3947
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
Previously such column was ignored but it's better to be explicit
about this situation.
Refs #2598
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
As far as I can tell the old sstable reading code required reading the
data into a contiguous buffer. The function data_consume_rows_at_once
implemented the old behavior and incrementally code was moved away
from it.
Right now the only use is in two tests. The sstables used in those
tests are already used in other tests with data_consume_rows.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181127024319.18732-2-espindola@scylladb.com>
"
One part of the improvement comes from replacing zlib's CRC32 with the one
from libdeflate, which is optimized for modern architecture and utilizes the
PCLMUL instruction.
perf_checksum test was introduced to measure performance of various
checksumming operations.
Results for 514 B (relevant for writing with compression enabled):
test iterations median mad min max
crc_test.perf_deflate_crc32_combine 58414 16.711us 3.483ns 16.708us 16.725us
crc_test.perf_adler_combine 165788278 6.059ns 0.031ns 6.027ns 7.519ns
crc_test.perf_zlib_crc32_combine 59546 16.767us 26.191ns 16.741us 16.801us
---
crc_test.perf_deflate_crc32_checksum 12705072 83.267ns 4.580ns 78.687ns 98.964ns
crc_test.perf_adler_checksum 3918014 206.701ns 23.469ns 183.231ns 258.859ns
crc_test.perf_zlib_crc32_checksum 2329682 428.787ns 0.085ns 428.702ns 510.085ns
Results for 64 KB (relevant for writing with compression disabled):
test iterations median mad min max
crc_test.perf_deflate_crc32_combine 25364 38.393us 17.683ns 38.375us 38.545us
crc_test.perf_adler_combine 169797143 5.842ns 0.009ns 5.833ns 6.901ns
crc_test.perf_zlib_crc32_combine 26067 38.663us 95.094ns 38.546us 40.523us
---
crc_test.perf_deflate_crc32_checksum 202821 4.937us 14.426ns 4.912us 5.093us
crc_test.perf_adler_checksum 44684 22.733us 206.263ns 22.492us 25.258us
crc_test.perf_zlib_crc32_checksum 18839 53.049us 36.117ns 53.013us 53.274us
The new CRC32 implementation (deflate_crc32) doesn't provide a fast
checksum_combine() yet, it delegates to zlib so it's as slow as the latter.
Because for CRC32 checksum_combine() is several orders of magnitude slower
than checksum(), we avoid calling checksum_combine() completely for this
checksummer. We still do it for adler32, which has combine() which is faster
than checksum().
SStable write performance was evaluated by running:
perf_fast_forward --populate --data-directory /tmp/perf-mc \
--rows=10000000 -c1 -m4G --datasets small-part
Below is a summary of the average frag/s for a memtable flush. Each result is
an average of about 20 flushes with stddev of about 4k.
Before:
[1] MC,lz4: 330'903
[2] LA,lz4: 450'157
[3] MC,checksum: 419'716
[4] LA,checksum: 459'559
After:
[1'] MC,lz4: 446'917 ([1] + 35%)
[2'] LA,lz4: 456'046 ([2] + 1.3%)
[3'] MC,checksum: 462'894 ([3] + 10%)
[4'] LA,checksum: 467'508 ([4] + 1.7%)
After this series, the performance of the MC format writer is similar to that
of the LA format before the series.
There seems to be a small but consistent improvement for LA too. I'm not sure
why.
"
* tag 'improve-mc-sstable-checksum-libdeflate-v3' of github.com:tgrabiec/scylla:
tests: perf: Introduce perf_checksum
tests: Add test for libdeflate CRC32 implementation
sstables: compress: Use libdeflate for crc32
sstables: compress: Rename crc32_utils to zlib_crc32_checksummer
licenses: Add libdeflate license
Integrate libdeflate with the build system
Add libdeflate submodule
sstables: Avoid checksum_combine() for the crc32 checksummer
sstables: compress: Avoid unnecessary checksum_combine()
sstables: checksum_utils: Add missing include
Improves memtable flush performance by 10% in a CPU-bound case.
Unlike the zlib implementation, libdeflate is optimized for modern
CPUs. It utilizes the PCLMUL instruction.
checksum_combine() is much slower than re-feeding the buffer to
checksum() for the zlib CRC32 checksummer.
Introduce Checksum::prefer_combine() to determine this and select
more optimal behavior for given checksummer.
Improves performance of memtable flush with compression enabled by 30%.
"
Previously we were checking for schema incompatibility between current schema and sstable
serialization header before reading any data. This isn't the best approach because
data in sstable may be already irrelevant due to column drop for example.
This patchset moves the check after actual data is read and verified that it has
a timestamp new enough to classify it as nonobsolete.
Fixes#3924
"
* 'haaawk/3924/v3' of github.com:scylladb/seastar-dev:
sstables: Enable test_schema_change for MC format
sstables3: Throw error on schema mismatch only for live cells
sstables: Pass column_info to consume_*_column
sstables: Add schema_mismatch to column_info
sstables: Store column data type in column_info
sstables: Remove code duplication in column_translation
Previously we were throwing exception during the creation of
column_translation. This wasn't always correct because sometimes
column for which the mismatch appeared was already dropped and
data present in sstable should be ignored anyway.
Fixes#3924
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
do_process_buffer had two unreachable default cases and a long
if-else-if chain.
This converts the the if-else-if chain to a switch and a helper
function.
This moves the error checking from run time to compile time. If we
were to add a 128 bit integer for example, gcc would complain about it
missing from the switch.
Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com>
Message-Id: <20181125221451.106067-1-espindola@scylladb.com>
This is needed for parallel compaction to work with sstable run based approach.
That's because regular compaction clones a set containing all sstables of its
column family. So compaction A can potentially hold a reference to a compacting
sstable of compaction B, so preventing compacting B from releasing its exhausted
sstable.
So all replacements are propagated to all compactions of a given column family,
and compactions in turn, including the one which initiated the propagation,
will do the replacement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
motivation is that we need a more efficient way to find compactions
that belong to a given column family in compaction list.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Motivation is that it will be useful for catching regression on compaction
when releasing early exhausted sstables. That's because sstable's space
is only released once it's closed. So this will allow us to write a test
case and possibly use it for entities holding exhausted sstable.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Filter out sstable belonging to a partial run being generated by an ongoing
compaction. Otherwise, that could lead to wrong decisions by the compaction
strategy.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
SSTables composing the same run will share the same run identifier.
Therefore, a new compaction strategy will be able to get all sstables belong
to the same run from sstable_set, which now keeps track of existing runs.
Same UUID is passed to writers of a given compaction. Otherwise, a new UUID
is picked for every sstable created by compaction.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
sstable run is a structure that will hold all sstables that has the same
run identifier. All sstables belonging to the same run will not overlap
with one another.
It can be used by compaction strategy to work on runs instead of individual
sstables.
sstable_set structure which holds all sstables for a given column family
will be responsible for providing to its user an interface to work with
runs instead of individual sstables.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
That's important for the reference to sstable to not be kept throughout
the compaction procedure, which would break the goal of releasing
space during compaction.
Manager passes a callback to compaction which calls it whenever
there's sstable replacement.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Motivation is that we want to release space for exhausted sstable and that
will only happen when all references to it are gone *and* that backlog
tracker takes the early replacement into account.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>