155 Commits

Author SHA1 Message Date
Botond Dénes
e8f3d7dd13 sstables/index_reader: short-circuit fast-forward-to when at EOF
Attempting to call advance_to() on the index, after it is positioned at
EOF, can result in an assert failure, because the operation results in
an attempt to move backwards in the index-file (to read the last index
page, which was already read). This only happens if the index cache
entry belonging to the last index page is evicted, otherwise the advance
operation just looks-up said entry and returns it.
To prevent this, we add an early return conditioned on eof() to all the
partition-level advance-to methods.
A regression unit test reproducing the above described crash is also
added.
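
A minimal sketch of the guard, using a toy cursor rather than the real
index_reader. All names here (toy_index_cursor, read_page, evict_cache)
are hypothetical; only the early return on eof() mirrors the change
described above.

```cpp
#include <cassert>
#include <cstddef>

// Toy stand-in for an index cursor; not the real sstables::index_reader.
class toy_index_cursor {
    static constexpr std::size_t no_page = static_cast<std::size_t>(-1);

    std::size_t _page_count;            // number of index pages in the file
    std::size_t _next_page = 0;         // file position: earlier pages were already read
    std::size_t _cached_page = no_page; // the single page currently held in the cache
    std::size_t _reads = 0;             // how many reads hit the file
    bool _eof = false;

    // Models an uncached read: the file can only be read forwards.
    void read_page(std::size_t page) {
        assert(page >= _next_page && "index file reads must not go backwards");
        _next_page = page + 1;
        _cached_page = page;
        ++_reads;
    }

    // Returns the page from the cache when possible, otherwise reads it.
    void load(std::size_t page) {
        if (page != _cached_page) {
            read_page(page);
        }
    }

public:
    explicit toy_index_cursor(std::size_t page_count) : _page_count(page_count) {}

    bool eof() const { return _eof; }
    std::size_t reads() const { return _reads; }
    void evict_cache() { _cached_page = no_page; }

    // Positions the cursor on `page`; anything past the last page means EOF.
    void advance_to(std::size_t page) {
        if (eof()) {
            return;  // the short-circuit: never touch the file again once at EOF
        }
        if (page >= _page_count) {
            if (_page_count > 0) {
                load(_page_count - 1);  // the last page is needed to establish EOF
            }
            _eof = true;
            return;
        }
        load(page);
    }
};
```

In this model, removing the eof() check makes a second advance_to() past
the end trip the assert in read_page() once the cache has been evicted.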
2022-05-05 14:42:37 +03:00
Raphael S. Carvalho
791403e4bb sstables: Fix deletion of partial SSTables
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
The temporary TOC content is written upfront, so that all partial
components can be deleted using that content if the write fails.

After commit e5fc4b6, partial sst cannot be deleted because deletion
procedure is incorrectly assuming all SSTs being deleted unconditionally
have TOC, but partial SSTs only have TMP TOC instead.
That happens because parent_path() requires all path components to
exist due to its usage of fs::path::canonical.

The consequence of this is that the space of partial files cannot be
reclaimed, making it harder for Scylla to recover from ENOSPC,
which it could otherwise do by selecting a set of files for compaction
with a higher chance of succeeding given the free space.

This is fixed by only calling parent_path() on TMP TOC, which is
guaranteed to exist prior to calling fsync_directory().

Fixes #10410.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-26 11:00:27 -03:00
Avi Kivity
975b0c0b03 Merge "tools/scylla-sstable: add validate-checksums and decompress" from Botond
"
This patchset adds two new operations to scylla-sstable:
* validate-checksums - helps identify whether an sstable is intact or
  not, by checking the digest and the per-chunk checksums against the
  data on disk.
* decompress - helps when one wants to manually examine the content of a
  compressed sstable.

Refs: #497

Tests: unit(dev)
"

* 'scylla-sstable-validate-checksums-decompress/v3' of https://github.com/denesb/scylla:
  tools/scylla-sstable: consume_sstables(): s/no_skips/use_crawling_reader/
  tools/scylla-sstable: add decompress operation
  tools/scylla-sstables: add validate-checksums operation
  sstables/sstable: add validate_checksums()
  sstables/sstable: add raw_stream option to data_stream()
  sstables/sstable: make data_stream() and data_read() public
  utils/exceptions: add maybe_rethrow_exception()
2022-03-16 18:56:48 +02:00
Botond Dénes
ddf9dee9d8 sstables/sstable: add validate_checksums()
Sstables have two kinds of checksums: per-chunk checksums and
full-checksum (digest) calculated over the entire content of Data.db.

The full-checksum (digest) is stored in Digest.crc
(component_type::Digest).

When compression is used, the per-chunk checksum is stored directly
inside Data.db, after each compressed chunk. These are validated on
read, when decompressing the respective chunks.
When no compression is used, the per-chunk checksum is stored separately
in CRC.db (component_type::CRC). Chunk size is defined and stored in said
component as well.

In both compressed and uncompressed sstables, checksums are calculated
on the data that is actually written to disk, so in case of compressed
data, on the compressed data.

This method validates both the full checksum and the per-chunk checksum
for the entire Data.db.
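
The two kinds of checksums can be sketched as follows. This is an
illustration only: toy_crc32, toy_checksums and the other names are
hypothetical, a plain bitwise CRC-32 stands in for whatever checksum the
sstable format actually uses, and an in-memory string stands in for the
on-disk bytes of Data.db.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320), used only as a stand-in.
// `seed` is the CRC of the preceding bytes, so calls can be chained.
inline std::uint32_t toy_crc32(const std::string& data, std::uint32_t seed = 0) {
    std::uint32_t crc = ~seed;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

struct toy_checksums {
    std::size_t chunk_size;                // size of every chunk except possibly the last
    std::vector<std::uint32_t> per_chunk;  // one checksum per chunk
    std::uint32_t digest;                  // checksum over the entire content
};

// Computes both kinds of checksums over the bytes as they are stored on disk.
inline toy_checksums compute_checksums(const std::string& on_disk, std::size_t chunk_size) {
    assert(chunk_size > 0);
    toy_checksums out{chunk_size, {}, 0};
    for (std::size_t off = 0; off < on_disk.size(); off += chunk_size) {
        const std::string chunk = on_disk.substr(off, chunk_size);
        out.per_chunk.push_back(toy_crc32(chunk));
        out.digest = toy_crc32(chunk, out.digest);
    }
    return out;
}

// Validates both the per-chunk checksums and the full digest.
inline bool toy_validate_checksums(const std::string& on_disk, const toy_checksums& expected) {
    const toy_checksums actual = compute_checksums(on_disk, expected.chunk_size);
    return actual.per_chunk == expected.per_chunk && actual.digest == expected.digest;
}
```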
2022-03-15 14:52:15 +02:00
Mikołaj Sielużycki
1d84a254c0 flat_mutation_reader: Split readers by file and remove unnecessary includes.
The flat_mutation_reader files were conflated and contained multiple
readers, which was not strictly necessary. Splitting improves
iterative compilation times, as touching rarely used readers no longer
recompiles large chunks of the codebase. Total compilation times are also
improved, as the sizes of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many files in the codebase.

With changes

real	29m14.051s
user	168m39.071s
sys	5m13.443s

Without changes

real	30m36.203s
user	175m43.354s
sys	5m26.376s

Closes #10194
2022-03-14 13:20:25 +02:00
Botond Dénes
959483a2dc test: migrate to the v2 variant of the sstable writer API 2022-03-10 09:16:33 +02:00
Botond Dénes
105bf8888a sstables: convert mx writer to v2
The sstables::sstable class has two methods for writing sstables:
1) sstable_writer get_writer(...);
2) future<> write_components(flat_mutation_reader, ...);

(1) directly exposes the writer type, so we have to update all users of
it (there are not that many) in this same patch. We defer updating
users of (2) to follow-up commits.
2022-03-10 07:03:49 +02:00
Avi Kivity
cbba80914d memtable: move to replica module and namespace
Memtables are a replica-side entity, and so are moved to the
replica module and namespace.

Memtables are also used outside the replica, in two places:
 - in some virtual tables; this is also in some way inside the replica,
   (virtual readers are installed at the replica level, not the
   coordinator), so I don't consider it a layering violation
 - in many sstable unit tests, as a convenient way to create sstables
   with known input. This is a layering violation.

We could make memtables their own module, but I think this is wrong.
Memtables are deeply tied into replica memory management, and trying
to make them a low-level primitive (at a lower level than sstables) will
be difficult. Not least because memtables use sstables. Instead, we
should have a memtable-like thing that doesn't support merging and
doesn't have all other funky memtable stuff, and instead replace
the uses of memtables in sstable tests with some kind of
make_flat_mutation_reader_from_unsorted_mutations() that does
the sorting that is the reason for the use of memtables in tests (and
live with the layering violation meanwhile).

Test: unit (dev)

Closes #10120
2022-02-23 09:05:16 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Botond Dénes
64bb48855c flat_mutation_reader: revamp flat_mutation_reader_from_mutations()
Add schema parameter so that:
* Caller has better control over schema -- especially relevant for
  reverse reads where it is not possible to follow the convention of
  passing the query schema which is reversed compared to that of the
  mutations.
* Now that we don't depend on the mutations for the schema, we can lift
  the restriction on mutations not being empty: this leads to safer
  code. When the mutations parameter is empty, an empty reader is
  created.
Add "make_" prefix to follow convention of similar reader factory
functions.
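
A toy sketch of the resulting factory shape. The types and the function
name below (toy_schema, toy_mutation, toy_reader,
make_toy_reader_from_mutations) are hypothetical stand-ins, not the real
API; the point is only that with the schema passed in explicitly, an
empty mutation list can yield a valid, empty reader.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

struct toy_schema { bool reversed; };
struct toy_mutation { int key; };

class toy_reader {
    toy_schema _schema;                    // supplied by the caller, not derived from mutations
    std::vector<toy_mutation> _mutations;
    std::size_t _next = 0;
public:
    toy_reader(toy_schema schema, std::vector<toy_mutation> mutations)
        : _schema(schema), _mutations(std::move(mutations)) {}

    const toy_schema& schema() const { return _schema; }
    bool is_end_of_stream() const { return _next == _mutations.size(); }

    const toy_mutation& next() {
        assert(!is_end_of_stream());
        return _mutations[_next++];
    }
};

// An empty `mutations` vector is fine: the reader is simply at end of stream.
inline toy_reader make_toy_reader_from_mutations(toy_schema schema,
                                                 std::vector<toy_mutation> mutations) {
    return toy_reader(schema, std::move(mutations));
}
```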

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211115155614.363663-1-bdenes@scylladb.com>
2021-11-15 17:58:46 +02:00
Pavel Emelyanov
a2590368ce test: Generalize all make_sstable_easy()-s
There are already four of them. Those working with the mutation reader
can be folded into one with some default args.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Pavel Emelyanov
28e5307ce2 test: Reuse make_sstable_easy in datafile tests
This patch is two-fold. First it changes the signature of the
local helper to facilitate next patching. Second, it makes more
relevant places in the test use this helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Pavel Emelyanov
5515f7187d range_tombstone, code: Add range_tombstone& getters
Currently all the code operates on the range_tombstone class,
and many of those places get the range tombstone in question
from the range_tombstone_list. Next patches will make that list
carry (and return) some new object called range_tombstone_entry,
so all the code that expects to see the former one there will
need to be patched to get the range_tombstone from the _entry one.

This patch prepares the ground for that by introducing the

    range_tombstone& tombstone() { return *this; }

getter on the range_tombstone itself and patching all future
users of the _entry to call .tombstone() right now.

Next patch will remove those getters together with adding the new
range_tombstone_entry object thus automatically converting all
the patched places into using the entry in a proper way.
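
A sketch of why the transitional getter helps, with toy types standing in
for the real classes (toy_range_tombstone, toy_range_tombstone_entry and
newest_timestamp are hypothetical). Code written against .tombstone()
works unchanged whether the list holds the tombstone itself or the
future entry wrapper.

```cpp
#include <cassert>
#include <list>

// Toy stand-in: the real class carries bounds and a tombstone, not an int.
struct toy_range_tombstone {
    int timestamp;

    // Transitional getter: the object simply returns itself.
    toy_range_tombstone& tombstone() { return *this; }
};

// What the list is expected to carry later: a wrapper around the tombstone.
struct toy_range_tombstone_entry {
    toy_range_tombstone _tombstone;

    toy_range_tombstone& tombstone() { return _tombstone; }
};

// Written once against .tombstone(); compiles with either element type.
template <typename Element>
int newest_timestamp(std::list<Element>& elements) {
    int newest = 0;
    for (auto& element : elements) {
        if (element.tombstone().timestamp > newest) {
            newest = element.tombstone().timestamp;
        }
    }
    return newest;
}
```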

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Benny Halevy
4476800493 flat_mutation_reader: get rid of timeout parameter
Now that the timeout is taken from the reader_permit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
f25aabf1b2 flat_mutation_reader: maybe_timed_out: use permit timeout
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Benny Halevy
46fb7fe68e test: sstable_datafile_test: add sstable_reader_with_timeout
Verify that the sstable reader (for the highest supported version)
times out properly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Michael Livshin
37c9f8f137 tests: get rid of sstable::make_reader_v1() in the trivial cases
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-09 19:20:48 +03:00
Michael Livshin
f07306d75c sstables: make sstable::make_reader() return flat_mutation_reader_v2
Rename the old version to `sstables::make_reader_v1()`, to have a
nicely searchable eradication target.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-09 19:20:48 +03:00
Raphael S. Carvalho
a869d61c89 tests: Move compaction-related tests into its own unit
With commit 1924e8d2b6, compaction code was moved into a
top level dir as compaction is layered on top of sstables.
Let's continue this work by moving all compaction unit tests
into its own test file. This also makes things much more
organized.

sstable_datafile_test, as its name implies, will only contain
sstable data tests. Perhaps it should be renamed to only
sstable_data_test, as the test also contains tests involving
other components, not only the data one.

BEFORE
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
105

AFTER
$ cat test/boost/sstable_compaction_test.cc | grep TEST_CASE | wc -l
57
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
48

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210802192120.148583-1-raphaelsc@scylladb.com>
2021-08-02 22:26:26 +03:00
Avi Kivity
42e1f318d7 Merge "Respect "bypass cache" in sstable index caching" from Tomasz
"
This series changes the behavior of the system when executing reads
annotated with "bypass cache" clause in CQL. Such reads will not
use nor populate the sstable partition index cache and sstable index page cache.
"

* 'bypass-cache-in-sstable-index-reads' of github.com:tgrabiec/scylla:
  sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads
  sstables: Do not populate partition index cache for "bypass cache" reads
2021-07-28 18:45:39 +03:00
Pavel Emelyanov
05b8cdfd24 mutation_partition: Return immutable collection for rows
Patch the .clustered_rows() method to return the btree of rows
wrapped into the immutable_collection<> so that callers are
guaranteed not to touch the collection itself, but still can
modify the elements in it.
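
The idea can be sketched with a small wrapper. This is not the real
immutable_collection<>; toy_immutable_collection and toy_partition are
hypothetical, and a std::list<int> stands in for the btree of rows.

```cpp
#include <cassert>
#include <list>

// Exposes iteration over a collection without exposing the collection itself.
template <typename Collection>
class toy_immutable_collection {
    Collection& _col;
public:
    explicit toy_immutable_collection(Collection& col) : _col(col) {}

    // Iterators are non-const, so elements can still be modified in place...
    auto begin() const { return _col.begin(); }
    auto end() const { return _col.end(); }
    auto size() const { return _col.size(); }
    bool empty() const { return _col.empty(); }

    // ...but insert(), erase(), clear() and friends are deliberately not forwarded.
};

struct toy_partition {
    std::list<int> _rows;

    toy_immutable_collection<std::list<int>> clustered_rows() {
        return toy_immutable_collection<std::list<int>>(_rows);
    }
};
```

A caller can write `for (int& row : p.clustered_rows()) { row *= 10; }`,
while `p.clustered_rows().clear()` does not compile.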

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Tomasz Grabiec
f4227c303b sstables: Do not populate partition index cache for "bypass cache" reads
Index cursor for reads which bypass cache will use a private temporary
instance of the partition index cache.

Promoted index scanner (ka/la format) will not go through the page cache.
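
A rough sketch of the private-cache idea for the index cursor. The names
(toy_index_cache, toy_bypass_cursor) are hypothetical and a std::map
stands in for the partition index cache; the real cursor and cache are
considerably more involved.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>

using toy_index_cache = std::map<int, std::string>;

class toy_bypass_cursor {
    std::shared_ptr<toy_index_cache> _cache;
public:
    // With bypass_cache set, the cursor works on its own temporary cache,
    // so the shared one is neither consulted nor populated.
    toy_bypass_cursor(std::shared_ptr<toy_index_cache> shared, bool bypass_cache)
        : _cache(bypass_cache ? std::make_shared<toy_index_cache>() : std::move(shared)) {}

    // Returns the page, loading it into whichever cache this cursor uses.
    const std::string& page(int n) {
        auto it = _cache->find(n);
        if (it == _cache->end()) {
            it = _cache->emplace(n, "page " + std::to_string(n)).first;
        }
        return it->second;
    }
};
```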
2021-07-15 12:13:20 +02:00
Botond Dénes
5c39a2921e test/boost/sstable_datafile_test: add test for validation compaction
Add two new unit tests, one which covers the whole stack, and one which
stresses the validation part.
2021-07-12 10:25:15 +03:00
Botond Dénes
8296759cce test/boost/sstable_datafile_test: scrub tests: extract corrupt sst writer code into function
So the two tests having this almost identical code can share it, and so
that it can be used by new tests.
2021-07-12 10:25:15 +03:00
Avi Kivity
9059514335 build, treewide: enable -Wpessimizing-move warning
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.

Fix the handful of cases (all trivial), and enable the warning.
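
For illustration, a small example of the pattern the warning flags.
move_counter and the two factory functions are made up for this sketch;
the pragma only silences the warning so that the bad variant can be shown
side by side with the good one.

```cpp
#include <cassert>
#include <utility>

// Counts how many times the move constructor runs.
struct move_counter {
    static int moves;
    move_counter() = default;
    move_counter(const move_counter&) = default;
    move_counter(move_counter&&) noexcept { ++moves; }
};
int move_counter::moves = 0;

// The temporary is constructed directly in the caller's storage.
move_counter make_elided() {
    return move_counter{};
}

#pragma GCC diagnostic push
#pragma GCC diagnostic ignored "-Wpessimizing-move"
// std::move() turns the temporary into an xvalue, which blocks the elision
// and forces a call to the move constructor.
move_counter make_pessimized() {
    return std::move(move_counter{});
}
#pragma GCC diagnostic pop
```

Calling make_elided() leaves move_counter::moves untouched, while
make_pessimized() increments it.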

Closes #8992
2021-07-08 17:52:34 +03:00
Botond Dénes
2d2b9e7b36 test/boost: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
5fff314739 test/lib/simple_schema: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
d520655730 test/lib/sstable_utils: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
3679418e62 test/lib/test_services: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Raphael S. Carvalho
1924e8d2b6 treewide: Move compaction code into a new top-level compaction dir
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.

Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
2021-07-07 23:21:51 +03:00
Tomasz Grabiec
2b673478aa sstables: index_reader: Do not expose index_entry references
index_entry will be an LSA-managed object. Those have to be accessed
with care, with the LSA region locked.

This patch hides most of direct index_entry accesses inside the
index_reader so that users are safe.
2021-07-02 19:02:13 +02:00
Tomasz Grabiec
f537d1a7e5 tests: sstables: Do not call open_data() twice
make_sstable_containing() already calls open_data(), so does
load(). This will trigger assertion failure added in a later patch:

   assert(!_cached_index_file);

There is no need to call load() here.
2021-07-02 10:25:58 +02:00
Piotr Jastrzebski
430fd5cfa9 sstables: move sstable_writer to separate header
This class is used in only a few places and does not have to be included
everywhere the sstable class is needed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:31 +02:00
Piotr Jastrzebski
2d6608bb88 sstables: stop including metadata_collector.hh in sstables.hh
metadata collector is rarely used so it's better to include it only
in those few places.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:31 +02:00
Piotr Jastrzebski
314bc0e8a5 sstable_datafile_test: switch tests to use latest sstables format
instead of LA. Ability to write LA and KA sstables will be removed
by the following patches so we need to switch all the tests to write
newer sstables.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:30 +02:00
Piotr Jastrzebski
f03ed9b9a7 sstable_datafile_test: switch compaction_with_fully_expired_table to latest sstable version
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:30 +02:00
Piotr Jastrzebski
1ed298b08b test_offstrategy_sstable_compaction: test all writable sstables
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-27 15:12:12 +02:00
Piotr Jastrzebski
995eb8c274 compaction_with_fully_expired_table: Remove some LA specific code
Following patches will switch all sstable writing tests to use
the latest sstables format. compaction_with_fully_expired_table
contains some tests for LA-specific behaviour, so let's remove them
to make the switch possible.

For more context see https://github.com/scylladb/scylla/issues/2620

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Piotr Jastrzebski
131a0babc0 sstable_datafile_test: Fix schema used by check_compacted_sstables
check_compacted_sstables is used in compact_02 test which uses sstables
created by compact_sstables. The problem is that schema used in
check_compacted_sstables and compact_sstables is not the same.
The type of r1 column is different. This was not a problem when the
test was running on LA sstables but following patches will switch
all the tests to use MC and then sstable schema becomes validated
when reading the sstable and the test will fail such validation.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Piotr Jastrzebski
680e341f54 sstables: Remove LA/KA sstable writting tests that check exact format
Those tests check that created sstables have exactly the expected bytes
inside. This won't work with other sstable formats, and writing LA/KA
sstables will be removed by the following patches, so there's nothing
we can do with those tests but remove them. Otherwise they will
fail after the LA/KA writing capability is removed.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Piotr Jastrzebski
2bd6ad1e2f sstables: define writable_sstable_versions
and use it instead of all_sstable_versions in tests that check
writing of sstables. Following patches remove the LA/KA writer, so we
want tests to be ready for that and not break by trying to write LA/KA
sstables.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-06-25 10:12:00 +02:00
Tomasz Grabiec
a4275cf8bc sstables: Switch the mx reader to flat_mutation_reader_v2
The main difficulty was in making sure that emitted range tombstone
changes reflect range tombstones trimmed to clustering restrictions.
This is handled by mutation_fragment_filter and
clustering_ranges_walker. They return a list of range_tombstone_change
fragments to emit for each hop as the reader walks over the clustering
domain.

Tests which were using a normalizing reader expected range tombstones
to be split around rows. Drop this and adjust the tests accordingly. No
reader splits range tombstones around rows now.
2021-06-16 00:23:49 +02:00
Raphael S. Carvalho
846f0bd16e sstables: Fix incremental selection with compound sstable set
Incremental selection may not work properly for LCS and ICS due to a
use-after-free bug in partitioned set which came into existence after
compound set was introduced.

The use-after-free happens because partitioned set wasn't taking into
account that the next position can become the current position in the
next iteration, which will be used by all selectors managed by
compound set. So if the next position is freed while it is being used
as the current position, subsequent selectors would find the current
position freed, making them produce incorrect results.

Fix this by moving ownership of next pos from incremental_selector_impl
to incremental_selector, which makes it more robust as the latter knows
better when the selection is done with the next pos. incremental_selector
will still return ring_position_view to avoid copies.

Fixes #8802.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>
2021-06-13 16:45:07 +03:00
Tomasz Grabiec
419ee84d86 Merge "sstable: validate first and last keys ordering" from Benny
In #8772, an assert validating first token <= last token
failed in leveled_manifest::overlapping.

It is unclear how we got to that state, so add validation
in sstable::set_first_and_last_keys() that the to-be-set
first and last keys are well ordered.
Otherwise, throw malformed_sstable_exception.
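
A minimal sketch of such a check. The toy_ types are hypothetical and
plain integers stand in for the decorated keys; only the shape of the
validation (reject first > last with an exception) follows the
description above.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>

// Stand-in for malformed_sstable_exception.
struct toy_malformed_sstable_exception : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct toy_sstable {
    std::int64_t first_key = 0;
    std::int64_t last_key = 0;

    // Refuses to record a key range whose ends are out of order.
    void set_first_and_last_keys(std::int64_t first, std::int64_t last) {
        if (first > last) {
            throw toy_malformed_sstable_exception(
                "first key " + std::to_string(first) +
                " is greater than last key " + std::to_string(last));
        }
        first_key = first;
        last_key = last;
    }
};
```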

set_first_and_last_keys is called both on the write path
from the sstable writer before the sstable is sealed,
and on the open/load path via update_info_for_opened_data().

This series also fixes issues with unit tests with
regards to first/last keys so they won't fail the
validation.

Refs #8772

Test: unit(dev)
DTest: next-gating(dev), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)

* tag 'validate-first-and-last-keys-ordering-v1':
  sstable: validate first and last keys ordering
  test: lib: reusable_sst: save unexpected errors
  test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
  test: sstable_test: define primary key in schema for compressed sstable
2021-06-09 14:43:02 +02:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Benny Halevy
9452b99b40 test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
Currently the test is using "first_key", "last_key"
literals for the first and last keys and expects them
to sort properly with the murmur3 partitioner.
Also it does that for all generated sstables
which is less interesting for reshape.

Use token_generation_for_current_shard to
generate random, properly ordered keys.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-06-02 12:25:29 +03:00
Raphael S. Carvalho
a7cdd846da compaction: Prevent tons of compaction of fully expired sstable from happening in parallel
Compaction manager can start tons of compaction of fully expired sstable in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.

This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.

With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can be easily worse depending on the amount of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.

Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.
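
The effect can be shown with a deliberately simplified model. It assumes
every job uses the same weight W, has nothing to compact, and spends all
its time updating table state; peak_parallelism is a hypothetical helper,
not part of the compaction manager.

```cpp
#include <algorithm>
#include <cassert>

// Returns the peak number of compactions in flight when `jobs` compactions of
// fully expired sstables, all with the same weight, are submitted back to back.
int peak_parallelism(int jobs, bool release_weight_early) {
    assert(jobs >= 0);
    bool weight_taken = false;
    int in_flight = 0;
    int peak = 0;
    for (int job = 0; job < jobs; ++job) {
        if (weight_taken) {
            // The weight is still held, so the previous job has to complete
            // (including its table state update) before this one may start.
            --in_flight;
            weight_taken = false;
        }
        weight_taken = true;
        ++in_flight;
        peak = std::max(peak, in_flight);
        if (release_weight_early) {
            // Nothing to compact: the weight is dropped right away while the
            // job is still busy updating table state.
            weight_taken = false;
        }
    }
    return peak;
}
```

In this model, releasing the weight early lets all 50 jobs pile up, while
holding it until completion keeps the peak at 1.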

Fixes #8710.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>

[avi: drop now unneeded storage_service_for_tests]
2021-05-30 23:22:51 +03:00
Pavel Emelyanov
d2442a1bb3 tests: Ditch storage_service_for_tests
The purpose of the class in question is to start sharded storage
service to make its global instance alive. I don't know when exactly
it happened but no code that instantiates this wrapper really needs
the global storage service.

Ref: #2795
tests: unit(dev), perf_sstable(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210526170454.15795-1-xemul@scylladb.com>
2021-05-27 14:39:13 +03:00
Raphael S. Carvalho
ee39eb9042 sstables: Fix slow off-strategy compaction on STCS tables
Off-strategy compaction on a table using STCS is slow because of
the needless write amplification of 2. That's because STCS reshape
isn't taking advantage of the fact that sstables produced by
a repair-based operation are disjoint. So the ~256 input sstables
were compacted (in batches of 32) into larger sstables, which in
turn were compacted into even larger ones. That write amp is very
significant on large data sets, making the whole operation 2x
slower.
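
The arithmetic behind the write amplification of 2, as a small sketch.
rewrite_passes is a hypothetical helper that assumes inputs are merged in
full batches until a single sstable remains, which is a simplification of
what reshape does.

```cpp
#include <cassert>
#include <cstddef>

// Number of times each byte gets rewritten when `inputs` sstables are merged
// in batches of at most `fan_in` until only one sstable is left.
std::size_t rewrite_passes(std::size_t inputs, std::size_t fan_in) {
    assert(fan_in >= 2);
    std::size_t passes = 0;
    while (inputs > 1) {
        inputs = (inputs + fan_in - 1) / fan_in;  // each batch yields one output
        ++passes;
    }
    return passes;
}
```

With 256 inputs and batches of 32 this gives two passes (256 -> 8 -> 1).
If the disjoint inputs can all be handled in a single step, it is one.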

Fixes #8449.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>
2021-05-25 11:24:42 +03:00