Commit Graph

174 Commits

Author SHA1 Message Date
Pavel Emelyanov
1d91914166 sstables: Drop set_generation() method
The method became unused since 70e5252a (table: no longer accept online
loading of SSTable files in the main directory) and the whole concept of
reshuffling sstables was dropped later by 7351db7c (Reshape upload files
and reshard+reshape at boot).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12165
2022-12-01 22:17:10 +02:00
Pavel Emelyanov
bdc47b7717 sstables: Move deletion log manipulations to sstable_directory.cc
The deletion log concept uses the fact that files are on a POSIX
filesystem. Support for another storage type will have to reimplement
this place, so keep the FS-specific code in _directory.cc file.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-11-21 13:16:21 +03:00
Michał Chojnowski
3e0c7a6e9f test: sstable_datafile_test: eliminate a use of std::regex to prevent stack overflow
This usage of std::regex overflows the seastar::thread stack size (128 KiB),
causing memory corruption. Fix that.

Closes #11911
2022-11-08 14:41:34 +02:00
Botond Dénes
4c13328788 Merge 'Return all sstables in table::get_sstable_set()' from Raphael "Raph" Carvalho
This fixes a regression introduced by 1e7a444, where table::get_sstable_set() isn't exposing all sstables, but rather only the ones in the main set. That causes user of the interface, such as get_sstables_by_partition_key() (used by API to return sstable name list which contains a particular key), to miss files in the maintenance set.

Fixes https://github.com/scylladb/scylladb/issues/11681.

Closes #11682

* github.com:scylladb/scylladb:
  replica: Return all sstables in table::get_sstable_set()
  sstables: Fix cloning of compound_sstable_set
2022-10-05 06:55:50 +03:00
Raphael S. Carvalho
eddf32b94c sstables: Fix cloning of compound_sstable_set
The intention was that its clone() would actually clone the content
of an existing set into a new one, but the current impl is actually
moving the sets instead of copying them. So the original set
becomes invalid. Luckily, this problem isn't triggered as we're
not exposing the compound set in the table's interface, so the
compound_sstable_set::clone() method isn't being called.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-10-04 10:43:25 -03:00
Botond Dénes
169a8a66f2 compatible_ring_position_or_view: make it cheap to copy
This class exists for one purpose only: to serve as glue code between
dht::ring_position and boost::icl::interval_map. The latter requires
that keys in its intervals are:
* default constructible
* copyable
* have standalone compare operations

For this reason we have to wrap `dht::ring_position` in a class,
together with a schema to provide all this. This is
`compatible_ring_position`. There is one further requirement by code
using the interval map: it wants to do lookups without copying the
lookup key(s). To solve this, we came up with
`compatible_ring_position_or_view` which is a union of a key or a key
view + schema. As we recently found out, boost::icl copies its keys **a
lot**. It seems to assume these keys are cheap to copy and carelessly
copies them around even when iterating over the map. But
`compatible_ring_position_or_view` is not cheap to copy as it copies a
`dht::ring_position` which allocates, and it does that via an
`std::optional` and `std::variant` to add insult to injury.
This patch make said class cheap to copy, by getting rid of the variant
and storing the `dht::ring_position` via a shared pointer. The view is
stored separately and either points to the ring position stored in the
shared pointer or to an outside ring position (for lookups).

Fixes: #11669

Closes #11670
2022-10-04 12:00:21 +03:00
Raphael S. Carvalho
4bc24acf81 sstable: Extend sstable_run to allow disjointness on the clustering level
After commit 0796b8c97a, sstable_run won't accept a fragment
that introduces key overlapping. But once we split large partitions,
fragments in the same run may store disjoint clustering ranges
of the same partition. So we're extending sstable_run to look
at clustering dimension, so fragments storing disjoint clustering
ranges of the same large partition can co-exist in the same run.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
13942ec947 test: move sstable_run_disjoint_invariant_test into sstable_datafile_test
That's where it belongs.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
5937765009 sstables: Keep track of first partition's first pos and last partition's last pos
With first partition's first position and last partition's last
partition, we'll be able to determine which fragments composing a
sstable run store a large partition that was split.

Then sstable run will be able to detect if all fragments storing
a given large partition are disjoint in the clustering level.

Fixes #10637.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:51 -03:00
Raphael S. Carvalho
9bcad9ffa8 sstables: Add ability to find first or last position in a partition
This new method allows sstable to load the first row of the first
partition and last row of last partition.

That's useful for incremental reading of sstable run which will
be split at clustering boundary.

To get the first row, it consumes the first row (which can be
either a clustering row or range tombstone change) and returns
its position_in_partition.
To get the last row, it does the same as above but in reverse
mode instead.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:09:48 -03:00
Botond Dénes
be9d1c4df4 sstables: crawling mx-reader: make on_out_of_clustering_range() no-op
Said method currently emits a partition-end. This method is only called
when the last fragment in the stream is a range tombstone change with a
position after all clustered rows. The problem is that
consume_partition_end() is also called unconditionally, resulting in two
partition-end fragments being emitted. The fix is simple: make this
method a no-op, there is nothing to do there.

Also add two tests: one targeted to this bug and another one testing the
crawling reader with random mutations generated for random schema.

Fixes: #11421

Closes #11422
2022-09-04 20:02:50 +03:00
Benny Halevy
7747b8fa33 sstables: define run_identifier as a strong tagged_uuid type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11321
2022-08-18 19:03:10 +03:00
Benny Halevy
813cffc2b5 counters: counter_id: use base class create_random_id
Rather than defining generate_random,
and use respectively in unit tests.
(It was inherited from raft::internal::tagged_id.)

This allows us to shorten counter_id's definition
to just using utils::tagged_uuid<struct counter_id_tag>.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-08-08 08:02:27 +03:00
Michael Livshin
ab13127761 sstables: use generation_type more soundly
`generation_type` is (supposed to be) conceptually different from
`int64_t` (even if physically they are the same), but at present
Scylla code still largely treats them interchangeably.

In addition to using `generation_type` in more places, we
provide (no-op) `generation_value()` and `generation_from_value()`
operations to make the smoke-and-mirrors more believable.

The churn is considerable, but all mechanical.  To avoid even
more (way, way more) churn, unit test code is left untreated for
now, except where it uses the affected core APIs directly.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-06-20 19:37:31 +03:00
Michael Livshin
d137b32994 tests: ms.make_reader() -> ms.make_reader_v2()
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Benny Halevy
6677028212 sstables: mx/writer: auto-scale promoted index
Add column_index_auto_scale_threshold_in_kb to the configuration
(defaults to 10MB).

When the promoted index (serialized) size gets to this
threshold, it's halved by merging each two adjacent blocks
into one and doubling the desired_block_size.

Fixes #4217

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-24 13:32:35 +03:00
Michael Livshin
1e690e6773 tests: remove obsolete utility functions
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-10 22:10:40 +03:00
Michael Livshin
864882253a tests: less trivial flat_reader_assertions{,_v2} conversions
Dealing with the handful of tests that check range tombstones in
interesting ways and need more than search-and-replace.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-10 22:10:40 +03:00
Michael Livshin
3cc2343775 tests: trivial flat_reader_assertions{,_v2} conversions
(Which entails temporary cut-and-pasting some utility functions)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-10 22:10:40 +03:00
Botond Dénes
e8f3d7dd13 sstables/index_reader: short-circuit fast-forward-to when at EOF
Attempting to call advance_to() on the index, after it is positioned at
EOF, can result in an assert failure, because the operation results in
an attempt to move backwards in the index-file (to read the last index
page, which was already read). This only happens if the index cache
entry belonging to the last index page is evicted, otherwise the advance
operation just looks-up said entry and returns it.
To prevent this, we add an early return conditioned on eof() to all the
partition-level advance-to methods.
A regression unit test reproducing the above described crash is also
added.
2022-05-05 14:42:37 +03:00
Raphael S. Carvalho
791403e4bb sstables: Fix deletion of partial SSTables
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
temporary TOC content is written upfront, to allow us from deleting
all partial components using the former content if write fails.

After commit e5fc4b6, partial sst cannot be deleted because deletion
procedure is incorrectly assuming all SSTs being deleted unconditionally
have TOC, but partial SSTs only have TMP TOC instead.
That happens because parent_path() requires all path components to
exist due to its usage of fs::path::canonical.

The consequence of this is that space of partial files cannot be
reclaimed, making it worse for Scylla to recover from ENOSPC,
which could happen by selecting a set of files for compaction with
higher chance of suceeeding given the free space.

This is fixed by only calling parent_path() on TMP TOC, which is
guaranteed to exist prior to calling fsync_directory().

Fixes #10410.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-04-26 11:00:27 -03:00
Avi Kivity
975b0c0b03 Merge "tools/scylla-sstable: add validate-checksums and decompress" from Botond
"
This patchset adds two new operations to scylla-sstable:
* validate-checksums - helps identifying whether an sstable is intact or
  not, but checking the digest and the per-chunk checksums against the
  data on disk.
* decompress - helps when one wants to manually examine the content of a
  compressed sstable.

Refs: #497

Tests: unit(dev)
"

* 'scylla-sstable-validate-checksums-decompress/v3' of https://github.com/denesb/scylla:
  tools/scylla-sstable: consume_sstables(): s/no_skips/use_crawling_reader/
  tools/scylla-sstable: add decompress operation
  tools/scylla-sstables: add validate-checksums operation
  sstables/sstable: add validate_checksums()
  sstables/sstable: add raw_stream option to data_stream()
  sstables/sstable: make data_stream() and data_read() public
  utils/exceptions: add maybe_rethrow_exception()
2022-03-16 18:56:48 +02:00
Botond Dénes
ddf9dee9d8 sstables/sstable: add validate_checksums()
Sstables have two kind of checksums: per-chunk checksums and
full-checksum (digest) calculated over the entire content of Data.db.

The full-checksum (digest) is stored in Digest.crc
(component_type::Digest).

When compression is used, the per-chunk checksum is stored directly
inside Data.db, after each compressed chunk. These are validated on
read, when decompressing the respective chunks.
When no compression is used, the per-chunk checksum is stored separately
in CRC.db (component_type::CRC). Chunk size is defined and stored in said
component as well.

In both compressed and uncompressed sstables, checksums are calculated
on the data that is actually written to disk, so in case of compressed
data, on the compressed data.

This method validates both the full checksum and the per-chunk checksum
for the entire Data.db.
2022-03-15 14:52:15 +02:00
Mikołaj Sielużycki
1d84a254c0 flat_mutation_reader: Split readers by file and remove unnecessary includes.
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.

With changes

real	29m14.051s
user	168m39.071s
sys	5m13.443s

Without changes

real	30m36.203s
user	175m43.354s
sys	5m26.376s

Closes #10194
2022-03-14 13:20:25 +02:00
Botond Dénes
959483a2dc test: migrate to the v2 variant of the sstable writer API 2022-03-10 09:16:33 +02:00
Botond Dénes
105bf8888a sstables: convert mx writer to v2
The sstables::sstable class has two methods for writing sstables:
1) sstable_writer get_writer(...);
2) future<> write_components(flat_mutation_reader, ...);

(1) directly exposes the writer type, so we have to update all users of
it (there is not that many) in this same patch. We defer updating
users of (2) to a follow-up commits.
2022-03-10 07:03:49 +02:00
Avi Kivity
cbba80914d memtable: move to replica module and namespace
Memtables are a replica-side entity, and so are moved to the
replica module and namespace.

Memtables are also used outside the replica, in two places:
 - in some virtual tables; this is also in some way inside the replica,
   (virtual readers are installed at the replica level, not the
   cooordinator), so I don't consider it a layering violation
 - in many sstable unit tests, as a convenient way to create sstables
   with known input. This is a layering violation.

We could make memtables their own module, but I think this is wrong.
Memtables are deeply tied into replica memory management, and trying
to make them a low-level primitive (at a lower level than sstables) will
be difficult. Not least because memtables use sstables. Instead, we
should have a memtable-like thing that doesn't support merging and
doesn't have all other funky memtable stuff, and instead replace
the uses of memtables in sstable tests with some kind of
make_flat_mutation_reader_from_unsorted_mutations() that does
the sorting that is the reason for the use of memtables in tests (and
live with the layering violation meanwhile).

Test: unit (dev)

Closes #10120
2022-02-23 09:05:16 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Botond Dénes
64bb48855c flat_mutation_reader: revamp flat_mutation_reader_from_mutations()
Add schema parameter so that:
* Caller has better control over schema -- especially relevant for
  reverse reads where it is not possible to follow the convention of
  passing the query schema which is reversed compared to that of the
  mutations.
* Now that we don't depend on the mutations for the schema, we can lift
  the restriction on mutations not being empty: this leads to safer
  code. When the mutations parameter is empty, an empty reader is
  created.
Add "make_" prefix to follow convention of similar reader factory
functions.

Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211115155614.363663-1-bdenes@scylladb.com>
2021-11-15 17:58:46 +02:00
Pavel Emelyanov
a2590368ce test: Generalize all make_sstable_easy()-s
There are already four of them. Those working with the mutation reader
can be folded into one with some default args.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Pavel Emelyanov
28e5307ce2 test: Reuse make_sstable_easy in datafile tests
This patch is two-fold. First it changes the signature of the
local helper to facilitate next patching. Second, it makes more
relevant places in the test use this helper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-20 15:44:14 +03:00
Pavel Emelyanov
5515f7187d range_tombstone, code: Add range_tombstone& getters
Currently all the code operates on the range_tombstone class.
and many of those places get the range tombstone in question
from the range_tombstone_list. Next patches will make that list
carry (and return) some new object called range_tombstone_entry,
so all the code that expects to see the former one there will
need to patched to get the range_tombstone from the _entry one.

This patch prepares the ground for that by introdusing the

    range_tombstone& tombstone() { return *this; }

getter on the range_tombstone itself and patching all future
users of the _entry to call .tombstone() right now.

Next patch will remove those getters together with adding the new
range_tombstone_entry object thus automatically converting all
the patched places into using the entry in a proper way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Benny Halevy
4476800493 flat_mutation_reader: get rid of timeout parameter
Now that the timeout is taken from the reader_permit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
f25aabf1b2 flat_mutation_reader: maybe_timed_out: use permit timeout
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Benny Halevy
46fb7fe68e test: sstable_datafile_test: add sstable_reader_with_timeout
Verify that the sstable reader (for the highest supported version)
times out properly.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 14:29:44 +03:00
Michael Livshin
37c9f8f137 tests: get rid of sstable::make_reader_v1() in the trivial cases
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-09 19:20:48 +03:00
Michael Livshin
f07306d75c sstables: make sstable::make_reader() return flat_mutation_reader_v2
Rename the old version to `sstables::make_reader_v1()`, to have a
nicely searcheable eradication target.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2021-08-09 19:20:48 +03:00
Raphael S. Carvalho
a869d61c89 tests: Move compaction-related tests into its own unit
With commit 1924e8d2b6, compaction code was moved into a
top level dir as compaction is layered on top of sstables.
Let's continue this work by moving all compaction unit tests
into its own test file. This also makes things much more
organized.

sstable_datafile_test, as its name implies, will only contain
sstable data tests. Perhaps it should be renamed to only
sstable_data_test, as the test also contains tests involving
other components, not only the data one.

BEFORE
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
105

AFTER
$ cat test/boost/sstable_compaction_test.cc | grep TEST_CASE | wc -l
57
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
48

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210802192120.148583-1-raphaelsc@scylladb.com>
2021-08-02 22:26:26 +03:00
Avi Kivity
42e1f318d7 Merge "Respect "bypass cache" in sstable index caching" from Tomasz
"
This series changes the behavior of the system when executing reads
annotated with "bypass cache" clause in CQL. Such reads will not
use nor populate the sstable partition index cache and sstable index page cache.
"

* 'bypass-cache-in-sstable-index-reads' of github.com:tgrabiec/scylla:
  sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads
  sstables: Do not populate partition index cache for "bypass cache" reads
2021-07-28 18:45:39 +03:00
Pavel Emelyanov
05b8cdfd24 mutation_partition: Return immutable collection for rows
Patch the .clustered_rows() method to return the btree of rows
wrapped into the immutable_collection<> so that callers are
guaranteed not to touch the collection itself, but still can
modify the elements in it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Tomasz Grabiec
f4227c303b sstables: Do not populate partition index cache for "bypass cache" reads
Index cursor for reads which bypass cache will use a private temporary
instance of the partition index cache.

Promoted index scanner (ka/la format) will not go through the page cache.
2021-07-15 12:13:20 +02:00
Botond Dénes
5c39a2921e test/boost/sstable_datafile_test: add test for validation compaction
Add a two new unit tests, one which cover the whole stack, and one which
stresses the validation part.
2021-07-12 10:25:15 +03:00
Botond Dénes
8296759cce test/boost/sstable_datafile_test: scrub tests: extract corrupt sst writer code into function
So the two tests having this almost identical code can shared it, and so
that it can be used by new tests.
2021-07-12 10:25:15 +03:00
Avi Kivity
9059514335 build, treewide: enable -Wpessimizing-move warning
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.

Fix the handful of cases (all trivial), and enable the warning.

Closes #8992
2021-07-08 17:52:34 +03:00
Botond Dénes
2d2b9e7b36 test/boost: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00
Botond Dénes
5fff314739 test/lib/simple_schema: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
d520655730 test/lib/sstable_utils: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Botond Dénes
3679418e62 test/lib/test_services: migrate off the global test reader semaphore 2021-07-08 15:28:39 +03:00
Raphael S. Carvalho
1924e8d2b6 treewide: Move compaction code into a new top-level compaction dir
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.

Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
2021-07-07 23:21:51 +03:00