The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.
As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
Add schema parameter so that:
* Caller has better control over schema -- especially relevant for
reverse reads where it is not possible to follow the convention of
passing the query schema which is reversed compared to that of the
mutations.
* Now that we don't depend on the mutations for the schema, we can lift
the restriction on mutations not being empty: this leads to safer
code. When the mutations parameter is empty, an empty reader is
created.
Add "make_" prefix to follow convention of similar reader factory
functions.
Tests: unit(dev)
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211115155614.363663-1-bdenes@scylladb.com>
There are already four of them. Those working with the mutation reader
can be folded into one with some default args.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch is two-fold. First it changes the signature of the
local helper to facilitate next patching. Second, it makes more
relevant places in the test use this helper.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently all the code operates on the range_tombstone class.
and many of those places get the range tombstone in question
from the range_tombstone_list. Next patches will make that list
carry (and return) some new object called range_tombstone_entry,
so all the code that expects to see the former one there will
need to patched to get the range_tombstone from the _entry one.
This patch prepares the ground for that by introdusing the
range_tombstone& tombstone() { return *this; }
getter on the range_tombstone itself and patching all future
users of the _entry to call .tombstone() right now.
Next patch will remove those getters together with adding the new
range_tombstone_entry object thus automatically converting all
the patched places into using the entry in a proper way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rename the old version to `sstables::make_reader_v1()`, to have a
nicely searcheable eradication target.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
With commit 1924e8d2b6, compaction code was moved into a
top level dir as compaction is layered on top of sstables.
Let's continue this work by moving all compaction unit tests
into its own test file. This also makes things much more
organized.
sstable_datafile_test, as its name implies, will only contain
sstable data tests. Perhaps it should be renamed to only
sstable_data_test, as the test also contains tests involving
other components, not only the data one.
BEFORE
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
105
AFTER
$ cat test/boost/sstable_compaction_test.cc | grep TEST_CASE | wc -l
57
$ cat test/boost/sstable_datafile_test.cc | grep TEST_CASE | wc -l
48
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210802192120.148583-1-raphaelsc@scylladb.com>
"
This series changes the behavior of the system when executing reads
annotated with "bypass cache" clause in CQL. Such reads will not
use nor populate the sstable partition index cache and sstable index page cache.
"
* 'bypass-cache-in-sstable-index-reads' of github.com:tgrabiec/scylla:
sstables: Do not populate page cache when searching in promoted index for "bypass cache" reads
sstables: Do not populate partition index cache for "bypass cache" reads
Patch the .clustered_rows() method to return the btree of rows
wrapped into the immutable_collection<> so that callers are
guaranteed not to touch the collection itself, but still can
modify the elements in it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Index cursor for reads which bypass cache will use a private temporary
instance of the partition index cache.
Promoted index scanner (ka/la format) will not go through the page cache.
This warning prevents using std::move() where it can hurt
- on an unnamed temporary or a named automatic variable being
returned from a function. In both cases the value could be
constructed directly in its final destination, but std::move()
prevents it.
Fix the handful of cases (all trivial), and enable the warning.
Closes#8992
Since compaction is layered on top of sstables, let's move all compaction code
into a new top-level directory.
This change will give me extra motivation to remove all layer violations, like
sstable calling compaction-specific code, and compaction entanglement with
other components like table and storage service.
Next steps:
- remove all layer violations
- move compaction code in sstables namespace into a new one for compaction.
- move compaction unit tests into its own file
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210707194058.87060-1-raphaelsc@scylladb.com>
index_entry will be an LSA-managed object. Those have to be accessed
with care, with the LSA region locked.
This patch hides most of direct index_entry accesses inside the
index_reader so that users are safe.
make_sstable_containing() already calls open_data(), so does
load(). This will trigger assertion failure added in a later patch:
assert(!_cached_index_file);
There is no need to call load() here.
This class is used in only few places and does not have to be included
everywhere sstable class is needed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
instead of LA. Ability to write LA and KA sstables will be removed
by the following patches so we need to switch all the tests to write
newer sstables.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Following patches will switch all sstable writing tests to use
the latest sstables format. compaction_with_fully_expired_table
contains some test for a LA specific behaviour so let's remove it
to make the switch possible.
For more context see https://github.com/scylladb/scylla/issues/2620
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
check_compacted_sstables is used in compact_02 test which uses sstables
created by compact_sstables. The problem is that schema used in
check_compacted_sstables and compact_sstables is not the same.
The type of r1 column is different. This was not a problem when the
test was running on LA sstables but following patches will switch
all the tests to use MC and then sstable schema becomes validated
when reading the sstable and the test will fail such validation.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Those tests check that created sstables have exactly the expected bytes
inside. This won't work with other sstable formats and writting LA/KA
sstables will be removed by the following patches so there's nothing
we can do with those tests but to remove them. Otherwise they will be
failing after LA/KA writting capability is removed.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
and use it instead of all_sstable_versions in tests that check
writting of sstables. Following patches remove LA/KA writer so we
want tests to be ready for that and not break by trying to write LA/KA
sstables.
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
The main difficulty was in making sure that emitted range tombstone
changes reflect range tombstones trimmed to clustering restrictions.
This is handled by mutation_fragment_filter and
clustering_ranges_walker. They return a list of range_tombstone_change
fragments to emit for each hop as the reader walks over the clustering
domain.
Tests which were using a normalizing reader expected range tombstones
to be split around rows. Drop this an adjust the tests accoridngly. No
reader splits range tombstones around rows now.
Incremental selection may not work properly for LCS and ICS due to an
use-after-free bug in partitioned set which came into existence after
compound set was introduced.
The use-after-free happens because partitioned set wasn't taking into
account that the next position can become the current position in the
next iteration, which will be used by all selectors managed by
compound set. So if next position is freed, when it were being used
as current position, subsequent selectors would find the current
position freed, making them produce incorrect results.
Fix this by moving ownership of next pos from incremental_selector_impl
to incremental_selector, which makes it more robust as the latter knows
better when the selection is done with the next pos. incremental_selector
will still return ring_position_view to avoid copies.
Fixes#8802.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210611130957.156712-1-raphaelsc@scylladb.com>
In #8772, an assert validating first token <= last token
failed in leveled_manifest::overlapping.
It is unclear how we got to that state, so add validation
in sstable::set_first_and_last_keys() that the to-be-set
first and last keys are well ordered.
Otherwise, throw malformed_sstable_exception.
set_first_and_last_keys is called both on the write path
from the sstable writer before the sstable is sealed,
and on the open/load path via update_info_for_opened_data().
This series also fixes issues with unit tests with
regards to first/last keys so they won't fail the
validation.
Refs #8772
Test: unit(dev)
DTest: next-gating(dev), materialized_views_test:TestMaterializedViews.interrupt_build_process_and_resharding_half_to_max_test(debug)
* tag 'validate-first-and-last-keys-ordering-v1':
sstable: validate first and last keys ordering
test: lib: reusable_sst: save unexpected errors
test: sstable_datafile_test: stcs_reshape_test: use token_generation_for_current_shard
test: sstable_test: define primary key in schema for compressed sstable
Currently the test is using "first_key", "last_key"
literals for the first and last keys and expects them
to sort properly with the murmur3 partitioner.
Also it does that for all generated sstables
which is less interesting for reshape.
Use token_generation_for_current_shard to
generate random, properly ordered keys.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Compaction manager can start tons of compaction of fully expired sstable in
parallel, which may consume a significant amount of resources.
This problem is caused by weight being released too early in compaction, after
data is all compacted but before table is called to update its state, like
replacing sstables and so on.
Fully expired sstables aren't actually compacted, so the following can happen:
- compaction 1 starts for expired sst A with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 2 starts for expired sst B with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 3 starts for expired sst C with weight W, but there's nothing to
be compacted, so weight W is released, then calls table to update state.
- compaction 1 is done updating table state, so it finally completes and
releases all the resources.
- compaction 2 is done updating table state, so it finally completes and
releases all the resources.
- compaction 3 is done updating table state, so it finally completes and
releases all the resources.
This happens because, with expired sstable, compaction will release weight
faster than it will update table state, as there's nothing to be compacted.
With my reproducer, it's very easy to reach 50 parallel compactions on a single
shard, but that number can be easily worse depending on the amount of sstables
with fully expired data, across all tables. This high parallelism can happen
only with a couple of tables, if there are many time windows with expired data,
as they can be compacted in parallel.
Prior to 55a8b6e3c9, weight was released earlier in compaction, before
last sstable was sealed, but right now, there's no need to release weight
earlier. Weight can be released in a much simpler way, after the compaction is
actually done. So such compactions will be serialized from now on.
Fixes#8710.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210527165443.165198-1-raphaelsc@scylladb.com>
[avi: drop now unneeded storage_service_for_tests]
The purpose of the class in question is to start sharded storage
service to make its global instance alive. I don't know when exactly
it happened but no code that instantiates this wrapper really needs
the global storage service.
Ref: #2795
tests: unit(dev), perf_sstable(dev)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210526170454.15795-1-xemul@scylladb.com>
Off-strategy compaction on a table using STCS is slow because of
the needless write amplification of 2. That's because STCS reshape
isn't taking advantage of the fact that sstables produced by
a repair-based operation are disjoint. So the ~256 input sstables
were compacted (in batches of 32) into larger sstables, which in
turn were compacted into even larger ones. That write amp is very
significant on large data sets, making the whole operation 2x
slower.
Fixes#8449.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210524213426.196407-1-raphaelsc@scylladb.com>
"
The current scrub compaction has a serious drawback, while it is
very effective at removing any corruptions it recognizes, it is very
heavy-handed in its way of repairing such corruptions: it simply drops
all data that is suspected to be corrupt. While this *is* the safest way
to cleanse data, it might not be the best way from the point of view of
a user who doesn't want to loose data, even at the risk of retaining
some business-logic level corruption. Mind you, no database-level scrub
can ever fully repair data from the business-logic point of view, they
can only do so on the database-level. So in certain cases it might be
desirable to have a less heavy-handed approach of cleansing the data,
that tries as hard as it can to not loose any data.
This series introduces a new scrub mode, with the goal of addressing
this use-case: when the user doesn't want to loose any data. The new
mode is called "segregate" and it works by segregating its input into
multiple outputs such that each output contains a valid stream. This
approach can fix any out-of-order data, be that on the partition or
fragment level. Out-of-order partitions are simply written into a
separate output. Out of order fragments are handled by injecting a
partition-end/partition-start pair right before them, so that they are
now in a separate (duplicate) partition, that will just be written into
a separate output, just like a regular out-of-order partition.
The reason this series is posted as an RFC is that although I consider
the code stable and tested, there are some questions related to the UX.
* First and foremost every scrub that does more than just discard data
that is suspected to be corrupt (but even these a certain degree) have
to consider the possibility that they are rehabilitating corruptions,
leaving them in the system without a warning, in the sense that the
user won't see any more problems due to low-level corruptions and
hence might think everything is alright, while data is still corrupt
from the business logic point of view. It is very hard to draw a line
between what should and shouldn't scrub do, yet there is a demand from
users for scrub that can restore data without loosing any of it. Note
that anybody executing such a scrub is already in a bad shape, even if
they can read their data (they often can't) it is already corrupt,
scrub is not making anything worse here.
* This series converts the previous `skip_corrupted` boolean into an
enum, which now selects the scrub mode. This means that
`skip_corrupted` cannot be combined with segregate to throw out what
the former can't fix. This was chosen for simplicity, a bunch of
flags, all interacting with each other is very hard to see through in
my opinion, a linear mode selector is much more so.
* The new segregate mode goes all-in, by trying to fix even
fragment-level disorder. Maybe it should only do it on the partition
level, or maybe this should be made configurable, allowing the user to
select what to happen with those data that cannot be fixed.
Tests: unit(dev), unit(sstable_datafile_test:debug)
"
* 'sstable-scrub-segregate-by-partition/v1' of https://github.com/denesb/scylla:
test: boost/sstable_datafile_test: add tests for segregate mode scrub
api: storage_service/keyspace_scrub: expose new segregate mode
sstables: compaction/scrub: add segregate mode
mutation_fragment_stream_validator: add reset methods
mutation_writer: add segregate_by_partition
api: /storage_service/keyspace_scrub: add scrub mode param
sstables: compaction/scrub: replace skip_corrupted with mode enum
sstables: compaction/scrub: prevent infinite loop when last partition end is missing
tests: boost/sstable_datafile_test: use the same permit for all fragments in scrub tests
With strict mode, it could happen that a sstable alone in level 0 is
selected for offstrategy compaction, which means that we could run
into an infinite reshape process.
This is fixed by respecting the offstrategy threshold. Unit test is
added.
Fixes#8573.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20210506181324.49636-1-raphaelsc@scylladb.com>
`with_closeable` simplifies scoped use of
flat_mutation_reader, making sure to always close
the reader after use.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
The merger could return end-of-stream if some (but not all) of the
underlying readers were empty (i.e. not even returning a
`partition_start`). This could happen in places where it was used
(`time_series_sstable_set::create_single_key_sstable_reader`) if we
opened an sstable which did not have the queried partition but passed
all the filters (specifically, the bloom filter returned a false
positive for this sstable).
The commit also extends the random tests for the merger to include empty
readers and adds an explicit test case that catches this bug (in a
limited scope: when we merge a single empty reader).
It also modifies `test_twcs_single_key_reader_filtering` (regression
test for #8432) because the time where the clustering key filter is
invoked changes (some invocations move from the constructor of the
merger to operator()). I checked manually that it still catches the bug
when I reintroduce it.
Fixes#8445.
Closes#8446