"
After the recent conversion of the row-cache, two v1 mutation sources
remained: the memtable and the kl sstable reader.
This series converts both to a native v2 implementation. The conversion
is shallow: both continue to read and process the underlying (v1) data
in v1, the fragments are converted to v2 right before being pushed to
the reader's buffer. This conversion is simple, surgical and low-risk.
It is also better than the upgrade_to_v2() used previously.
Following this, the remaining v1 reader implementations are removed,
with the exception of the downgrade_to_v1(), which is the only one left
at this point. Removing this requires converting all mutation sinks to
accept a v2 stream.
upgrade_to_v2() is now not used in any production code. It is still
needed to properly test downgrade_to_v1() (which is still used), so we
can't remove it yet. Instead it is hidden as a private method of
mutation_source. This still allows for the above mentioned testing to
continue, while preventing anyone from being tempted to introduce new
usage.
tests: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/191
"
* 'convert-remaining-v1-mutation-sources/v2' of https://github.com/denesb/scylla:
readers: make upgrade_to_v2() private
test/lib/mutation_source_test: remove upgrade_to_v2 tests
readers: remove v1 forwardable reader
readers: remove v1 empty_reader
readers: remove v1 delegating_reader
sstables/kl: make reader impl v2 native
sstables/kl: return v2 reader from factory methods
sstables: move mp_row_consumer_reader_k_l to kl/reader.cc
partition_snapshot_reader: convert implementation to native v2
mutation_fragment_v2: range_tombstone_change: add minimal_memory_usage()
This just moves the upgrade_to_v2() calls to the other side of said
factory methods, preparing the ground for converting the kl reader impl
to a native v2 one.
If SSTable write fails, it will leave a partial sst which contains
a temporary TOC in addition to other components partially written.
The temporary TOC content is written upfront, so that all partial
components can be deleted using that content if the write fails.
After commit e5fc4b6, a partial sst cannot be deleted because the
deletion procedure incorrectly assumes that every SST being deleted
has a TOC, but partial SSTs only have a TMP TOC instead.
That happens because parent_path() requires all path components to
exist due to its usage of fs::path::canonical.
The consequence of this is that the space of partial files cannot be
reclaimed, making it harder for Scylla to recover from ENOSPC. Recovery
could happen by selecting a set of files for compaction with a
higher chance of succeeding given the free space.
This is fixed by only calling parent_path() on TMP TOC, which is
guaranteed to exist prior to calling fsync_directory().
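
The failure mode described above can be reproduced with the standard
library alone. This is a minimal sketch, not Scylla's parent_path():
canonical_parent() is a hypothetical helper that resolves a path with
std::filesystem::canonical(), which reports an error when the path does
not exist.

```cpp
#include <cassert>
#include <filesystem>
#include <optional>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (an assumption for illustration): resolve `p` with
// fs::canonical() and return its parent, or nullopt if `p` cannot be
// resolved, for example because it does not exist.
std::optional<fs::path> canonical_parent(const fs::path& p) {
    std::error_code ec;
    fs::path resolved = fs::canonical(p, ec);  // fails when `p` is missing
    if (ec) {
        return std::nullopt;
    }
    return resolved.parent_path();
}
```

Called on a component that was never written, such a helper yields no
result, while the same call on a path that exists succeeds.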
Fixes #10410.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
fsync_directory() is broken because it unconditionally performs
fsync on the parent directory, not on the directory that it was called
with. To fix, remove the wrong parent_path() usage.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
dirname() is confusing because if it's called on a directory, parent
path is retrieved. By renaming it to parent_path(), it's clearer
what the function will do exactly.
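
The naming concern can be illustrated with std::filesystem, whose
parent_path() behaves the same way. This is only a sketch:
containing_directory() is a hypothetical wrapper and the paths are made
up.

```cpp
#include <cassert>
#include <filesystem>

namespace fs = std::filesystem;

// Returns the directory that contains `p`. If `p` itself names a directory,
// the result is that directory's parent, not the directory itself.
fs::path containing_directory(const fs::path& p) {
    return p.parent_path();
}
```

For a file path this returns the directory holding the file. For a
directory path it returns one level up, which is why fsyncing the result
for a directory argument syncs the wrong directory.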
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
"
First migrate all users to the v2 variant, all of which are tests.
However, to be able to properly migrate all tests off it, a v2 variant
of the restricted reader is also needed. All restricted reader users are
then migrated to the freshly introduced v2 variant and the v1 variant is
removed.
Users include:
* replica::table::make_reader_v2()
* streaming_virtual_table::as_mutation_source()
* sstables::make_reader()
* tests
This allows us to get rid of a bunch of conversions on the query path,
which was mostly v2 already.
With a few tests we did kick the can down the road by wrapping the v2
reader in `downgrade_to_v1()`, but this series is long enough already.
Tests: unit(dev), unit(boost/flat_mutation_reader_test:debug)
"
* 'remove-reader-from-mutations-v1/v3' of https://github.com/denesb/scylla:
readers: remove now unused v1 reader from mutations
test: move away from v1 reader from mutations
test/boost/mutation_reader_test: use fragment_scatterer
test/boost/mutation_fragment_test: extract fragment_scatterer into a separate hh
test/boost: mutation_fragment_test: refactor fragment_scatterer
readers: remove now unused v1 reversing reader
test/boost/flat_mutation_reader_test: convert to v2
frozen_mutation: fragment_and_freeze(): convert to v2
frozen_mutation: coroutinize fragment_and_freeze()
readers: migrate away from v1 reversing reader
db/virtual_table: use v2 variant of reversing and forwardable readers
replica/table: use v2 variant of reversing reader
sstables/sstable: remove unused make_crawling_reader_v1()
sstables/sstable: remove make_reader_v1()
readers: add v2 variant of reversing reader
readers/reversing: remove FIXME
readers: reader from mutations: use mutation's own schema when slicing
No external users, only used internally, by make_reader(), who delegates
cases currently unsupported by v2 to it. The code needed from
make_reader_v1() is inlined into make_reader() and the former is
removed.
In most files it was unused. We should move these to the patch which
moved out the last interesting reader from mutation_reader.hh (and added
the corresponding new header include) but it's probably not worth the
effort.
Some other files still relied on mutation_reader.hh to provide reader
concurrency semaphore and some other misc reader related definitions.
"
This patchset adds two new operations to scylla-sstable:
* validate-checksums - helps identify whether an sstable is intact or
not, by checking the digest and the per-chunk checksums against the
data on disk.
* decompress - helps when one wants to manually examine the content of a
compressed sstable.
Refs: #497
Tests: unit(dev)
"
* 'scylla-sstable-validate-checksums-decompress/v3' of https://github.com/denesb/scylla:
tools/scylla-sstable: consume_sstables(): s/no_skips/use_crawling_reader/
tools/scylla-sstable: add decompress operation
tools/scylla-sstables: add validate-checksums operation
sstables/sstable: add validate_checksums()
sstables/sstable: add raw_stream option to data_stream()
sstables/sstable: make data_stream() and data_read() public
utils/exceptions: add maybe_rethrow_exception()
Sstables have two kinds of checksums: per-chunk checksums and a
full checksum (digest) calculated over the entire content of Data.db.
The full-checksum (digest) is stored in Digest.crc
(component_type::Digest).
When compression is used, the per-chunk checksum is stored directly
inside Data.db, after each compressed chunk. These are validated on
read, when decompressing the respective chunks.
When no compression is used, the per-chunk checksum is stored separately
in CRC.db (component_type::CRC). Chunk size is defined and stored in said
component as well.
In both compressed and uncompressed sstables, checksums are calculated
on the data that is actually written to disk, so in case of compressed
data, on the compressed data.
This method validates both the full checksum and the per-chunk checksum
for the entire Data.db.
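
A minimal sketch of validating both kinds of checksum over a buffer
follows. It assumes a plain CRC-32 as a stand-in, since the checksum
algorithm is not stated here, and all names (crc32_update, checksums,
compute_checksums, checksums_match) are hypothetical rather than taken
from the sstable code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string_view>
#include <vector>

// Bitwise CRC-32 (reflected polynomial 0xEDB88320), a stand-in checksum.
// `crc` is the running value from a previous call, or 0 to start.
std::uint32_t crc32_update(std::uint32_t crc, std::string_view data) {
    crc = ~crc;
    for (unsigned char byte : data) {
        crc ^= byte;
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

struct checksums {
    std::vector<std::uint32_t> per_chunk;  // one checksum per chunk
    std::uint32_t digest = 0;              // checksum over the whole content
};

// Computes per-chunk checksums and the full digest over the bytes as stored.
checksums compute_checksums(std::string_view data, std::size_t chunk_size) {
    if (chunk_size == 0) {
        throw std::invalid_argument("chunk_size must be positive");
    }
    checksums result;
    for (std::size_t off = 0; off < data.size(); off += chunk_size) {
        std::string_view chunk = data.substr(off, chunk_size);
        result.per_chunk.push_back(crc32_update(0, chunk));
        result.digest = crc32_update(result.digest, chunk);
    }
    return result;
}

// True only if every per-chunk checksum and the digest match `expected`.
bool checksums_match(std::string_view data, std::size_t chunk_size,
                     const checksums& expected) {
    checksums actual = compute_checksums(data, chunk_size);
    return actual.per_chunk == expected.per_chunk
        && actual.digest == expected.digest;
}
```

In the compressed case, `data` would be the compressed bytes, matching
the statement that checksums cover what is actually written to disk.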
The flat_mutation_reader files were conflated and contained multiple
readers, which was not strictly necessary. Splitting them improves
incremental compilation times, as touching rarely used readers no longer
recompiles large chunks of the codebase. Total compilation times are also
improved, as the sizes of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many files in the codebase.
With changes
real 29m14.051s
user 168m39.071s
sys 5m13.443s
Without changes
real 30m36.203s
user 175m43.354s
sys 5m26.376s
Closes #10194
The sstables::sstable class has two methods for writing sstables:
1) sstable_writer get_writer(...);
2) future<> write_components(flat_mutation_reader, ...);
(1) directly exposes the writer type, so we have to update all users of
it (there are not that many) in this same patch. We defer updating
users of (2) to follow-up commits.
Although Cassandra generally does not allow empty strings as partition
keys (note they are allowed as clustering keys!), it *does* allow empty
strings in regular columns to be indexed by a secondary index, or to
become an empty partition-key column in a materialized view. As noted in
issues #9375 and #9364 and verified in a few xfailing cql-pytest tests,
Scylla didn't allow these cases - and this patch fixes that.
The patch mostly *removes* unnecessary code: In one place, code
prevented an sstable with an empty partition key from being written.
Another piece of removed code was a function is_partition_key_empty()
which the materialized-view code used to check whether the view's
row will end up with an empty partition key, which was supposedly
forbidden. In fact, such rows should have been allowed, as they are
in Cassandra and as the secondary-index implementation requires, so
the entire function wasn't necessary.
Note that the removed function is_partition_key_empty() was *NOT* required
for the "IS NOT NULL" feature of materialized views - this continues to
work as expected after this patch, and we add another test to confirm it.
Being null and being an empty string are two different things.
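
The distinction can be modeled with std::optional. This is a sketch
only: column_value, passes_is_not_null() and is_empty_string() are
hypothetical names, not part of the materialized-view code.

```cpp
#include <cassert>
#include <optional>
#include <string>

// nullopt models a null column; an engaged optional holding "" models an
// empty string, which is a real (non-null) value.
using column_value = std::optional<std::string>;

// What an "IS NOT NULL" filter cares about: is there a value at all?
bool passes_is_not_null(const column_value& v) {
    return v.has_value();
}

// True only for a present value of length zero.
bool is_empty_string(const column_value& v) {
    return v.has_value() && v->empty();
}
```

An empty string passes the not-null check, while a null is not an empty
string, so neither check can stand in for the other.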
This patch also removes a part of a unit test which enshrined the
wrong behavior.
After this patch we are left with one interesting difference from
Cassandra: Though Cassandra allows a user to create a view row with an
empty-string partition key, and this row is fully visible when
scanning the view, this row can *not* be queried individually because
"WHERE v=''" is forbidden when v is the partition key (of the view).
Scylla does not reproduce this anomaly - and such a point query does work
in Scylla after this patch. We add a new test to check this case, and mark
it "cassandra_bug", i.e., it's a Cassandra behavior which we consider
wrong and don't want to emulate.
This patch relies on #9352 and #10178 having been fixed in previous patches,
otherwise the WHERE v='' does not work when reading from sstables.
We add to the already existing tests we had for empty materialized-views
keys a lookup with WHERE v='' which failed before fixing those two issues.
Fixes #9364
Fixes #9375
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If the sstable is marked for deletion, e.g. when
writing the sstable fails for any reason before it's
sealed, make sure to remove the sstable's temporary
directory, if present, besides the sstables files.
This condition is benign as these empty temp dirs
are removed when scylla starts up, but they do accumulate
and we had better remove them too.
Fixes #9522
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220302161827.2448980-1-bhalevy@scylladb.com>
This makes host id mismatch cause a warning and stop being fatal,
to un-break node replacement dtests.
Should be revisited if/when the underlying problem (double setting of
local host id on a replacing node) is fixed.
Refs #10148
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Message-Id: <20220303085049.186259-1-michael.livshin@scylladb.com>
Memtables are a replica-side entity, and so are moved to the
replica module and namespace.
Memtables are also used outside the replica, in two places:
- in some virtual tables; this is also in some way inside the replica,
(virtual readers are installed at the replica level, not the
coordinator), so I don't consider it a layering violation
- in many sstable unit tests, as a convenient way to create sstables
with known input. This is a layering violation.
We could make memtables their own module, but I think this is wrong.
Memtables are deeply tied into replica memory management, and trying
to make them a low-level primitive (at a lower level than sstables) will
be difficult. Not least because memtables use sstables. Instead, we
should have a memtable-like thing that doesn't support merging and
doesn't have all other funky memtable stuff, and instead replace
the uses of memtables in sstable tests with some kind of
make_flat_mutation_reader_from_unsorted_mutations() that does
the sorting that is the reason for the use of memtables in tests (and
live with the layering violation meanwhile).
Test: unit (dev)
Closes #10120
Currently, when advancing one of index_reader's bounds,
we're creating a new index_consume_entry_context with a new
underlying file input_stream for each new page.
For either bound, the streams can be reused, because
the indexes of pages that we are reading are never
decreasing.
This patch adds an index_consume_entry_context to each of
index_reader's bounds, so that for each new page, the same
file input_stream is used.
As a result, when reading consecutive pages, the reads that
follow the first one can be satisfied by the input_stream's
read aheads, decreasing the number of blocking reads and
increasing the throughput of the index_reader.
Fixes #2388
Signed-off-by: Wojciech Mitros <wojciech.mitros@scylladb.com>
Add an additional sstable validation step to check that originating
host id matches the local host id.
This is only done for ME-and-up sstables, which do not come from
upload/, and when the local host id is known.
When local host id is unknown, check that the sstable belongs to a
system keyspace, i.e. whether it is plausible that Scylla is still
booting up and hasn't loaded/generated the local host id yet.
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
(that is, instances of `std::optional`).
The ME sstable format includes optional originating host id in stats
metadata. We know how to write and parse uuids, but not how to write
and parse optionals.
The format is (used by C* in this case, and also happens to be
consistent with how booleans are serialized): first a boolean
indicating whether the contents are present (0 or 1, as a byte), then
the contents (if any).
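
The layout can be sketched as follows. This is a simplified
illustration, not the sstable serialization code: uuid_bytes,
write_optional() and read_optional() are hypothetical names, and the
UUID is treated as 16 opaque bytes because its byte order is not
described here.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <vector>

using uuid_bytes = std::array<std::uint8_t, 16>;  // stand-in for a UUID

// Writes one presence byte (0 or 1), then the contents if present.
void write_optional(std::vector<std::uint8_t>& out,
                    const std::optional<uuid_bytes>& value) {
    out.push_back(value ? 1 : 0);
    if (value) {
        out.insert(out.end(), value->begin(), value->end());
    }
}

// Reads the presence byte at `pos`, then the contents if it was non-zero.
// Advances `pos` past what was consumed; throws on truncated input.
std::optional<uuid_bytes> read_optional(const std::vector<std::uint8_t>& in,
                                        std::size_t& pos) {
    if (pos >= in.size()) {
        throw std::runtime_error("truncated input: missing presence byte");
    }
    const bool present = in[pos++] != 0;
    if (!present) {
        return std::nullopt;
    }
    uuid_bytes value{};
    if (in.size() - pos < value.size()) {
        throw std::runtime_error("truncated input: missing contents");
    }
    std::copy_n(in.begin() + static_cast<std::ptrdiff_t>(pos), value.size(),
                value.begin());
    pos += value.size();
    return value;
}
```

An absent value therefore costs one byte, and a present one costs one
byte plus the size of the contents.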
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.
Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.
The changes were applied mechanically with a script, except to
licenses/README.md.
Closes #9937
"
Said method should take care of checking that parsing stopped in a valid
state. This patch-set expands the existing but very lacking
implementation by improving the existing error message and adding an
additional check for prematurely exiting the parser in the middle of
parsing an index entry, something we've seen recently in #9446.
To help in debugging such issues, some additional information is added
to the trace messages.
The series also fixes a bug in the error handling code of the partition
index cache.
Refs: #9446
Tests: unit(dev)
"
* 'index-reader-better-verify-end-state/v2.1' of https://github.com/denesb/scylla:
sstables/index_reader: process_state(): add additional information to trace logging
sstables/index_reader: verify_end_state(): add check for premature EOS
sstables/index_reader: convert exception in verify_end_state() to malformed sstable exception
sstables/index_reader: add const sstable& to index_consume_entry_context
sstables/index_reader: remove unused members from index_consume_entry_context
bytes_on_disk is intended to reflect the bytes allocated for the
sstable files on disk.
Accumulating the files' logical size, as done today, causes a
discrepancy between information retrieved over the
storage_service/sstables_info api, like nodetool status or nodetool
cfstats and command line tools like df -H /var/lib/scylla.
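
The two notions of size can be told apart from a POSIX stat result. A
minimal sketch, assuming Linux, where st_blocks counts 512-byte units;
logical_size() and allocated_size() are hypothetical helpers, and how
Scylla itself obtains the allocated size is not stated here.

```cpp
#include <cassert>
#include <cstdint>
#include <sys/stat.h>

// Logical size: the file length as seen by a reader.
std::uint64_t logical_size(const struct stat& st) {
    return static_cast<std::uint64_t>(st.st_size);
}

// Allocated size: space the filesystem reserved for the file.
// Assumption: st_blocks is expressed in 512-byte units, as on Linux.
std::uint64_t allocated_size(const struct stat& st) {
    return static_cast<std::uint64_t>(st.st_blocks) * 512;
}
```

A 100-byte file that occupies eight 512-byte blocks has a logical size
of 100 but an allocated size of 4096, which is the figure tools like df
account for.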
Fixes #9941
Test: unit(dev)
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220118070208.3963076-1-bhalevy@scylladb.com>
sstables/sstables.cc uses seastar::metrics but was missing an include of
<seastar/core/metrics.hh>. It probably received this include through
some other random included Seastar header (e.g., smp.hh).
Now that we're reducing the unnecessary inclusions in Seastar (an ongoing
effort of Seastar patches), it is no longer included implicitly, and we
need to include it explicitly in sstables.cc.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20220109162823.511781-1-nyh@scylladb.com>
database.hh is expensive to include, and it turns out it's no longer
needed. Also stop including other unused headers.
The build time of sstables.o is reduced by ~3% (cleared all caches and
set cpu frequency to a fixed value before building sstables.o from
scratch)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20220104175908.98833-1-raphaelsc@scylladb.com>
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of keeping
the database consistent onto the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, the repair workload has the lowest priority in the system and
can be slowed down by higher priority workloads, so there is no
guarantee when a repair will finish. A gc_grace_seconds value that
used to work might not work after the data volume grows in a cluster.
Users might want to avoid running repair during a specific period where
latency is the top priority for their business.
To solve this problem, an automatic mechanism to protect against data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.
In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:
1) GC a tombstone after gc_grace_seconds
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;
This is the default mode. If no tombstone_gc option is specified by the
user, the old gc_grace_seconds based gc will be used.
2) Never GC a tombstone
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};
3) GC a tombstone immediately
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};
4) GC a tombstone after repair
cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};
In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the maximum time a write can be delayed before it
eventually arrives at a node.
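
The four modes can be summarized as a decision function. This is a
simplified sketch, not the implementation: gc_context and can_gc() are
hypothetical, times are plain seconds, and the exact comparisons
(including subtracting the propagation delay from the last repair time)
are assumptions made for illustration.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using seconds = std::chrono::seconds;

enum class tombstone_gc_mode { timeout, disabled, immediate, repair };

struct gc_context {
    seconds now{0};
    seconds gc_grace{0};
    seconds propagation_delay{0};
    std::optional<seconds> last_repair;  // empty if the range was never repaired
};

// Decides whether a tombstone with the given deletion time may be collected.
bool can_gc(tombstone_gc_mode mode, seconds deletion_time, const gc_context& ctx) {
    switch (mode) {
    case tombstone_gc_mode::timeout:
        // Assumed rule: collectable once gc_grace has fully elapsed.
        return deletion_time + ctx.gc_grace < ctx.now;
    case tombstone_gc_mode::disabled:
        return false;
    case tombstone_gc_mode::immediate:
        return true;
    case tombstone_gc_mode::repair:
        // Assumed rule: only tombstones older than the last repair, minus
        // the propagation delay, are known to be covered by that repair.
        return ctx.last_repair
            && deletion_time < *ctx.last_repair - ctx.propagation_delay;
    }
    return false;
}
```

In this model a table in 'repair' mode that has never been repaired
keeps all of its tombstones, regardless of their age.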
A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.
Tests: compaction_test.py, ninja test
Fixes #3560
[avi: resolve conflicts vs data_dictionary]
Currently this parse function reads only 100KB worth
of members in each iteration.
Since the default max_chunk_capacity is 128KB,
100KB underutilizes the chunk capacity, and it could
be safely increased to the max to reduce the number of
allocations and corresponding calls to read_exactly
for large arrays.
Expose utils::chunked_vector::max_chunk_capacity
so that the caller wouldn't have to guess this number
and use it in parse().
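
The effect on the number of reads is simple arithmetic. A sketch with a
hypothetical helper, reads_needed(), taking KB to mean 1024 bytes and a
1 MiB array as a made-up example size:

```cpp
#include <cassert>
#include <cstddef>

// Number of fixed-size reads needed to consume `total_bytes`, rounding up.
// `bytes_per_read` must be greater than zero.
constexpr std::size_t reads_needed(std::size_t total_bytes,
                                   std::size_t bytes_per_read) {
    return (total_bytes + bytes_per_read - 1) / bytes_per_read;
}
```

For a 1 MiB array, 100KB reads take 11 iterations while 128KB reads take
8, with correspondingly fewer allocations.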
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20211222103126.1819289-2-bhalevy@scylladb.com>
When invalid sstables are detected, move them
to the quarantine subdirectory so they won't be
selected for regular compaction.
Refs #7658
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Quarantined sstables will reside in a "quarantine" subdirectory
and are also not eligible for regular compaction.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Define the "staging", "upload", and "snapshots" subdirectory
names as named const expressions in the sstables namespace
rather than relying on their string representation,
which is prone to typos.
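
The idea can be sketched like this. The namespace, constant names and
subdir_path() helper below are hypothetical and only illustrate the
pattern; the actual identifiers in the sstables namespace may differ.

```cpp
#include <cassert>
#include <string>
#include <string_view>

namespace sstables_example {

// A misspelled constant name fails to compile, unlike a misspelled literal.
inline constexpr std::string_view staging_dir = "staging";
inline constexpr std::string_view upload_dir = "upload";
inline constexpr std::string_view snapshots_dir = "snapshots";

// Joins a table directory with one of the well-known subdirectory names.
std::string subdir_path(std::string_view table_dir, std::string_view subdir) {
    std::string path(table_dir);
    path += '/';
    path += subdir;
    return path;
}

}  // namespace sstables_example
```

Callers then write subdir_path(dir, staging_dir) instead of repeating
the "staging" literal at each use site.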
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Most of the machinery was already implemented since it was used when
jumping between clustering ranges of a query slice. We need only perform
one additional thing when performing an index skip during
fast-forwarding: reset the stored range tombstone in the consumer (which
may only be stored in fast-forwarding mode, so it didn't matter that it
wasn't reset earlier). Comments were added to explain the details.
The metric currently_open_for_writing, which reports the number of
sstables open for writing, holds the same value as
total_open_for_writing. That means we aren't actually decreasing the
counter, so it is bogus.
Moved to sstable_writer, because sstable is used by writer to open files,
which are then extracted from sstable object, and later the same object is
reused for read-only mode.
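
One way to keep such a gauge accurate is to tie it to the lifetime of
the writer object. This is a sketch under that assumption: writer_stats
and open_for_writing_guard are hypothetical, and whether the actual fix
uses this exact pattern is not stated here.

```cpp
#include <cassert>
#include <cstdint>

struct writer_stats {
    std::uint64_t total_open_for_writing = 0;      // only ever grows
    std::uint64_t currently_open_for_writing = 0;  // goes up and down
};

// Increments both counters on construction and decrements only the
// "currently open" gauge on destruction.
class open_for_writing_guard {
    writer_stats& _stats;
public:
    explicit open_for_writing_guard(writer_stats& stats) : _stats(stats) {
        ++_stats.total_open_for_writing;
        ++_stats.currently_open_for_writing;
    }
    ~open_for_writing_guard() {
        --_stats.currently_open_for_writing;
    }
    open_for_writing_guard(const open_for_writing_guard&) = delete;
    open_for_writing_guard& operator=(const open_for_writing_guard&) = delete;
};
```

Once every guard has gone out of scope the gauge returns to zero while
the total keeps its value, so the two metrics no longer coincide.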
Fixes #9455.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Message-Id: <20211013134812.177398-1-raphaelsc@scylladb.com>
Not necessitating these to be extracted from the sstable dir path. This
practically allows for la/mx sstables at non-standard paths to be
opened. This will be used by the `scylla-sstable` tool which wants to be
flexible about where the sstables it opens are located.