Commit Graph

107 Commits

Author SHA1 Message Date
Taras Veretilnyk
bc2e83bc1f sstables: store digest of all sstable components in scylla metadata
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in the sstable structure and later persisted to disk
as part of the Scylla metadata component during writer::consume_end_of_stream.
2025-12-04 21:00:09 +01:00
Taras Veretilnyk
a191503ddf sstables: Extract file writer closing logic into separate methods
Refactor the consume_end_of_stream() method by extracting the inline
file writer closing logic into dedicated methods:
- close_index_writer()
- close_partitions_writer()
- close_rows_writer()
2025-12-02 13:07:41 +01:00
Michał Chojnowski
3cf51cb9e8 sstables: fix some typos in comments
I added those typos recently, and spellcheckers complain.

Closes scylladb/scylladb#26376
2025-10-07 13:20:06 +03:00
Michał Chojnowski
cee4011e7a sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
Partitions.db internally uses a piece of the partition key murmur
hash (the same hash which is used to compute the token and the
relevant bits in the bloom filter). Before this patch,
the Partitions.db writer computes the hash internally from the
`sstables::partition_key`.

That's a waste, because this hash is also computed for bloom filter
purposes just before that, in the owning sstable writer.
So in this patch we let the caller pass that hash here instead.
2025-09-29 13:01:21 +02:00
Michał Chojnowski
f8e3d5e7c2 sstables/mx: make Index and Summary components optional
In previous patches we (hopefully) modified all users of
Index and Summary components so that they don't longer
need those components to exist. (And can use Partitions and
Rows components instead).
2025-09-29 13:01:21 +02:00
Michał Chojnowski
b1984d6798 sstables: implement an alternative way to rebuild bloom filters for sstables without Index
For efficiency, the cardinality of the bloom filter
(i.e. the number of partition keys which will be written into the sstable)
has to be known before elements are inserted into the filter.

In some cases (e.g. memtables flush) this number is known exactly.
But in others (e.g. repair) it can only be estimated,
and the estimation might be very wrong, leading to an oversized filter.

Because of that, some time ago we added a piece of logic
(ran after the sstable is written, but before it's sealed)
which looks at the actual number of written partitions,
compares it to the initial estimate (on which the size of the bloom
filter was based on), and if the difference is unacceptably large,
it rewrites the bloom filter from partition keys contained in Index.db.

But the idea to rebuild the bloom filters from index files
isn't going to work with BTI indexes, because they don't store
whole partition keys. If we want sstables which don't have Index.db
files, we need some other way to deal with oversized filters.
Partition keys can be recovered from Data.db,
but that would often be way too expensive.

This patch adds another way. We introduce a new component file,
TemporaryHashes. This component, if written at all,
contains the 16-byte murmur hash for every partition key, in order,
and can be used in place of Index to reconstruct the bloom filter.

(Our bloom filters are actually built from the set of murmur hashes of
partition keys. The first step of inserting a partition key into a
filter is hashing the key. Remembering the hashes is sufficient
to build the filter later, without looking at partition keys again.)

As of this patch, if the Index component is not being written,
we don't allocate and populate a bloom filter during the Data.db write.
Instead, we write the murmur hashes to TemporaryHashes, and only
later, after the Data write finishes, we allocate the optimal-size,
bloom filter, we read the hashes back from TemporaryHashes,
and we populate the filter with them.

That is suboptimal.
Writing the hashes to disk (or worse, to S3) and reading
them back is more expensive than building the bloom filter
during the main Data pass.
So ideally it should be avoided in cases where we know
in advance that the partition key count estimate is good enough.
(Which should be the case in flushes and compactions).
But we defer that to a future patch.
(Such a change would involve passing some flag to the sstable writer
if the cardinality estimate is trustworthy, and not creating
TemporaryHashes if the estimate is trustworthy).
2025-09-29 13:01:21 +02:00
Michał Chojnowski
e0fda9ae6f sstables/mx/writer: populate BTI index files
In the previous patch we added code responsible
for creating and opening Partitions.db and Rows.db,
but we left those files empty.

In this patch, we populate the files using
`trie::bti_row_index_writer` and `trie::bti_partition_index_writer`.

Note: for the row index, we insert the same clustering blocks to
both indexes. The logic for choosing the size of the blocks
hasn't been changed in any way.

Much of this patch has to do with propagating the current range
tombstone down to all places which can start a new clustering block.

The reason we need that is that, for each clustering block,
BIG indexes store the range tombstone succeeding the block
(i.e. the range tombstone in between the given block and its successor)
BTI indexes store the range tombstone preceding the block,
(i.e. the range tombstone in between the given block and its predecessor).
So before the patch there's no code which looks at the current tombstone
when *starting* the block, only when *ending* the block.

This patch adds an extra copy for each `decorated_key`.
This is mostly unavoidable -- the BTI partition writer just
has to remember the key until its successor appears, to find the
common prefix. (We could avoid the key copy if the BTI isn't used, though.
We don't do that in this patch, we just let the copy happen).
2025-09-29 13:01:21 +02:00
Michał Chojnowski
cdcf34b3a0 sstables: create and open BTI index files, when enabled
This patch adds code responsible for creation and opening
of BTI index components (Rows.db, Partitions.db) when
BTI index writing is enabled.

(It is enabled if the cluster feature is enabled and the relevant
config entry permits it).

The files are empty for now, and are never read.
We will populate and use them in following patches.
2025-09-29 13:01:21 +02:00
Michał Chojnowski
e04ee6d5f6 sstables/mx/writer: make _pi_write_m.partition_tombstone a sstables::deletion_time
There's a test (boost/sstable_compaction_test.cc::tombstone_purge_test)
which tests the value of `_stats.capped_tombstone_deletion_time`.

Before this patch, for "ms" sstables, `to_deletion_time` would have
be called twice for each written partition tombstone, which would fail
the test.

Since `_pi_write_m.partition_tombstone` always ends up being
converted from `tombstone` to `sstables::deletion_time` anyway,
let's just make it a `sstables::deletion_time` to begin with.
This will ensure that `to_deletion_time` will be able to be
only called once per partition tombstone.
2025-09-29 13:01:20 +02:00
Michał Chojnowski
416f5d64d4 sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone
With `tomb` it's not obvious enough what tombstone this field is about.
2025-09-17 12:22:40 +02:00
Michał Chojnowski
94088f0a41 sstables/writer: add an accessor for the current write position in Data.db
It will be used by index tests to know the ground truth for
where each partition and row are written, so that this can
be checked against the index.
2025-09-07 00:32:02 +02:00
Michał Chojnowski
30dad06c9a sstables/mx: move clustering_info from writer.cc to types.hh
We will use this type as the input to the BTI row index writer.
Since it will be implemented in other translation units,
the definition of the type has to be moved to a header.
2025-08-14 02:06:33 +02:00
Botond Dénes
592ca789e2 sstables/mx/writer: handler rows with empty keys
Although valid for compact tables, non-full (or empty) clustering key
prefixes are not handled for row keys when writing sstables. Only the
present components are written, consequently if the key is empty, it is
omitted entirely.
When parsing sstables, the parsing code unconditionally parses a full
prefix. This mis-match results in parsing failures, as the parser parses
part of the row content as a key resulting in a garbage key and
subsequent mis-parsing of the row content and maybe even subsequent
partitions.

Use the recently introduced corrupt_data_handler to handle rows with
such corrupt keys. This way, we avoid corrupting the sstables beyond
parsing and the rows are also kept around in system.corrupt_data for
later inspection and possible recovery.
2025-06-25 08:41:29 +03:00
Michał Chojnowski
b18ddcb92e sstables: delegate compressor creation to the compressor factory
Remove `compressor::create()`. This enforces that compressors
are only created through the `sstable_compressor_factory`.

Unlike the synchronous `compressor::create()`, the factory will be able
to create dict-aware compressors.
2025-04-01 00:07:28 +02:00
Michał Chojnowski
8e611536b0 sstables/compress: move ownership of compressor to sstable::compression
SSTable readers and writers use `compressor` objects to compress and
decompress chunks of SSTable data files.

`compressor` objects are read-only, so only one of them is needed
for each SSTable. Before this commit, each reader and writer has
its own `compressor` object. This isn't necessary, but it's okay.

But later in this series it will stop being okay, because the creation
of a `compressor` will become an expensive cross-shard
operation (because it might require sharing a compression dictionary
from another shard). So we have to adjust the code so that there is
only once `compressor` per sstable, not one per reader/writer.

We stuff the ownership of this compressor into `sstable::compression`.

To make the ownership clear, we remove `compression_ptr` shared
pointers from readers and writers, and make them access the
compressor via the `sstable::compression` instead.
2025-04-01 00:07:27 +02:00
Avi Kivity
73e4a3c581 sstables: store features early in write path
sstable features indicate that an sstable has some extension, or that
some bug was fixed. They allow us to know if we can rely on certain
properties in a read sstables.

Currently, sstable features are set early in the read path (when we
read the scylla metadata file) and very late in the write path
(when we write the scylla metadata file just before sealing the sstable).

However, we happen to read features before we set them in the write path -
when we resize the bloom filter for a newly written sstable we instantiate
an index reader, and that depends on some features. As a result,
we read a disengaged optional (for the scylla metadata component) as if
it was engaged. This somehow worked so far, but fails with libstdc++
hash table implementation.

Fix it by moving storage of the features to the sstable itself, and
setting it early in the write path.

Fixes #23484

Closes scylladb/scylladb#23485
2025-03-31 09:33:56 +03:00
Pavel Emelyanov
68c41f0459 sstables: Make file_writer keep component_name on board
The class in question is a wrapper around output_stream that writes,
flushes and closes the stream in async context. For logging it also
keeps the component filename on board, and now it's good time to patch
it and keep the component_filename instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
1ba91e28cb sstables: Make get_filename() return component_name
Similarly to previous patches -- mostly the result is used as log
argument. The remaining users include

- scylla sstable tool that dumps component names to json output
- API endpoint that returns component names to user
- tests

these are all good to explicitly convert component_names to strings.

There are few more places that expect strings instead of component name
objects. For now they also use fmt::to_string() explicitly, partially it
will be fixed later, mostly -- as future follow-ups.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:03:29 +03:00
Pavel Emelyanov
80e0030613 sstables: Make sstable::index_filename() return component_name
Most of the method callers use it as log parameter. There are few more
places that push it to malformed_sstable_exception, which immediately
converts it to string, so this patch makes the exception be constructed
with the component_name either.

And there's one more place that passes this string to file_writer
constructor. For now, convert it to string explicitly, but next patches
will fix that place to use pure component_name too.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 13:01:23 +03:00
Pavel Emelyanov
dcc9167734 sstables: Rename filename($component) calls to ${component}_filename()
There's a generic sstable::filename(component_type) method that returns
a file name for the given component. For "popular" components, namely
TOC, Data and Index there are dedicated sstable methods to get their
names. Fix existing callers of the generic method to use the former.
It's shorter, nicer and makes further patching simpler.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-03-19 12:45:21 +03:00
Pavel Emelyanov
eff61b167c treewide: Reduce db/config.hh header fanout
Drop it from files that obviously don't need it. Also kill some forward
declarations while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#22979
2025-02-25 15:16:40 +01:00
Kefu Chai
0237913337 sstables: Migrate from boost::adaptors::indexed to std::views::enumerate
This change modernizes the codebase by:
- Replacing Boost's indexed adaptor with C++20's std::views::enumerate
- Removing unnecessary Boost header inclusion

With this change, we can:
- Reduce external dependencies
- Leverage standard library features
- Improve long-term code maintainability

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22469
2025-01-26 20:51:14 +02:00
Kefu Chai
93be8f3a0c db,sstables: migate boost::range::stable_partition to std library
now that we are allowed to use C++23. we now have the luxury of using
`std::ranges::stable_partition`.

in this change, we:

- replace `boost::range::stable_parition()` to
  `std::ranges::stable_parition()`
- since `std::ranges::stable_parition()` returns a subrange instead of
  an iterator, change the names of variables which were previously used
  for holding the return value of `boost::range::stable_partition()`
  accordingly for better readability.
- remove unused `#include` of boost headers

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21911
2024-12-19 14:56:07 +02:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Avi Kivity
94c21e5c05 Merge 'sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions' from Tomasz Grabiec
Single-row reads from large partition issue 64 KiB reads to the data file,
which is equal to the default span of the promoted index block in the data file.
If users would want to increase selectivity of the index to speed up single-row reads,
this won't be effective. The reason is that the reader uses promoted index
to look up the start position in the data file of the read, but end position
will in practice extend to the next partition, and amount of I/O will be
determined by the underlying file input stream implementation and its
read-ahead heuristics. By default, that results in at least 2 IOs 32KB each.

There is already infrastructure to lookup end position based on upper
bound of the read, in anticipation for sharing the promoted index cache,
but it's not effective becasue it's a non-populating lookup and the upper
bound cursor has its own private cached_promoted_index, which is cold
when positions are computed. It's non-populating on purpose, to avoid
extra index file IO to read upper bound. In case upper bound is far-enough
from the lower bound, this will only increase the cost of the read.

The solution employed here is to warm up the lower bound cursor's
cache before positions are computed, and use that cursor for
non-populating lookup of the upper bound.

We use the lower bound cursor and the slice's lower bound so that we
read the same blocks as later lower-bound slicing would, so that we
don't incur extra IO for cases where looking up upper bound is not
worth it, that is when upper bound is far from the lower bound. If
upper bound is near lower bound, then warming up using lower bound
will populate cached_promoted_index with blocks which will allow us to
locate the upper bound block accurately.  This is especially important
for single-row reads, where the bounds are around the same key.  In
this case we want to read the data file range which belongs to a
single promoted index block.  It doesn't matter that the upper bound
is not exactly the same. They both will likely lie in the same block,
and if not, binary search will bring adjacent blocks into cache.  Even
if upper bound is not near, the binary search will populate the cache
with blocks which can be used to narrow down the data file range
somewhat.

Fixes #10030.

The change was tested with perf-fast-forward.

I populated the data set with `column_index_size_in_kb` set to 1

  scylla perf-fast-forward --populate --run-tests=large-partition-slicing --column-index-size-in-kb=1

Test run:

  build/release/scylla perf-fast-forward --run-tests=large-partition-select-few-rows -c1 --keep-cache-across-test-cases --test-case-duration=0

This test issues two reads of subsequent keys from the middle of a large partition (1M rows in total). The first read will miss in the index file page cache, the second read will hit.

Notice that before the change, the second read issued 2 aio requests worth of 64KiB in total.
After the change, the second read issued 1 aio worth of 2 KiB. That's because promoted index block is larger than 1 KiB.
I verified using logging that the data file range matches a single promoted index block.

Also, the first read which misses in cache is still faster after the change.

Before:

```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009802            1         1        102          0        102        102       21.0     21        196       2       1        0        1        1        0        0        0       568     269 4716050  53.4%
500001  1         0.000321            1         1       3113          0       3113       3113        2.0      2         64       1       0        1        0        0        0        0        0       116      26  555110  45.0%
```

After:

```
running: large-partition-select-few-rows on dataset large-part-ds1
Testing selecting few rows from a large partition:
stride  rows      time (s)   iterations     frags     frag/s    mad f/s    max f/s    min f/s    avg aio    aio      (KiB) blocked dropped  idx hit idx miss  idx blk    c hit   c miss    c blk    allocs   tasks insns/f    cpu
500000  1         0.009609            1         1        104          0        104        104       20.0     20        137       2       1        0        1        1        0        0        0       561     268 4633407  43.1%
500001  1         0.000217            1         1       4602          0       4602       4602        1.0      1          2       1       0        1        0        0        0        0        0       110      26  313882  64.1%
```

Backports: none, not a regression

Closes scylladb/scylladb#20522

* github.com:scylladb/scylladb:
  perf: perf_fast_forward: Add test case for querying missing rows
  perf-fast-forward: Allow overriding promoted index block size
  perf-fast-forward: Test subsequent key reads from the middle in test_large_partition_select_few_rows
  perf-fast-forward: Allow adding key offset in test_large_partition_select_few_rows
  perf-fast-forward: Use single-partition reads in test_large_partition_select_few_rows
  sstables: bsearch_clustered_cursor: Add more tracing points
  sstables: reader: Log data file range
  sstables: bsearch_clustered_cursor: Unify skip_info logging
  sstables: bsearch_clustered_cursor: Narrow down range using "end" position of the block
  sstables: bsearch_clustered_cursor: Skip even to the first block
  test: sstables: sstable_3_x_test: Improve failure message
  sstables: mx: writer: Never include partition_end marker in promoted index block width
  sstables: Reduce amount of I/O for clustering-key-bounded reads from large partitions
  sstables: clustered_cursor: Track current block
2024-10-28 21:13:23 +02:00
Avi Kivity
820509026f schema: replace boost ranges with std ranges
To reduce dependency load, use std ranges instead of boost ranges.

The std::ranges::{lower,upper}_bound don't support heterogeneous lookup,
but a more natural solution is to use a projection to search for the name,
so we use that and the custom comparator is removed.

Many callers are converted as well due to poor interoperability between
boost ranges and std ranges.
2024-10-15 16:42:54 +03:00
Tomasz Grabiec
7f077893ed sstables: mx: writer: Never include partition_end marker in promoted index block width
Currently, it may happen that the last promoted index block includes
the partition_end marker. That's because we first write the partition
end marker and then emit the unclosed block. This behavior matches
Cassandra (checked in 3.x and 5.0.1).

This is problematic for ruling out data file reads based on index.
The width field is currently unused, but it will be used later where
the width of the last block is used to compute the skip position past
the last block for lookups which land after all keys in the
partition. If width includes the marker then such a skip would land in
the next partition, which is incorrect, as the reader context expects
a cell element. Even if that was recognized, it's wrong - if this is
not a single partition read (so upper bound is not at the next
partition too), then we would read from the wrong (next) partition.

We want to be able to make such skips in order to avoid unnecessary
data file IO for reads of missing rows. Currently, we would always
read the last block even if the key is past its "end" position.

Another way to solve this would be to propagate the "past the last
block" condition from the index cursor to the reader and let it deal
with it, but the logic for that would be complicated. With this fix,
there is no special logic required.
2024-10-03 14:09:57 +02:00
Botond Dénes
c7c5817808 Merge 'Improve timestamp heuristics for tombstone garbage collection' from Benny Halevy
When purging regular tombstone consult the min_live_timestamp, if available.
This is safe since we don't need to protect dead data from resurrection, as it is already dead.

For shadowable_tombstones, consult the min_memtable_live_row_marker_timestamp,
if available, otherwise fallback to the min_live_timestamp.

If we see in a view table a shadowable tombstone with time T, then in any row where the row marker's timestamp is higher than T the shadowable tombstone is completely ignored and it doesn't hide any data in any column, so the shadowable tombstone can be safely purged without any effect or risk resurrecting any deleted data.

In other words, rows which might cause problems for purging a shadowable tombstone with time T are rows with row markers older or equal T. So to know if a whole sstable can cause problems for shadowable tombstone of time T, we need to check if the sstable's oldest row marker (and not oldest column) is older or equal T. And the same check applies similarly to the memtable.

If both extended timestamp statistics are missing, fallback to the legacy (and inaccurate) min_timestamp.

Fixes scylladb/scylladb#20423
Fixes scylladb/scylladb#20424

> [!NOTE]
> no backport needed at this time
> We may consider backport later on after given some soak time in master/enterprise
> since we do see tombstone accumulation in the field under some materialized views workloads

Closes scylladb/scylladb#20446

* github.com:scylladb/scylladb:
  cql-pytest: add test_compaction_tombstone_gc
  sstable_compaction_test: add mv_tombstone_purge_test
  sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection
  sstable_compaction_test: tombstone_purge_test: add testlog debugging
  sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp
  sstable, compaction: add debug logging for extended min timestamp stats
  compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats
  compaction: define max_purgeable_fn
  tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh
  sstables: scylla_metadata: add ext_timestamp_stats
  compaction_group, storage_group, table_state: add extended timestamp stats getters
  sstables, memtable: track live timestamps
  memtable_encoding_stats_collector: update row_marker: do nothing if missing
2024-09-13 08:56:51 +03:00
Kefu Chai
3e84d43f93 treewide: use seastar::format() or fmt::format() explicitly
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.

that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:

```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
  265 |     return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
      |            ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
 4290 |     format(format_string<_Args...> __fmt, _Args&&... __args)
      |     ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
  143 | format(fmt::format_string<A...> fmt, A&&... a) {
      | ^
```

in this change, we

change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
  `seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
  because, `sstring::operator std::basic_string` would incur a deep
  copy.

we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-11 23:21:40 +03:00
Benny Halevy
4de4af954f sstables: scylla_metadata: add ext_timestamp_stats
Store and retrieve the optional extended timestamp statistics
(min_live_timestamp and min_live_row_marker_timestamp)
in the scylla_metadata component.

Note that there is no need for a cluster feature to
store those attributes since the scylla_metadata
on-disk format is extensible so that old sstables
can be read by new versions, seeing the extra stats
is missing, and new sstables can be read by old
versions that ignore unknown scylla metadata section types.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:57 +03:00
Benny Halevy
14d86a3a12 sstables, memtable: track live timestamps
When garbage collecting tombstones, we care only
about shadowing of live data.  However, currently
we track min/max timestamp of both live and dead
data, but there is no problem with purging tombstones
that shadow dead data (expired or shdowed by other
tombstones in the sstable/memtable).

Also, for shadowable tombstones, we track live row marker timestamps
separately since, if the live row marker timestamp is greater than
a shadowable tombstone timestamp, then the row marker
would shadow the shadowable tombstone thus exposing the cells
in that row, even if their timestasmp may be smaller
than the shadow tombstone's.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:49 +03:00
Avi Kivity
aa1270a00c treewide: change assert() to SCYLLA_ASSERT()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.

Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.

To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.

[1] 66ef711d68

Closes scylladb/scylladb#20006
2024-08-05 08:23:35 +03:00
Lakshmi Narayanan Sreethar
7b58fa2534 sstables: use _origin in write path
Now that the origin is available inside the sstable object, no need to
pass it to the methods called in the write path.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:44:28 +05:30
Lakshmi Narayanan Sreethar
b762a09dcd sstable::open_sstable: pass and store origin
Pass origin when opening the sstable from the writer and store it in the
sstable object. This will make the origin available for the entire write
path.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:43:30 +05:30
Michał Chojnowski
1a8ee69a43 sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.

The problem fixed by this patch is that the writer was wrongly creating
the compressor objects based on its own schema, but using them based
based on the sstable's schema the sstable's schema.
This patch forces the writer to use the sstable's schema for both.
2024-07-11 12:53:54 +02:00
Michał Chojnowski
d10b38ba5b sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.

The problem fixed by this patch is that the writer was wrongly creating
the filter based on its own schema, while the layer outside the writer
was interpreting it as if it was created with the sstable's schema.

This patch forces the writer to pick the filter's parameters based on the
sstable's schema instead.
2024-07-11 12:53:54 +02:00
Lakshmi Narayanan Sreethar
c80df8504c sstables::maybe_rebuild_filter_from_index: log sstable origin
Log the sstable origin when its bloom filter is being rebuilt. The
origin has to be passed to the method by the caller as it is not
available in the sstable object when the filter is rebuilt.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#19601
2024-07-04 10:01:23 +03:00
Lakshmi Narayanan Sreethar
21e463b108 sstables/mx/writer: rebuild bloom filters with bad partition estimates
The bloom filters are built with partition estimates, as the actual
partition count might not be available in all the cases. If the estimate
was bad, the bloom filters might end up too large or too small than
their optimal sizes. Rebuild such bloom filters with actual partition
count before the filter is written to disk and the sstable is sealed.

Fixes #19049

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-06-24 12:06:02 +05:30
Lakshmi Narayanan Sreethar
afc90657d6 sstables/mx/writer: add variable to track number of partitions consumed
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-06-24 12:06:02 +05:30
Ferenc Szili
b06af5b2b9 sstable: write dead_rows count to system.large_partitions 2024-05-02 11:49:10 +02:00
Ferenc Szili
63e724c974 sstable: added counter for dead rows 2024-05-02 11:49:10 +02:00
Ferenc Szili
98bec4e02a sstable: large data handler needs to count range tombstones as rows
When issuing warnings about partitions with the number of rows above a configured threshold,
the  large partitions handler does not take into consideration the number of range tombstone
markers in the total rows count. This fix adds the number of range tombstone markers to the
total number of rows and saves this total in system.large_partitions.rows (if it is above
the threshold). It also adds a new column range_tombstones to the system.large_partitions
table which only contains the number of range tombstone markers for the given partition.

This PR fixes the first part of issue #13968
It does not cover distinguishing between live and dead rows. A subsequent PR will handle that.
2024-04-22 15:24:18 +02:00
Avi Kivity
7cb1c10fed treewide: replace seastar::future::get0() with seastar::future::get()
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.

Replace with seastar::future::get(), which does the same thing.
2024-02-02 22:12:57 +08:00
Avi Kivity
8ee75ae8f4 sstables: writer: don't require effective_replication_map for sharding metadata
Currently, we pass an effective_replication_map_ptr to sstable_writer,
so that we can get a stable dht::sharder for writing the sharding metadata.
This is needed because with tablets, the sharder can change dynamically.

However, this is both bad and unnecessary:
 - bad: holding on to an effective_replication_map_ptr is a barrier
   for topology operations, preventing tablet migrations (etc) while
   an sstable is being written
 - unnecessary: tablets don't require sharding metadata at all, since
   two tablets cannot overlap (unlike two sstables from different shards in
   the same node). So the first/last key is sufficient to determine the
   shard/tablet ownership.

Given that, just pass the sharder for vnode sstables, and don't generate
sharding metadata for tablet sstables.
2024-01-23 22:23:08 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Kefu Chai
9c24be05c3 sstable/writer: log sstable name and pk when capping ldt
when the local_deletion_time is too large and beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of bad configuration or a bug in scylla.
so let's add more information in the logging message to
help track back to the source of the problem.

Fixes #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-21 19:25:32 +08:00
Tomasz Grabiec
17d6163548 sstables: Generate sharding metadata using sharder from erm when writing
We need to keep sharding metadata consistent with tablet mapping to
shards in order for node restart to detect that those sstables belong
to a single shard and that resharding is not necessary. Resharding of
sstables based on tablet metadata is not implemented yet and will
abort after this series.

Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
2023-06-21 00:58:24 +02:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Pavel Emelyanov
ac1e56c9d9 sstable, storage: Virtualize data sink making for Data and Index
Add the make_data_or_index_sink() virtual method and its implementation for
filesystem_storage.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-10 16:43:01 +03:00
Pavel Emelyanov
1d4fcce5dd sstable/writer: Shuffle writer::init_file_writers()
The method needs to create two data sinks -- for Data and for Index
files -- and then wrap it with more stuff (compression, checksums,
streams, etc.). With S3 backend using file-output-stream won't work,
becase S3 storage cannot provide writable file API (it has data_sink
instead).

This patch extracts file_data_sink creation so that it could be
virtualized with storage API later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-10 16:43:01 +03:00