Commit Graph

80 Commits

Author SHA1 Message Date
Botond Dénes
c7c5817808 Merge 'Improve timestamp heuristics for tombstone garbage collection' from Benny Halevy
When purging regular tombstone consult the min_live_timestamp, if available.
This is safe since we don't need to protect dead data from resurrection, as it is already dead.

For shadowable_tombstones, consult the min_memtable_live_row_marker_timestamp,
if available, otherwise fallback to the min_live_timestamp.

If we see in a view table a shadowable tombstone with time T, then in any row where the row marker's timestamp is higher than T the shadowable tombstone is completely ignored and it doesn't hide any data in any column, so the shadowable tombstone can be safely purged without any effect or risk resurrecting any deleted data.

In other words, rows which might cause problems for purging a shadowable tombstone with time T are rows with row markers older or equal T. So to know if a whole sstable can cause problems for shadowable tombstone of time T, we need to check if the sstable's oldest row marker (and not oldest column) is older or equal T. And the same check applies similarly to the memtable.

If both extended timestamp statistics are missing, fallback to the legacy (and inaccurate) min_timestamp.

Fixes scylladb/scylladb#20423
Fixes scylladb/scylladb#20424

> [!NOTE]
> no backport needed at this time
> We may consider backport later on after given some soak time in master/enterprise
> since we do see tombstone accumulation in the field under some materialized views workloads

Closes scylladb/scylladb#20446

* github.com:scylladb/scylladb:
  cql-pytest: add test_compaction_tombstone_gc
  sstable_compaction_test: add mv_tombstone_purge_test
  sstable_compaction_test: tombstone_purge_test: test that old deleted data do not inhibit tombstone garbage collection
  sstable_compaction_test: tombstone_purge_test: add testlog debugging
  sstable_compaction_test: tombstone_purge_test: make_expiring: use next_timestamp
  sstable, compaction: add debug logging for extended min timestamp stats
  compaction: get_max_purgeable_timestamp: use memtable and sstable extended timestamp stats
  compaction: define max_purgeable_fn
  tombstone: can_gc_fn: move declaration to compaction_garbage_collector.hh
  sstables: scylla_metadata: add ext_timestamp_stats
  compaction_group, storage_group, table_state: add extended timestamp stats getters
  sstables, memtable: track live timestamps
  memtable_encoding_stats_collector: update row_marker: do nothing if missing
2024-09-13 08:56:51 +03:00
Kefu Chai
3e84d43f93 treewide: use seastar::format() or fmt::format() explicitly
before this change, we rely on `using namespace seastar` to use
`seastar::format()` without qualifying the `format()` with its
namespace. this works fine until we changed the parameter type
of format string `seastar::format()` from `const char*` to
`fmt::format_string<...>`. this change practically invited
`seastar::format()` to the club of `std::format()` and `fmt::format()`,
where all members accept a templated parameter as its `fmt`
parameter. and `seastar::format()` is not the best candidate anymore.
despite that argument-dependent lookup (ADT for short) favors the
function which is in the same namespace as its parameter, but
`using namespace` makes `seastar::format()` more competitive,
so both `std::format()` and `seastar::format()` are considered
as the condidates.

that is what is happening scylladb in quite a few caller sites of
`format()`, hence ADT is not able to tell which function the winner
in the name lookup:

```
/__w/scylladb/scylladb/mutation/mutation_fragment_stream_validator.cc:265:12: error: call to 'format' is ambiguous
  265 |     return format("{} ({}.{} {})", _name_view, s.ks_name(), s.cf_name(), s.id());
      |            ^~~~~~
/usr/bin/../lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/format:4290:5: note: candidate function [with _Args = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
 4290 |     format(format_string<_Args...> __fmt, _Args&&... __args)
      |     ^
/__w/scylladb/scylladb/seastar/include/seastar/core/print.hh:143:1: note: candidate function [with A = <const std::basic_string_view<char> &, const seastar::basic_sstring<char, unsigned int, 15> &, const seastar::basic_sstring<char, unsigned int, 15> &, const utils::tagged_uuid<table_id_tag> &>]
  143 | format(fmt::format_string<A...> fmt, A&&... a) {
      | ^
```

in this change, we

change all `format()` to either `fmt::format()` or `seastar::format()`
with following rules:
- if the caller expects an `sstring` or `std::string_view`, change to
  `seastar::format()`
- if the caller expects an `std::string`, change to `fmt::format()`.
  because, `sstring::operator std::basic_string` would incur a deep
  copy.

we will need another change to enable scylladb to compile with the
latest seastar. namely, to pass the format string as a templated
parameter down to helper functions which format their parameters.
to miminize the scope of this change, let's include that change when
bumping up the seastar submodule. as that change will depend on
the seastar change.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-09-11 23:21:40 +03:00
Benny Halevy
4de4af954f sstables: scylla_metadata: add ext_timestamp_stats
Store and retrieve the optional extended timestamp statistics
(min_live_timestamp and min_live_row_marker_timestamp)
in the scylla_metadata component.

Note that there is no need for a cluster feature to
store those attributes since the scylla_metadata
on-disk format is extensible so that old sstables
can be read by new versions, seeing the extra stats
is missing, and new sstables can be read by old
versions that ignore unknown scylla metadata section types.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:57 +03:00
Benny Halevy
14d86a3a12 sstables, memtable: track live timestamps
When garbage collecting tombstones, we care only
about shadowing of live data.  However, currently
we track min/max timestamp of both live and dead
data, but there is no problem with purging tombstones
that shadow dead data (expired or shdowed by other
tombstones in the sstable/memtable).

Also, for shadowable tombstones, we track live row marker timestamps
separately since, if the live row marker timestamp is greater than
a shadowable tombstone timestamp, then the row marker
would shadow the shadowable tombstone thus exposing the cells
in that row, even if their timestasmp may be smaller
than the shadow tombstone's.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:49 +03:00
Avi Kivity
aa1270a00c treewide: change assert() to SCYLLA_ASSERT()
assert() is traditionally disabled in release builds, but not in
scylladb. This hasn't caused problems so far, but the latest abseil
release includes a commit [1] that causes a 1000 insn/op regression when
NDEBUG is not defined.

Clearly, we must move towards a build system where NDEBUG is defined in
release builds. But we can't just define it blindly without vetting
all the assert() calls, as some were written with the expectation that
they are enabled in release mode.

To solve the conundrum, change all assert() calls to a new SCYLLA_ASSERT()
macro in utils/assert.hh. This macro is always defined and is not conditional
on NDEBUG, so we can later (after vetting Seastar) enable NDEBUG in release
mode.

[1] 66ef711d68

Closes scylladb/scylladb#20006
2024-08-05 08:23:35 +03:00
Lakshmi Narayanan Sreethar
7b58fa2534 sstables: use _origin in write path
Now that the origin is available inside the sstable object, no need to
pass it to the methods called in the write path.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:44:28 +05:30
Lakshmi Narayanan Sreethar
b762a09dcd sstable::open_sstable: pass and store origin
Pass origin when opening the sstable from the writer and store it in the
sstable object. This will make the origin available for the entire write
path.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-07-16 20:43:30 +05:30
Michał Chojnowski
1a8ee69a43 sstables/mx/writer: when creating local_compression, use the sstables's schema, not the writer's
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.

The problem fixed by this patch is that the writer was wrongly creating
the compressor objects based on its own schema, but using them based
based on the sstable's schema the sstable's schema.
This patch forces the writer to use the sstable's schema for both.
2024-07-11 12:53:54 +02:00
Michał Chojnowski
d10b38ba5b sstables/mx/writer: when creating filter, use the sstables's schema, not the writer's
There are two schema's associated with a sstable writer:
the sstable's schema (i.e. the schema of the table at the time when the
sstable object was created), and the writer's schema (equal to the schema
of the reader which is feeding into the writer).

It's easy to mix up the two and break something as a result.

The writer's schema is needed to correctly interpret and serialize the data
passing through the writer, and to populate the on-disk metadata about the
on-disk schema.

The sstables's schema is used to configure some parameters for newly created
sstable, such as bloom filter false positive ratio, or compression.

The problem fixed by this patch is that the writer was wrongly creating
the filter based on its own schema, while the layer outside the writer
was interpreting it as if it was created with the sstable's schema.

This patch forces the writer to pick the filter's parameters based on the
sstable's schema instead.
2024-07-11 12:53:54 +02:00
Lakshmi Narayanan Sreethar
c80df8504c sstables::maybe_rebuild_filter_from_index: log sstable origin
Log the sstable origin when its bloom filter is being rebuilt. The
origin has to be passed to the method by the caller as it is not
available in the sstable object when the filter is rebuilt.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#19601
2024-07-04 10:01:23 +03:00
Lakshmi Narayanan Sreethar
21e463b108 sstables/mx/writer: rebuild bloom filters with bad partition estimates
The bloom filters are built with partition estimates, as the actual
partition count might not be available in all the cases. If the estimate
was bad, the bloom filters might end up too large or too small than
their optimal sizes. Rebuild such bloom filters with actual partition
count before the filter is written to disk and the sstable is sealed.

Fixes #19049

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-06-24 12:06:02 +05:30
Lakshmi Narayanan Sreethar
afc90657d6 sstables/mx/writer: add variable to track number of partitions consumed
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-06-24 12:06:02 +05:30
Ferenc Szili
b06af5b2b9 sstable: write dead_rows count to system.large_partitions 2024-05-02 11:49:10 +02:00
Ferenc Szili
63e724c974 sstable: added counter for dead rows 2024-05-02 11:49:10 +02:00
Ferenc Szili
98bec4e02a sstable: large data handler needs to count range tombstones as rows
When issuing warnings about partitions with the number of rows above a configured threshold,
the  large partitions handler does not take into consideration the number of range tombstone
markers in the total rows count. This fix adds the number of range tombstone markers to the
total number of rows and saves this total in system.large_partitions.rows (if it is above
the threshold). It also adds a new column range_tombstones to the system.large_partitions
table which only contains the number of range tombstone markers for the given partition.

This PR fixes the first part of issue #13968
It does not cover distinguishing between live and dead rows. A subsequent PR will handle that.
2024-04-22 15:24:18 +02:00
Avi Kivity
7cb1c10fed treewide: replace seastar::future::get0() with seastar::future::get()
get0() dates back from the days where Seastar futures carried tuples, and
get0() was a way to get the first (and usually only) element. Now
it's a distraction, and Seastar is likely to deprecate and remove it.

Replace with seastar::future::get(), which does the same thing.
2024-02-02 22:12:57 +08:00
Avi Kivity
8ee75ae8f4 sstables: writer: don't require effective_replication_map for sharding metadata
Currently, we pass an effective_replication_map_ptr to sstable_writer,
so that we can get a stable dht::sharder for writing the sharding metadata.
This is needed because with tablets, the sharder can change dynamically.

However, this is both bad and unnecessary:
 - bad: holding on to an effective_replication_map_ptr is a barrier
   for topology operations, preventing tablet migrations (etc) while
   an sstable is being written
 - unnecessary: tablets don't require sharding metadata at all, since
   two tablets cannot overlap (unlike two sstables from different shards in
   the same node). So the first/last key is sufficient to determine the
   shard/tablet ownership.

Given that, just pass the sharder for vnode sstables, and don't generate
sharding metadata for tablet sstables.
2024-01-23 22:23:08 +02:00
Yaniv Kaul
c658bdb150 Typos: fix typos in comments
Fixes some typos as found by codespell run on the code.
In this commit, I was hoping to fix only comments, not user-visible alerts, output, etc.
Follow-up commits will take care of them.

Refs: https://github.com/scylladb/scylladb/issues/16255
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2023-12-02 22:37:22 +02:00
Kefu Chai
9c24be05c3 sstable/writer: log sstable name and pk when capping ldt
when the local_deletion_time is too large and beyond the
epoch time of INT32_MAX, we cap it to INT32_MAX - 1.
this is a signal of bad configuration or a bug in scylla.
so let's add more information in the logging message to
help track back to the source of the problem.

Fixes #15015
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-08-21 19:25:32 +08:00
Tomasz Grabiec
17d6163548 sstables: Generate sharding metadata using sharder from erm when writing
We need to keep sharding metadata consistent with tablet mapping to
shards in order for node restart to detect that those sstables belong
to a single shard and that resharding is not necessary. Resharding of
sstables based on tablet metadata is not implemented yet and will
abort after this series.

Keeping sharding metadata accurate for tablets is only necessary until
compaction group integration is finished. After that, we can use the
sstable token range to determine the owning tablet and thus the owning
shard. Before that, we can't, because a single sstable may contain
keys from different tablets, and the whole key range may overlap with
keys which belong to other shards.
2023-06-21 00:58:24 +02:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Pavel Emelyanov
ac1e56c9d9 sstable, storage: Virtualize data sink making for Data and Index
Add the make_data_or_index_sink() virtual method and its implementation for
filesystem_storage.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-10 16:43:01 +03:00
Pavel Emelyanov
1d4fcce5dd sstable/writer: Shuffle writer::init_file_writers()
The method needs to create two data sinks -- for Data and for Index
files -- and then wrap it with more stuff (compression, checksums,
streams, etc.). With S3 backend using file-output-stream won't work,
becase S3 storage cannot provide writable file API (it has data_sink
instead).

This patch extracts file_data_sink creation so that it could be
virtualized with storage API later.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2023-04-10 16:43:01 +03:00
Kefu Chai
c37f4e5252 treewide: use fmt::join() when appropriate
now that fmtlib provides fmt::join(). see
https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view
there is not need to revent the wheel. so in this change, the homebrew
join() is replaced with fmt::join().

as fmt::join() returns an join_view(), this could improve the
performance under certain circumstances where the fully materialized
string is not needed.

please note, the goal of this change is to use fmt::join(), and this
change does not intend to improve the performance of existing
implementation based on "operator<<" unless the new implementation is
much more complicated. we will address the unnecessarily materialized
strings in a follow-up commit.

some noteworthy things related to this change:

* unlike the existing `join()`, `fmt::join()` returns a view. so we
  have to materialize the view if what we expect is a `sstring`
* `fmt::format()` does not accept a view, so we cannot pass the
  return value of `fmt::join()` to `fmt::format()`
* fmtlib does not format a typed pointer, i.e., it does not format,
  for instance, a `const std::string*`. but operator<<() always print
  a typed pointer. so if we want to format a typed pointer, we either
  need to cast the pointer to `void*` or use `fmt::ptr()`.
* fmtlib is not able to pick up the overload of
  `operator<<(std::ostream& os, const column_definition* cd)`, so we
  have to use a wrapper class of `maybe_column_definition` for printing
  a pointer to `column_definition`. since the overload is only used
  by the two overloads of
  `statement_restrictions::add_single_column_parition_key_restriction()`,
  the operator<< for `const column_definition*` is dropped.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-16 20:34:18 +08:00
Pavel Emelyanov
0959739216 sstables: Remove always-false sstable_writer_config::leave_unsealed
It was used in sstables streaming code up until e5be3352 (database,
streaming, messaging: drop streaming memtables) or nearby, then the
whole feature was reworked.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12967
2023-02-23 12:50:06 +01:00
Kefu Chai
45f0449ccf sstables: mx/writer: remove defaulted move ctor
because its base class of `writer_impl` has a member variable
`_validator`, which has its copy ctor deleted. let's just
drop the defaulted move ctor, as compiler is not able to
generate one for us.

```
/home/kefu/dev/scylladb/sstables/mx/writer.cc:805:5: error: explicitly defaulted move constructor is implicitly deleted [-Werror,-Wdefaulted-function-deleted]
    writer(writer&& o) = default;
    ^
/home/kefu/dev/scylladb/sstables/mx/writer.cc:528:16: note: move constructor of 'writer' is implicitly deleted because base class 'sstable_writer::writer_impl' has a deleted move constructor
class writer : public sstable_writer::writer_impl {
               ^
/home/kefu/dev/scylladb/sstables/writer_impl.hh:29:48: note: copy constructor of 'writer_impl' is implicitly deleted because field '_validator' has a deleted copy constructor
    mutation_fragment_stream_validating_filter _validator;
                                               ^
/home/kefu/dev/scylladb/mutation/mutation_fragment_stream_validator.hh:188:5: note: 'mutation_fragment_stream_validating_filter' has been explicitly marked deleted here
    mutation_fragment_stream_validating_filter(const mutation_fragment_stream_validating_filter&) = delete;
    ^
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes #12877
2023-02-15 23:06:10 +02:00
Kefu Chai
0cb842797a treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-15 22:57:18 +02:00
Avi Kivity
69a385fd9d Introduce schema/ module
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.

Closes #12858
2023-02-15 11:01:50 +02:00
Avi Kivity
c5e4bf51bd Introduce mutation/ module
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.

mutation_reader remains in the readers/ module.

mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.

This is a step forward towards librarization or modularization of the
source base.

Closes #12788
2023-02-14 11:19:03 +02:00
Avi Kivity
7c7eb81a66 Merge 'Encapsulate filesystem access by sstable into filesystem_storage subsclass' from Pavel Emelyanov
This is to define the API sstable needs from underlying storage. When implementing object-storage backend it will need to implement those. The API looks like

        future<> snapshot(const sstable& sst, sstring dir, absolute_path abs) const;
        future<> quarantine(const sstable& sst, delayed_commit_changes* delay);
        future<> move(const sstable& sst, sstring new_dir, generation_type generation, delayed_commit_changes* delay);
        void open(sstable& sst, const io_priority_class& pc); // runs in async context
        future<> wipe(const sstable& sst) noexcept;

        future<file> open_component(const sstable& sst, component_type type, open_flags flags, file_open_options options, bool check_integrity);

It doesn't have "list" or alike, because it's not a method of an individual sstable, but rather the one from sstables_manager. It will come as separate PR.

Closes #12217

* github.com:scylladb/scylladb:
  sstable, storage: Mark dir/temp_dir private
  sstable: Remove get_dir() (well, almost)
  sstable: Add quarantine() method to storage
  sstable: Use absolute/relative path marking for snapshot()
  sstable: Remove temp_... stuff from sstable
  sstable: Move open_component() on storage
  sstable: Mark rename_new_sstable_component_file() const
  sstable: Print filename(type) on open-component error
  sstable: Reorganize new_sstable_component_file()
  sstable: Mark filename() private
  sstable: Introduce index_filename()
  tests: Disclosure private filename() calls
  sstable: Move wipe_storage() on storage
  sstable: Remove temp dir in wipe_storage()
  sstable: Move unlink parts into wipe_storage
  sstable: Remove get_temp_dir()
  sstable: Move write_toc() to storage
  sstable: Shuffle open_sstable()
  sstable: Move touch_temp_dir() to storage
  sstable: Move move() to storage
  sstable: Move create_links() to storage
  sstable: Move seal_sstable() to storage
  sstable: Tossing internals of seal_sstable()
  sstable: Move remove_temp_dir() to storage
  sstable: Move create_links_common() to storage
  sstable: Move check_create_links_replay() to storage
  sstable: Remove one of create_links() overloads
  sstable: Remove create_links_and_mark_for_removal()
  sstable: Indentation fix after prevuous patch
  sstable: Coroutinize create_links_common()
  sstable: Rename create_links_common()'s "dir" argument
  sstable: Make mark_for_removal bool_class
  sstable, table: Add sstable::snapshot() and use in table::take_snapshot
  sstable: Move _dir and _temp_dir on filesystem_storage
  sstable: Use sync_directory() method
  test, sstable: Use component_basename in test
  sstables: Move read_{digest|checksum} on sstable
2022-12-18 17:29:35 +02:00
Pavel Emelyanov
636d49f1c1 sstable: Shuffle open_sstable()
When an sstable is prepared to be written on disk the .write_toc() is
called on it which created temporary toc file. Prior to this, the writer
code calls generate_toc() to collect components on the sstable.

This patch adds the .open_sstable() API call that does both. This
prepares the write_toc() part to be moved to storage, because it's not
just "write data into TOC file", it's the first step in transaction
implemeted on top of rename()s.

The test need care -- there's rewrite_toc_without_scylla_component()
thing in utils that doesn't want the generate_toc() part to be called.
It's not patched here and continues calling .write_toc().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-12-15 10:14:49 +03:00
Tomasz Grabiec
23e4c83155 position_in_partition: Make after_key() work with non-full keys
This fixes a long standing bug related to handling of non-full
clustering keys, issue #1446.

after_key() was creating a position which is after all keys prefixed
by a non-full key, rather than a position which is right after that
key.

This will issue will be caught by cql_query_test::test_compact_storage
in debug mode when mutation_partition_v2 merging starts inserting
sentinels at position after_key() on preemption.

It probably already causes problems for such keys.
2022-12-14 14:47:33 +01:00
Benny Halevy
7286f5d314 sstables: mx/writer: optimize large data stats members order
Since `_partition_size_entry` and `_rows_in_partition_entry`
are accessed at the same time when updated, and similarly
`_cell_size_entry` and `_elements_in_collection_entry`,
place the member pairs closely together to improve data
cache locality.

Follow the same order when preparing the
`scylla_metadata::large_data_stats` map.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-05 10:54:04 +03:00
Benny Halevy
8c8a0adb40 sstables: mx/writer: keep large data stats entry as members
To save the map lookup on the hot write path,
keep each large data stats entry as a member in the writer
object and build a map for storing the disk_hash in the
scylla metadata only when finalizing it in consume_end_of_stream.

Fixes #11686

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-05 10:54:04 +03:00
Benny Halevy
6dadca2648 db/large_data_handler: maybe_record_large_cells: consider collection_elements
Detect large_collections when the number of collection_elements
is above the configured threshold.

Next step would be to record the number of collection_elements
in the system.large_cells table, when the respective
cluster feature is enabled.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:42:05 +03:00
Benny Halevy
7dead10742 sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells
And update the sstable elements_in_collection
stats entry.

Next step would be to forward it to
large_data_handler().maybe_record_large_cells().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:41:58 +03:00
Benny Halevy
54ab038825 sstables: mx/writer: add large_data_type::elements_in_collection
Add a new large_data_stats type and entry for keeping
the collection_elements_count_threshold and the maximum value
of collection_elements.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-10-04 08:41:56 +03:00
Benny Halevy
ae7fd1c7b2 sstables: do not include db/large_data_handler.hh in sstables.hh
Reduce dependencies by only forward-declaring
class db::large_data_handler in sstables.hh

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-29 12:42:58 +03:00
Raphael S. Carvalho
e2ccafbe38 compaction: Add support to split large partitions
Adds support for splitting large partitions during compaction.

Large partitions introduce many problems, like memory overhead and
breaks incremental compaction promise. We want to split large
partitions across fixed-size fragments. We'll allow a partition
to exceed size limit by 10%, as we don't want to unnecessarily split
partitions that just crossed the limit boundary.

To avoid having to open a minimal of 2 fragments in a read, partition
tombstone will be replicated to every fragment storing the
partition.

The splitting isn't enabled by default, and can be used by
strategies that are run aware like ICS. LCS still cannot support
it as it's still using physical level metadata, not run id.

An incremental reader for sstable runs will follow soon.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-09-14 13:23:16 -03:00
Benny Halevy
7747b8fa33 sstables: define run_identifier as a strong tagged_uuid type
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #11321
2022-08-18 19:03:10 +03:00
Avi Kivity
f5062f4b5a Merge 'Use generation_type for SSTable ancestors' from Raphael "Raph" Carvalho
To avoid a discrepancy about underlying generation type once something other than integer is allowed for the sstable generation.
Also simplifies one generic writer interface for sealing sstable statistics.

Closes #10703

* github.com:scylladb/scylla:
  sstables: Use generation_type for compaction ancestors
  sstables: Make compaction ancestors optional when sealing statistics
2022-06-01 19:55:08 +03:00
Raphael S. Carvalho
d36604703f sstables: Make compaction ancestors optional when sealing statistics
Compaction ancestors is only available in versions older than mx,
therefore we can make it optional in seal_statistics(). The motivation
is that mx writer will no longer call sstable::compaction_ancestors()
which return type will be soon changed to type generation_type, so the
returned value can be something other than an integer, e.g. uuid.
We could kill compaction_ancestors in seal_statistics interface, but
given that most generic write functions still work for older versions,
if there were still a writer for them, I decided to not do it now.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-05-31 15:26:03 -03:00
Mikołaj Sielużycki
bc18e97473 sstable_writer: Fix mutation order violation
The change
- adds a test which exposes a problem of a peculiar setup of
tombstones that trigger a mutation fragment stream validation exception
- fixes the problem

Applying tombstones in the order:

range_tombstone_change pos(ck1), after_all_prefixed, tombstone_timestamp=1
range_tombstone_change pos(ck2), before_all_prefixed, tombstone=NONE
range_tombstone_change pos(NONE), after_all_prefixed, tombstone=NONE

Leads to swapping the order of mutations when written and read from
disk via sstable writer. This is caused by conversion of
range_tombstone_change (in memory representation) to range tombstone
marker (on disk representation) and back.

When this mutation stream is written to disk, the range tombstone
markers type is calculated based on the relationship between
range_tombstone_changes. The RTC series as above produces markers
(start, end, start). When the last marker is loaded from disk, it's kind
gets incorrectly loaded as before_all_prefixed instead of
after_all_prefixed. This leads to incorrect order of mutations.

The solution is to skip writing a new range_tombstone_change with empty
tombstone if the last range_tombstone_change already has empty
tombstone. This is redundant information and can be safely removed,
while the logic of encoding RTCs as markers doesn't handle such
redundancy well.

Closes #10643
2022-05-31 13:39:48 +03:00
Benny Halevy
33bad72fd2 sstables: mx: add pi_auto_scale_events metric
Counts the number of promoted index auto-scale events.

A large number of those, relative to `partition_writes`,
indicates that `column_index_size_in_kb` should be increased.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-24 13:32:39 +03:00
Benny Halevy
6677028212 sstables: mx/writer: auto-scale promoted index
Add column_index_auto_scale_threshold_in_kb to the configuration
(defaults to 10MB).

When the promoted index (serialized) size gets to this
threshold, it's halved by merging each two adjacent blocks
into one and doubling the desired_block_size.

Fixes #4217

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-24 13:32:35 +03:00
Botond Dénes
105bf8888a sstables: convert mx writer to v2
The sstables::sstable class has two methods for writing sstables:
1) sstable_writer get_writer(...);
2) future<> write_components(flat_mutation_reader, ...);

(1) directly exposes the writer type, so we have to update all users of
it (there is not that many) in this same patch. We defer updating
users of (2) to a follow-up commits.
2022-03-10 07:03:49 +02:00
Botond Dénes
11adb404c6 sstables/metadata_collector: use position_in_partition for min/max keys
Instead of naked clustering keys. Working with the latter is dangerous
because it cannot accurately represent the entire clustering domain: it
cannot represent positions between (before/after) keys. For this reason
the metadata collector had a separate update_min_max_components()
overload for range tombstones because the positions of these cannot be
represented by clustering keys alone.
Moving to position_in_partition solves this problem and it is now enough
to have a single overload with position_in_partition_view. This is also
more future proof as it will work with range tombstone changes without
any additional changes.
2022-03-10 07:03:49 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Avi Kivity
4d7a013e94 sstables: mx: writer: make large partition stats accounting branch-free
It is bad form to introduce branches just for statistics, since branches
can be expensive (even when perfectly predictable, they consume branch
history resources). Switch to simple addition instead; this should be
not cause any cache misses since we already touch other statistics
earlier.

The inputs are already boolean, but cast them to boolean just so it
is clear we're adding 0/1, not a count.

Closes #9626
2021-11-15 11:28:48 +02:00
Michael Livshin
a7511cf600 system keyspace: record partitions with too many rows
Add "rows" field to system.large_partitions.  Add partitions to the
table when they are too large or have too many rows.

Fixes #9506

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>

Closes #9577
2021-11-14 14:25:18 +02:00