Commit Graph

38 Commits

Author SHA1 Message Date
Raphael S. Carvalho
2c9b13d2d1 compaction: Check for key presence in memtable when calculating max purgeable timestamp
It was observed that some use cases might append old data constantly to
memtable, blocking GC of expired tombstones.

That's because timestamp of memtable is unconditionally used for
calculating max purgeable, even when the memtable doesn't contain the
key of the tombstone we're trying to GC.

The idea is to treat memtable as we treat L0 sstables, i.e. it will
only prevent GC if it contains data that is possibly shadowed by the
expired tombstone (after checking for key presence and timestamp).

Memtable will usually have a small subset of keys in largest tier,
so after this change, a large fraction of keys containing expired
tombstones can be GCed when memtable contains old data.

Fixes #17599.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#17835
2024-03-18 13:37:44 +02:00
Kefu Chai
3f0fbdcd86 replica: do not include unused headers
these unused includes were identified by clangd. see
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16810
2024-01-17 09:27:09 +02:00
Kefu Chai
0092700ad1 memtable: add formatter for replica::{memtable,memtable_entry}
before this change, we rely on the default-generated fmt::formatter
created from operator<<, but fmt v10 dropped the default-generated
formatter.

in this change, we define a formatter for replica::memtable and
replica::memtable_entry, and remove their operator<<().

Refs #13245

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#16793
2024-01-16 16:46:52 +02:00
Avi Kivity
02111d6754 memtable: consolidate _read_section, _allocating_section in a struct
Those two members are passed from memtable_list to memtable. Since we
wish to pass them from table, it becomes awkward to pass them as two
separate variables as their contents are specific to memtable internals.

Wrap them in a name that indicates their role (being table-wide shared
data for memtables) and pass them as a unit.
2023-12-26 21:11:48 +02:00
Avi Kivity
7d5e22b43b replica: memtable: don't forget memtable memory allocation statistics
A memtable object contains two logalloc::allocating_section members
that track memory allocation requirements during reads and writes.
Because these are local to the memtable, each time we seal a memtable
and create a new one, these statistics are forgotten. As a result
we may have to re-learn the typical size of reads and writes, incurring
a small performance penalty.

The solution is to move the allocating_section object to the memtable_list
container. The workload is the same across all memtables of the same
table, so we don't lose discrimination here.

The performance penalty may be increased later if log changes to
memory reserve thresholds including a backtrace, so this reduces the
odds of incurring such a penalty.

Closes scylladb/scylladb#15737
2023-10-18 17:43:33 +02:00
Pavel Emelyanov
66e43912d6 code: Switch to seastar API level 7
In that level no io_priority_class-es exist. Instead, all the IO happens
in the context of current sched-group. File API no longer accepts prio
class argument (and makes io_intent arg mandatory to impls).

So the change consists of
- removing all usage of io_priority_class
- patching file_impl's inheritants to updated API
- priority manager goes away altogether
- IO bandwidth update is performed on respective sched group
- tune-up scylla-gdb.py io_queues command

The first change is huge and was made semi-autimatically by:
- grep io_priority_class | default_priority_class
- remove all calls, found methods' args and class' fields

Patching file_impl-s is smaller, but also mechanical:
- replace io_priority_class& argument with io_intent* one
- pass intent to lower file (if applicatble)

Dropping the priority manager is:
- git-rm .cc and .hh
- sed out all the #include-s
- fix configure.py and cmakefile

The scylla-gdb.py update is a bit hairry -- it needs to use task queues
list for IO classes names and shares, but to detect it should it checks
for the "commitlog" group is present.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #13963
2023-06-06 13:29:16 +03:00
Michał Chojnowski
0273101890 partition_version: remove the unused "from" argument in partition_entry::upgrade()
partition_entry now contains a reference to its schema, so it doesn't have to
be supplied by the caller anymore.
2023-05-04 02:37:30 +02:00
Michał Chojnowski
94e4dc3d8d partition_version: add a logalloc::region argument to partition_entry::upgrade()
The argument is currently unused, but will be further propagated to
add_version() in an upcoming patch.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
98dfe3355e memtable: propagate the region to memtable_entry::upgrade_schema()
Adds a logalloc::region argument to upgrade_schema().
It's currently unused, but will be further propagated to
partition_entry::upgrade() in an upcoming patch.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
caaf0bd6bf partition_version: remove _schema from partition_entry::operator<<
operator<< accepts a schema& and a partition_entry&. But since the latter
now contains a reference to its schema inside, the former is redundant.
Remove it.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
f6e11c95e2 partition_version: remove the schema argument from partition_entry::read()
partition_entry now contains a reference to its schema, so it no longer
needs to be supplied by the caller.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
4e4ae43a84 memtable: remove _schema from memtable_entry
After adding a _schema field to each partition version,
the field in memtable_entry is redundant. It can be always recovered
from the latest version. Remove it.
2023-05-04 02:37:29 +02:00
Michał Chojnowski
a70c5704df mutation_partition: change schema_ptr to schema& in mutation_partition constructor
Cosmetic change. See the preceding commit for details.
2023-05-04 02:37:29 +02:00
Kefu Chai
c37f4e5252 treewide: use fmt::join() when appropriate
now that fmtlib provides fmt::join(). see
https://fmt.dev/latest/api.html#_CPPv4I0EN3fmt4joinE9join_viewIN6detail10iterator_tI5RangeEEN6detail10sentinel_tI5RangeEEERR5Range11string_view
there is not need to revent the wheel. so in this change, the homebrew
join() is replaced with fmt::join().

as fmt::join() returns an join_view(), this could improve the
performance under certain circumstances where the fully materialized
string is not needed.

please note, the goal of this change is to use fmt::join(), and this
change does not intend to improve the performance of existing
implementation based on "operator<<" unless the new implementation is
much more complicated. we will address the unnecessarily materialized
strings in a follow-up commit.

some noteworthy things related to this change:

* unlike the existing `join()`, `fmt::join()` returns a view. so we
  have to materialize the view if what we expect is a `sstring`
* `fmt::format()` does not accept a view, so we cannot pass the
  return value of `fmt::join()` to `fmt::format()`
* fmtlib does not format a typed pointer, i.e., it does not format,
  for instance, a `const std::string*`. but operator<<() always print
  a typed pointer. so if we want to format a typed pointer, we either
  need to cast the pointer to `void*` or use `fmt::ptr()`.
* fmtlib is not able to pick up the overload of
  `operator<<(std::ostream& os, const column_definition* cd)`, so we
  have to use a wrapper class of `maybe_column_definition` for printing
  a pointer to `column_definition`. since the overload is only used
  by the two overloads of
  `statement_restrictions::add_single_column_parition_key_restriction()`,
  the operator<< for `const column_definition*` is dropped.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-03-16 20:34:18 +08:00
Kefu Chai
0cb842797a treewide: do not define/capture unused variables
these warnings are found by Clang-17 after removing
`-Wno-unused-lambda-capture` and '-Wno-unused-variable' from
the list of disabled warnings in `configure.py`.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-15 22:57:18 +02:00
Avi Kivity
c5e4bf51bd Introduce mutation/ module
Move mutation-related files to a new mutation/ directory. The names
are kept in the global namespace to reduce churn; the names are
unambiguous in any case.

mutation_reader remains in the readers/ module.

mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this
patch.

This is a step forward towards librarization or modularization of the
source base.

Closes #12788
2023-02-14 11:19:03 +02:00
Avi Kivity
8d37370a71 Revert "Merge "memtable-sstable: Add compacting reader when flushing memtable." from Mikołaj"
This reverts commit bcadd8229b, reversing
changes made to cf528d7df9. Since
4bd4aa2e88 ("Merge 'memtable, cache: Eagerly
compact data with tombstones' from Tomasz Grabiec"), memtable is
self-compacting and the extra compaction step only reduces throughput.

The unit test in memtable_test.cc is not reverted as proof that the
revert does not cause a regression.

Closes #11243
2022-08-09 11:23:29 +03:00
Benny Halevy
fcb3347c7a memtable: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Benny Halevy
2d1ba0d7d8 memtable: memtable_encoding_stats_collector: mark functions noexcept
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 13:43:17 +03:00
Avi Kivity
2cb5f79e9d logalloc, dirty_memory_manager: move size-tracking binomial heap out of logalloc
The region_group mechanism used an intrusive heap handle embedded in
logalloc::region to allow region_group:s to track the largest region. But
with region_group moved out of logalloc, the handle is out of place.

Move it out, introducing a new intermediate class size_tracked_region
to hold the heap handle. We might eventually merge the new class into
memtable (which derives from it), but that requires a large rearrangement
of unit tests, so defer that.
2022-07-26 11:12:10 +03:00
Avi Kivity
ee720fa23b logalloc: relax lifetime rules around region_listener
Currently, a region_listener is added during construction and removed
during destruction. This was done to mimick the old region(region_group&)
constructor, as region_listener replaces region_group.

However, this makes moving the binomial heap handle outside logalloc
difficult. The natural place for the handle is in a derived class
of logalloc::region (e.g. memtable), but members of this derived class
will be destroyed earlier than the logalloc::region here. We could play
trickes with an earlier base class but it's better to just decouple
region lifecycle from listener lifecycle.

Do that be adding listen()/unlisten() methods. Some small awkwardness
remains in that merge() implicitly unlistens (see comment in
region::unlisten).

Unit tests are adjusted.
2022-07-26 11:12:10 +03:00
Avi Kivity
cb1251199a memtable: stop using logalloc::region::group() to test for flushed memtables
Currently, the memtable reader uses logalloc::region::group() to test
for whether a memtable has been flushed. If a memtable doesn't belong
to a region group (from dirty_memory_manager), it is flushed.

This is quite tortuous - logalloc::region::merge() makes the merged-from
region identical to the merged-to region. The merged-to region, the cache,
doesn't have a group, so the check works.

Since we're making region groups part of dirty_memory_manager, the cache
will no longer have this indirect way of communication with memtable. But
instead we can use a direct callback it already has -
on_detach_from_region_group(). Use that to set a flag, and examine it in
the read path.
2022-07-26 11:07:25 +03:00
Tomasz Grabiec
53026f3ba6 memtable: Subtract from flushed memory when cleaning
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.

This will start happening after tombstones are compacted with rows on
partition version merging.

This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
2022-06-15 11:30:25 +02:00
Tomasz Grabiec
a4e96960b8 mvcc: Apply mutations in memtable with preemption enabled
Preerequisite for eagerly applying tombstones, which we want to be
preemptible. Before the patch, apply path to the memtable was not
preemptible.

Because merging can now be defered, we need to involve snapshots to
kick-off background merging in case of preemption. This requires us to
propagate region and cleaner objects, in order to create a snapshot.
2022-06-15 11:29:43 +02:00
Avi Kivity
5129280f45 Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec"
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.

Fixes #10780
Reopens #652.
2022-06-14 18:06:22 +03:00
Tomasz Grabiec
9135d1fd1f memtable: Subtract from flushed memory when cleaning
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.

This will start happening after tombstones are compacted with rows on
partition version merging.

This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
2022-06-06 19:25:41 +02:00
Tomasz Grabiec
0e3c4fc641 mvcc: Apply mutations in memtable with preemption enabled
Preerequisite for eagerly applying tombstones, which we want to be
preemptible. Before the patch, apply path to the memtable was not
preemptible.

Because merging can now be defered, we need to involve snapshots to
kick-off background merging in case of preemption. This requires us to
propagate region and cleaner objects, in order to create a snapshot.
2022-06-06 19:23:37 +02:00
Botond Dénes
4f77e74bd4 partition_snapshot_reader: convert implementation to native v2
The underlying mutation representation is still v1, so the
implementation still has to do conversion. This happens right above the
lsa reader component.
2022-04-28 14:12:12 +03:00
Avi Kivity
a9812166cd replica, partition_snapshot_reader, keys: replace boost::any with std::any
Reduce #include load by standardizing on std::any.

In keys.cc, we just drop the unneeded include.

One instance of boost::any remains in config_file, due to a tie-in with
other boost components.

Closes #10441
2022-04-28 07:18:53 +03:00
Avi Kivity
e7fb71020b Merge 'replica: Optimize empty_flat_reader out of the hot path' from Michał Chojnowski
When row_cache::make_reader() and memtable::make_flat_reader() see that the query result is empty, they return empty_flat_reader, which is a trivial implementation of flat_mutation_reader.
Even though empty_flat_reader doesn't do anything meaningful, it still needs to be created, handled in merging_reader and destroyed. Turns out this is costly.

This patch series replaces hot path uses of empty_flat_reader with an empty optional.

Performance effects:

`perf_simple_query --smp 1`
TPS: 138k -> 168k
allocs/op: 80.2 -> 71.1
insns/op: 49.9k -> 45.1k

`perf_simple_query --smp 1 --enable-cache=1 --flush`
TPS: 125k -> 150k
allocs/op: 79.2 -> 71.1
insns/op: 51.7k -> 47.2k

For a cassandra-stress benchmark (localhost, 100% cache reads) this translates to a TPS increase from ~42k to ~48k per hyperthread.

Note that this optimization is effective for single-partition reads where the queried partition is only in cache/sstables or only in memtables. Other queries (e.g. where the partition is in both cache in memtables and needs to be merged) are unaffected.

Closes #10204

* github.com:scylladb/scylla:
  replica: Prefer row_cache::make_reader_opt() to row_cache::make_reader()
  row_cache: Add row_cache::make_reader_opt()
  replica: Prefer memtable::make_flat_reader_opt() to memtable::make_flat_reader()
  memtable: Add memtable::make_flat_reader_opt()

[avi: adjust #include for readers/ split]
2022-03-14 14:07:00 +02:00
Mikołaj Sielużycki
1d84a254c0 flat_mutation_reader: Split readers by file and remove unnecessary includes.
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.

With changes

real	29m14.051s
user	168m39.071s
sys	5m13.443s

Without changes

real	30m36.203s
user	175m43.354s
sys	5m26.376s

Closes #10194
2022-03-14 13:20:25 +02:00
Michał Chojnowski
f211ef9d71 replica: Prefer memtable::make_flat_reader_opt() to memtable::make_flat_reader()
The former is significantly cheaper when there is nothing to be read.
2022-03-14 12:02:49 +01:00
Michał Chojnowski
218f2b6e98 memtable: Add memtable::make_flat_reader_opt()
When there is nothing to read, make_flat_reader() returns an empty (no-op)
reader. But it turns out that constructing, combining and destroying that
empty reader is quite costly.
As an optimization, add an alternative version which returns an empty optional
instead.
2022-03-14 12:02:49 +01:00
Mikołaj Sielużycki
f4c57cbe87 memtable: Convert partition_snapshot_flat_reader to v2.
This is a facade change only, the make_partition_snapshot_flat_reader
function calls upgrade_to_v2 internally.

Closes #10152
2022-03-02 15:07:36 +02:00
Michael Livshin
34ed752885 memtable::make_flush_reader(): return flat_mutation_reader_v2
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michael Livshin
9bacce4359 memtable::make_flat_reader(): return flat_mutation_reader_v2
This is just a facade change.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Michał Radwański
2a3bd40c69 memtable: upgrade scanning_reader and flush_reader to v2
This change is a part of effort to migrate existing readers from old API
to the new one. The corresponding make_flush_reader and
make_flat_reader functions still return flat_mutation_reader.
2022-02-28 17:11:54 +02:00
Avi Kivity
cbba80914d memtable: move to replica module and namespace
Memtables are a replica-side entity, and so are moved to the
replica module and namespace.

Memtables are also used outside the replica, in two places:
 - in some virtual tables; this is also in some way inside the replica,
   (virtual readers are installed at the replica level, not the
   cooordinator), so I don't consider it a layering violation
 - in many sstable unit tests, as a convenient way to create sstables
   with known input. This is a layering violation.

We could make memtables their own module, but I think this is wrong.
Memtables are deeply tied into replica memory management, and trying
to make them a low-level primitive (at a lower level than sstables) will
be difficult. Not least because memtables use sstables. Instead, we
should have a memtable-like thing that doesn't support merging and
doesn't have all other funky memtable stuff, and instead replace
the uses of memtables in sstable tests with some kind of
make_flat_mutation_reader_from_unsorted_mutations() that does
the sorting that is the reason for the use of memtables in tests (and
live with the layering violation meanwhile).

Test: unit (dev)

Closes #10120
2022-02-23 09:05:16 +02:00