Commit Graph

416 Commits

Author SHA1 Message Date
Botond Dénes
4d2ce5c304 mutation_compactor: remove emit_only_live_rows template parameter
Now that we use emit_only_live_rows::no everywhere we can remove this
template parameters. Only the template parameter is removed, the
internal logic around it is left in place (will be removed in a next
patch), by hard-wiring `only_live()`.
2022-07-12 08:43:49 +03:00
Botond Dénes
bedc82e52c tree: use emit_only_live_rows::no
emit_only_live_rows is a convenience so downstream consumers of the
mutation compactors don't have to check the `bool is_live` already
passed to them. This convenience however causes a template parameter and
additional logic for the compactor. As the most prominent of these
consumers (the query result builder) will soon have to switch to
emit_only_live_rows::no for other reasons anyway (it will want to count
tombstones), we take the opportunity to switch everybody to ::no. This
can be done with very little additional complexity to these consumer --
basically an additional if or two.
This prepares the ground for removing this template parameter and the
associate logic from the compactor.
2022-07-12 08:41:51 +03:00
Botond Dénes
fd5f8f2275 query: have replica provide the last position
Use the recently introduced query-result facility to have the replica
set the position where the query should continue from. For now this is
the same as what the implicit position would have been previously (last
row in result), but it opens up the possibility to stop the query at a
dead row.
2022-06-23 13:36:24 +03:00
Michał Chojnowski
a061eb9e76 mutation_fragment: pass the applied row by reference in clustering_row::apply()
Currently, clustering_row::apply() takes deletable_row by reference, but
copies it before passing it to deletable_row::apply(). This is more expensive
than passing the reference down (by about 1800 instructions for
perf_simple_query rows).
2022-06-20 15:22:17 +02:00
Tomasz Grabiec
169025d9b4 memtable: Add counters for tombstone compaction 2022-06-15 11:30:25 +02:00
Tomasz Grabiec
94f9109bea memtable, cache: Eagerly compact data with tombstones
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drpo tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.
2022-06-15 11:30:25 +02:00
Tomasz Grabiec
53026f3ba6 memtable: Subtract from flushed memory when cleaning
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.

This will start happening after tombstones are compacted with rows on
partition version merging.

This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
2022-06-15 11:30:25 +02:00
Tomasz Grabiec
cd523214a2 mvcc: Introduce apply_resume to hold state for partition version merging
Partition version merging is preemptable. It may stop in the middle
and be resumed later. Currently, all state is kept inside the versions
themselves, in the form of elements in the source version which are
yet to be moved. This will change once we add compaction (tombstones
with rows) into the merging algorithm. There, state cannot be encoded
purley within versions. Consider applying a partition tombstone over
large number of rows.

This patch introduces apply_rows object to hold the necessary state to
make sure forward progress in case of preemption.

No change in behavior yet.
2022-06-15 11:30:01 +02:00
Tomasz Grabiec
44bb9d495b mutation_partition: Extract deletable_row::compact_and_expire() 2022-06-15 11:30:01 +02:00
Avi Kivity
5129280f45 Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec"
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.

Fixes #10780
Reopens #652.
2022-06-14 18:06:22 +03:00
Tomasz Grabiec
0bc45f9666 memtable: Add counters for tombstone compaction 2022-06-06 19:25:41 +02:00
Tomasz Grabiec
beadd248e3 memtable, cache: Eagerly compact data with tombstones
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drpo tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.
2022-06-06 19:25:41 +02:00
Tomasz Grabiec
9135d1fd1f memtable: Subtract from flushed memory when cleaning
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.

This will start happening after tombstones are compacted with rows on
partition version merging.

This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
2022-06-06 19:25:41 +02:00
Tomasz Grabiec
989ef88e26 mvcc: Introduce apply_resume to hold state for partition version merging
Partition version merging is preemptable. It may stop in the middle
and be resumed later. Currently, all state is kept inside the versions
themselves, in the form of elements in the source version which are
yet to be moved. This will change once we add compaction (tombstones
with rows) into the merging algorithm. There, state cannot be encoded
purley within versions. Consider applying a partition tombstone over
large number of rows.

This patch introduces apply_rows object to hold the necessary state to
make sure forward progress in case of preemption.

No change in behavior yet.
2022-06-06 19:25:41 +02:00
Tomasz Grabiec
080c403d0b mutation_partition: Extract deletable_row::compact_and_expire() 2022-06-06 19:23:37 +02:00
Benny Halevy
ca1b616092 mutation: add consume_gently
Allow yielding when consuming a mutation,
and use in to_data_query_result.

Fixes #10038

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-05 13:32:25 +03:00
Benny Halevy
09fb2c983a frozen_mutation: add consume_gently
Allow yielding when consuming a frozen_mutation,
and use in to_data_query_result.

Refs #10038

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-05 13:32:25 +03:00
Benny Halevy
c9612855c7 query: coroutinize to_data_query_result
Reduce stalls by maybe yielding in-between partitions,
and by awaiting unfreeze_gently where possible.

Refs #10038

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-05-05 13:32:25 +03:00
Benny Halevy
abbf5de68c frozen_mutation: introduce consume method
Allowing to consume the frozen_mutation directly
to a stream rather than unfreezing it first
and then consuming the unfrozen mutation.

Streaming directly from the frozen_mutation
saves both cpu and memory, and will make it
easier to be made async as a follow, to allow
yielding, e.g. between rows.

This is used today only in to_data_query_result
which is invoked on the read-repair path.

Refs #10038
Fixes #10021

Test: unit(release)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Message-Id: <20220405055807.1834494-1-bhalevy@scylladb.com>
2022-04-05 10:51:21 +03:00
Botond Dénes
21584262be query_result_builder: remove v1 support
Amounts to dropping (the noop) range tombstone consume() overload.
2022-03-11 09:24:17 +02:00
Botond Dénes
87ac2e9ab0 tree: migrate to the v2 consumer APIs 2022-03-11 09:24:05 +02:00
Botond Dénes
4629f7d7b5 reconcilable_result_builder: add v2 support
Add a `consume()` overload for range tombstone changes and convert them
internally to range tombstones, as the underlying reconcilable result
is still v1.
2022-03-11 09:24:05 +02:00
Botond Dénes
728c14549f query_result_writer: add v2 support
Add a consume() overload which takes a range tombstone change and drops
it just like the existing range tombstone overload does: query results
don't care about range tombstones.
2022-03-11 09:22:14 +02:00
Botond Dénes
d61f934c50 query_result_builder: make consume(range_tombstone) noop
The downstream consumer (mutation_querier) already ignores range
tombstones, so no point forwarding them to it. This makes adding v2
support easier too as range tombstone changes can be similarly dropped.
2022-03-11 08:39:12 +02:00
Pavel Emelyanov
645896335d code: Convert is_same+result_of assertions into invocable concepts
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-24 19:46:10 +03:00
Botond Dénes
2941803da0 mutation_partition: do_compact(): do drop row tombstones covered by higher order tombstones
The comment on the public methods calling said method promises to do so
but doesn't actually follows through. This patch fixes this for row
tombstones, to mirror the behaviour of the mutation compactor. This is
especially important for tests that compare mutations compacted with
different methods.
2022-02-21 12:29:24 +02:00
Botond Dénes
d4ac473f7d mutation: counter_write_query: use v2 reader 2022-02-21 12:27:55 +02:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Botond Dénes
aa3c943f4c mutation_reader: remove unecessary stable_flattened_mutations_consumer
Said wrapper was conceived to make unmovable `compact_mutation` because
readers wanted movable consumers. But `compact_mutation` is movable for
years now, as all its unmovable bits were moved into an
`lw_shared_ptr<>` member. So drop this unnecessary wrapper and its
unnecessary usages.
2022-01-07 13:52:07 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Piotr Jastrzebski
1ca39458f2 result_memory_accounter: use new max_result_size::get_page_size in check_local_limit
This means when page_size is sent together with read_command it will be
used for paged queries instead of the hard_limit.

Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
2021-12-28 16:38:01 +01:00
Pavel Emelyanov
e03f7191d9 mutation_partition: Use B-tree insertion sugar
The B-tree insertion methods accept smart pointers and
automatically release the ownership after exception-risky
part is passed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-12-10 12:35:12 +03:00
Mikołaj Sielużycki
6dd9f63f3b memtable-sstable: Track existence of tombstones in memtable.
Add flags if memtable contains tombstones. They can be used as a
heuristic to determine if a memtable should be compacted on
flush. It's an intermediate step until we can compact during applying
mutations on a memtable.
2021-11-29 13:06:12 +01:00
Avi Kivity
1aff7d19c2 treewide: replace seastar::fmt_print() with fmt::print()
We shouldn't be using Seastar as a text formatting library; that's
not its focus. Use fmt directly instead. fmt::print() doesn't return
the output stream which is a minor inconvenience, but that's life.

Closes #9556
2021-11-01 10:05:16 +02:00
Piotr Jastrzebski
79de151158 cache_tracker: remove unused parameter from on_remove
Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com>
Message-Id: <f66ad391d86963b43b2a01e957887ea597e591e8.1632992165.git.piotr@scylladb.com>
2021-09-30 13:03:13 +03:00
Kamil Braun
28193805e5 mutation_partition: fix exception message in append_clustered_row 2021-09-19 13:47:19 +03:00
Botond Dénes
502a45ad58 treewide: switch to native reversed format for reverse reads
We define the native reverse format as a reversed mutation fragment
stream that is identical to one that would be emitted by a table with
the same schema but with reversed clustering order. The main difference
to the current format is how range tombstones are handled: instead of
looking at their start or end bound depending on the order, we always
use them as-usual and the reversing reader swaps their bounds to
facilitate this. This allows us to treat reversed streams completely
transparently: just pass along them a reversed schema and all the
reader, compacting and result building code is happily ignorant about
the fact that it is a reversed stream.
2021-09-09 15:42:15 +03:00
Botond Dénes
0af5a8add0 mutation: consume(): add native reverse order
The existing consume_in_reverse::yes is renamed to
consume_in_reverse::legacy_half_reverse and consume_in_reverse::yes now
means native reverse order. This is because we expect the legacy order
to die out at one point and when that happens we can just remove that
ugly third option and will be left with yes and no as before.
2021-09-09 14:18:32 +03:00
Pavel Emelyanov
5515f7187d range_tombstone, code: Add range_tombstone& getters
Currently all the code operates on the range_tombstone class.
and many of those places get the range tombstone in question
from the range_tombstone_list. Next patches will make that list
carry (and return) some new object called range_tombstone_entry,
so all the code that expects to see the former one there will
need to patched to get the range_tombstone from the _entry one.

This patch prepares the ground for that by introdusing the

    range_tombstone& tombstone() { return *this; }

getter on the range_tombstone itself and patching all future
users of the _entry to call .tombstone() right now.

Next patch will remove those getters together with adding the new
range_tombstone_entry object thus automatically converting all
the patched places into using the entry in a proper way.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 19:34:45 +03:00
Pavel Emelyanov
ac473a9e67 mutation_partition: Shorten memory usage calculation
The range_tombstone_list's replacer runs exactly the same loop

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 12:56:13 +03:00
Pavel Emelyanov
f173be29d9 mutation_partition: Remove unused local variable
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-09-03 12:56:13 +03:00
Benny Halevy
4476800493 flat_mutation_reader: get rid of timeout parameter
Now that the timeout is taken from the reader_permit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Benny Halevy
4439e5c132 everywhere: cleanup defer.hh includes
Get rid of unused includes of seastar/util/{defer,closeable}.hh
and add a few that are missing from source files.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-22 21:11:39 +03:00
Asias He
4c1f8c2f83 compaction: Move compaction_garbage_collector.hh to compaction dir
The top dir is a mess. Move compaction_garbage_collector.hh to
the new home.
2021-08-07 08:07:09 +08:00
Pavel Emelyanov
1bf643d4fd mutation_partition: Pin mutable access to range tombstones
Some callers of mutation_partition::row_tomstones() don't want
(and shouldn't) modify the list itself, while they may want to
modify the tombstones. This patch explicitly locates those that
need to modify the collection, because the next patch will
return immutable collection for the others.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Tomasz Grabiec
7fa4e10aa0 row_cache: Use generic LRU for eviction
In preparation for tracking different kinds of objects, not just
rows_entry, in the LRU, switch to the LRU implementation form
utils/lru.hh which can hold arbitrary element type.
2021-07-02 10:25:58 +02:00
Tomasz Grabiec
655bc9fba5 mutation_partition: Trim range tombstones to query ranges
Current code was only selecting overlapping range tombstones.

We will need range tombstones to be trimmed. This is needed to change
the semantics of flat_mutation_reader v1 to produce only range
tombstones trimmed to clustering restrictions. This constructor is
used in unit tests which verify what reader produces.
2021-06-16 00:23:49 +02:00
Pavel Solodovnikov
76bea23174 treewide: reduce header interdependencies
Use forward declarations wherever possible.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>

Closes #8813
2021-06-07 15:58:35 +03:00
Avi Kivity
a55b434a2b treewide: extent copyright statements to present day 2021-06-06 19:18:49 +03:00
Benny Halevy
7427b60caf mutation_partition: counter_write_query: close reader when done
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-04-25 11:35:07 +03:00