Commit Graph

61 Commits

Author SHA1 Message Date
Avi Kivity
0770b328c7 test: fix some mismatched signed/unsigned comparisons
gcc likes to complain about sized/unsigned compares as they
can yield surprising results. The fixes are trivial, so apply
them.
2023-03-21 13:15:12 +02:00
Kefu Chai
60eac12db6 test: memtable_test: mark dummy variable for loop [[maybe_unused]]
without C++23 `std::ranges::repeat_view`, it'd be cumbersume to
implement a loop without dummy variable. this change helps to
silence following warning:

```
test/boost/memtable_test.cc:1135:26: error: unused variable 'value' [-Werror,-Wunused-variable]
                for (int value : boost::irange<int>(0, num_flushes)) {
                         ^
```

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2023-02-28 21:56:55 +08:00
Avi Kivity
69a385fd9d Introduce schema/ module
Schema related files are moved there. This excludes schema files that
also interact with mutations, because the mutation module depends on
the schema. Those files will have to go into a separate module.

Closes #12858
2023-02-15 11:01:50 +02:00
Tomasz Grabiec
ccc8e47db1 Merge 'test/lib: introduce key_utils.hh' from Botond Dénes
We currently have two method families to generate partition keys:
* make_keys() in test/lib/simple_schema.hh
* token_generation_for_shard() in test/lib/sstable_utils.hh

Both work only for schemas with a single partition key column of `text` type and both generate keys of fixed size.
This is very restrictive and simplistic. Tests, which wanted anything more complicated than that had to rely on open-coded key generation.
Also, many tests started to rely on the simplistic nature of these keys, in particular two tests started failing because the new key generation method generated keys of varying size:
* sstable_compaction_test.sstable_run_based_compaction_test
* sstable_mutation_test.test_key_count_estimation

These two tests seems to depend on generated keys all being of the same size. This makes some sense in the case of the key count estimation test, but makes no sense at all to me in the case of the sstable run test.

Closes #12657

* github.com:scylladb/scylladb:
  test/lib/sstable_utils: remove now unused token_generation_for_shard() and friends
  test/lib/simple_schema: remove now unused make_keys() and friends
  test: migrate to tests::generate_partition_key[s]()
  test/lib/test_services: add table_for_tests::make_default_schema()
  test/lib: add key_utils.hh
  test/lib/random_schema.hh: value_generator: add min_size_in_bytes
2023-02-06 18:11:32 +01:00
Avi Kivity
f73e2c992f Merge 'Keep range tombstones with rows in memtables and cache' from Tomasz Grabiec
This series switches memtable and cache to use a new representation for mutation data,
called `mutation_partition_v2`. In this representation, range tombstone information is stored
in the same tree as rows, attached to row entries. Each entry has a new tombstone field,
which represents range tombstone part which applies to the interval between this entry and
the previous one. See docs/dev/mvcc.md for more details about the format.

The transient mutation object still uses the old model in order to avoid work needed to adapt
old code to the new model. It may also be a good idea to live with two models, since the
transient mutation has different requirements and thus different trade-offs can be made.
Transient mutation doesn't need to support eviction and strong exception guarantees,
so its algorithms and in-memory representation can be simpler.

This allows us to incrementally evict range tombstone information. Before this series,
range tombstones were accumulated and evicted only when the whole partition entry was evicted. This
could lead to inefficient use of cache memory.

Another advantage of the new representation is that reads don't have to lookup
range tombstone information in a different tree while reading. This leads to simpler
and more efficient readers.

There are several disadvantages too. Firstly, rows_entry is now larger by 16 bytes.
Secondly, update algorithms are more complex because they need to deoverlap range tombstone
information. Also, to handle preemption and provide strong exception guarantees, update
algorithms may need to allocate sentinel entries, which adds complexity and reduces performance.

The memtable reader was changed to use the same cursor implementation
which cache uses, for improved code reuse and reducing risk of bugs
due to discrepancy of algorithms which deal with MVCC.

Remaining work:
  - performance optimizations to apply_monotonically() to avoid regressions
  - performance testing
  - preemption support in apply_to_incomplete (cache update from memtable)

Fixes #2578
Fixes #3288
Fixes #10587

Closes #12048

* github.com:scylladb/scylladb:
  test: mvcc: Extend some scenarios with exhaustive consistency checks on eviction
  test: mvcc: Extract mvcc_container::allocate_in_region()
  row_cache, lru: Introduce evict_shallow()
  test: mvcc: Avoid copies of mutation under failure injection
  test: mvcc: Add missing logalloc::reclaim_lock to test_apply_is_atomic
  mutation_partition_v2: Avoid full scan when applying mutation to non-evictable
  Pass is_evictable to apply()
  tests: mutation_partition_v2: Introduce test_external_memory_usage_v2 mirroring the test for v1
  tests: mutation: Fix test_external_memory_usage() to not measure mutation object footprint
  tests: mutation_partition_v2: Add test for exception safety of mutation merging
  tests: Add tests for the mutation_partition_v2 model
  mutation_partition_v2: Implement compact()
  cache_tracker: Extract insert(mutation_partition_v2&)
  mvcc, mutation_partition: Document guarantees in case merging succeeds
  mutation_partition_v2: Accept arbitrary preemption source in apply_monotonically()
  mutation_partition_v2: Simplify get_continuity()
  row_cache: Distinguish dummy insertion site in trace log
  db: Use mutation_partition_v2 in mvcc
  range_tombstone_change_merger: Introduce peek()
  readers: Extract range_tombstone_change_merger
  mvcc: partition_snapshot_row_cursor: Handle non-evictable snapshots
  mvcc: partition_snapshot_row_cursor: Support digest calculation
  mutation_partition_v2: Store range tombstones together with rows
  db: Introduce mutation_partition_v2
  doc: Introduce docs/dev/mvcc.md
  db: cache_tracker: Introduce insert() variant which positions before existing entry in the LRU
  db: Print range_tombstone bounds as position_in_partition
  test: memtable_test: Relax test_segment_migration_during_flush
  test: cache_flat_mutation_reader: Avoid timestamp clash
  test: cache_flat_mutation_reader_test: Use monotonic timestamps when inserting rows
  test: mvcc: Fix sporadic failures due to compact_for_compaction()
  test: lib: random_mutation_generator: Produce partition tombstone less often
  test: lib: random_utils: Introduce with_probability()
  test: lib: Improve error message in has_same_continuity()
  test: mvcc: mvcc_container: Avoid UB in tracker() getter when there is no tracker
  test: mvcc: Insert entries in the tracker
  test: mvcc_test: Do not set dummy::no on non-clustering rows
  mutation_partition: Print full position in error report in append_clustered_row()
  db: mutation_cleaner: Extract make_region_space_guard()
  position_in_partition: Optimize equality check
  mvcc: Fix version merging state resetting
  mutation_partition: apply_resume: Mark operator bool() as explicit
2023-02-05 22:33:10 +02:00
Raphael S. Carvalho
3c5afb2d5c test: Enable Scylla test command line options for boost tests
We have enabled the command line options without changing a
single line of code, we only had to replace old include
with scylla_test_case.hh.

Next step is to add x-log-compaction-groups options, which will
determine the number of compaction groups to be used by all
instantiations of replica::table.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-02-01 20:14:51 -03:00
Raphael S. Carvalho
2d2460046b test: memtable_test: Fix it with multiple compaction groups
With compaction groups, automatic flushing may not pick the user
table. Fix it by using explicit flush.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2023-02-01 20:14:51 -03:00
Botond Dénes
4ad3ba52b0 test: migrate to tests::generate_partition_key[s]()
Use the newly introduced key generation facilities, instead of the the
old inflexible alternatives and hand-rolled code.
Most of the migrations are mechanic, but there are two tests that
were tricky to migrate:
* sstable_compaction_test.sstable_run_based_compaction_test
* sstable_mutation_test.test_key_count_estimation

These two tests seems to depend on generated keys all being of the same
size. This makes some sense in the case of the key count estimation
test, but makes no sense at all to me in the case of the sstable run
test.
2023-01-30 05:03:42 -05:00
Tomasz Grabiec
919ff433d1 tests: Add tests for the mutation_partition_v2 model 2023-01-27 21:56:31 +01:00
Tomasz Grabiec
40719c600c test: memtable_test: Relax test_segment_migration_during_flush
Partition version merging can now insert sentinels, which may
temporarily increase unspooled memory. It is no longer true that
unspooled monotonically decreases, which the test verified.  Relax it,
and only verify that unspooled is smaller than real dirty.
2023-01-27 19:15:39 +01:00
Raphael S. Carvalho
ef8f542d75 replica: Adapt table::active_memtable() to compaction groups
active_memtable() was fine to a single group, but with multiple groups,
there will be one active memtable per group. Let's change the
interface to reflect that.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2022-12-19 11:15:14 -03:00
Pavel Emelyanov
6075e01312 test/lib: Remove sstable_utils.hh from simple_schema.hh
The latter is pretty popular test/lib header that disseminates the
former one over whole lot of unit tests. The former, in turn, naturally
includes sstables.hh thus making tons of unrelated tests depend on
sstables class unused by them.

However, simple removal doesn't work, becase of local_shard_only bool
class definition in sstable_utils.hh used in simple_schema.hh. This
thing, in turn, is used in keys making helpers that don't belong to
sstable utils, so these are moved into simple_schema as well.

When done, this affects the mutation_source_test.hh, which needs the
local_shard_only bool class (and helps spreading the sstables.hh
throughout more unrelated tests) and a bunch of .cc test sources that
used sstable_utils.hh to indirectly include various headers of their
demand.

After patching, sstables.hh touches 2x times less tests. As a side
effect the sstables_manager.hh also becomes 2x times less dependent
on by tests.

Continuation of 9bdea110a6

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes #12240
2022-12-08 15:37:33 +02:00
Avi Kivity
444de2831e dirty_memory_manager: move to replica module
It's a replica-side thing, so move it there. The related
flush_permit and sstable_write_permit are moved alongside.
2022-12-06 22:24:17 +02:00
Avi Kivity
37c6b46d26 dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty"
The "virtual dirty" term is not very informative. "Virtual" means
"not real", but it doesn't say in which way it isn't real.

In this case, virtual dirty refers to real dirty memory, minus
the portion of memtables that has been written to disk (but not
yet sealed - in that case it would not be dirty in the first
place).

I chose to call "the portion of memtables that has been written
to disk" as "spooled memory". At least the unique term will cause
people to look it up and may be easier to remember. From that
we have "unspooled memory".

I plan to further change the accounting to account for spooled memory
rather than unspooled, as that is a more natural term, but that is left
for later.

The documentation, config item, and metrics are adjusted. The config
item is practically unused so it isn't worth keeping compatibility here.
2022-10-04 14:03:59 +03:00
Benny Halevy
0627667a06 mutation_partition: compact_for_compaction: get tombstone_gc_state
And pass down to `do_compact`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Mikołaj Sielużycki
e0c6e1ef3c table: Add test where compaction doesn't keep up with flush rate.
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.

(cherry picked from commit b5684aa96d)
(cherry picked from commit 25407a7e41)
2022-07-28 14:43:33 +03:00
Benny Halevy
bb9eddc67f test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once
Now that memtable flush error handling was moved entirely
to table::seal_active_memtable, we don't need to notify_soft_pressure
to keep retry going.  The inifinite retry loop should
eventually either succeed or die (by isolating the node or aborting)
on its own.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:59 +03:00
Benny Halevy
b5abbb971f test: memtable_test: failed_flush_prevents_writes: extend error injection
Inject errors into all seal_active_memtable distinct error
handling sites.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-07-27 14:06:59 +03:00
Avi Kivity
419fe65259 Revert "Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki"
This reverts commit aa8f135f64, reversing
changes made to 9a88bc260c. The patch
causes hangs during flush.

Also reverts parts of 411231da75 that impacted the unit test.

Fixes #10897.
2022-07-06 12:19:02 +03:00
Tomasz Grabiec
a6aef60b93 memtable: Fix missing range tombstones during reads under ceratin rare conditions
There is a bug introduced in e74c3c8 (4.6.0) which makes memtable
reader skip one a range tombstone for a certain pattern of deletions
and under certain sequence of events.

_rt_stream contains the result of deoverlapping range tombstones which
had the same position, which were sipped from all the versions. The
result of deoverlapping may produce a range tombstone which starts
later, at the same position as a more recent tombstone which has not
been sipped from the partition version yet. If we consume the old
range tombstone from _rt_stream and then refresh the iterators, the
refresh will skip over the newer tombstone.

The fix is to drop the logic which drains _rt_stream so that
_rt_stream is always merged with partition versions.

For the problem to trigger, there have to be multiple MVCC versions
(at least 2) which contain deletions of the following form:

[a, c] @ t0
[a, b) @ t1, [b, d] @ t2

c > b

The proper sequence for such versions is (assuming d > c):

[a, b) @ t1,
[b, d] @ t2

Due to the bug, the reader will produce:

[a, b) @ t1,
[b, c] @ t0

The reader also needs to be preempted right before processing [b, d] @
t2 and iterators need to get invalidated so that
lsa_partition_reader::do_refresh_state() is called and it skips over
[b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it
does emit the proper range tombstone, it's possible that it will violate
fragment order in the stream if _rt_stream accumulated remainders
(possible with 3 MVCC versions).

The problem goes away once MVCC versions merge.

Fixes #10913
Fixes #10830

Closes #10914
2022-06-29 19:02:23 +03:00
Kamil Braun
411231da75 test/boost: memtable_test: perform schema operations on shard 0
Will be a prerequisite with Raft enabled.
2022-06-23 16:14:41 +02:00
Botond Dénes
4bd4aa2e88 Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.

Closes #10807

* github.com:scylladb/scylla:
  memtable: Add counters for tombstone compaction
  memtable, cache: Eagerly compact data with tombstones
  memtable: Subtract from flushed memory when cleaning
  mvcc: Introduce apply_resume to hold state for partition version merging
  test: mutation: Compare against compacted mutations
  compacting_reader: Drop irrelevant tombstones
  mutation_partition: Extract deletable_row::compact_and_expire()
  mvcc: Apply mutations in memtable with preemption enabled
  test: memtable: Make failed_flush_prevents_writes() immune to background merging
2022-06-15 18:12:42 +03:00
Tomasz Grabiec
3bec1cc19f test: memtable: Make failed_flush_prevents_writes() immune to background merging
Before the change, the test artificiallu set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptibe, backgroun partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.

Fix by triggering soft pressure on retries.

Fixes #10801
Refs #10793

(cherry picked from commit 0e78ad50ea)

Closes #10802
2022-06-15 14:33:19 +02:00
Tomasz Grabiec
94f9109bea memtable, cache: Eagerly compact data with tombstones
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drpo tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.
2022-06-15 11:30:25 +02:00
Tomasz Grabiec
53026f3ba6 memtable: Subtract from flushed memory when cleaning
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.

This will start happening after tombstones are compacted with rows on
partition version merging.

This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
2022-06-15 11:30:25 +02:00
Tomasz Grabiec
c682521ac7 test: memtable: Make failed_flush_prevents_writes() immune to background merging
Before the change, the test artificiallu set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptibe, backgroun partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.

Fix by triggering soft pressure on retries.
2022-06-15 11:29:43 +02:00
Mikołaj Sielużycki
25407a7e41 table: Add test where compaction doesn't keep up with flush rate.
The test simulates a situation where 2 threads issue flushes to 2
tables. Both issue small flushes, but one has injected reactor stalls.
This can lead to a situation where lots of small sstables accumulate on
disk, and, if compaction never has a chance to keep up, resources can be
exhausted.
2022-06-15 10:57:28 +02:00
Avi Kivity
5129280f45 Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec"
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.

Fixes #10780
Reopens #652.
2022-06-14 18:06:22 +03:00
Benny Halevy
5bd2e0ccce test: memtable_test: failed_flush_prevents_writes: validate flush using min_memtable_timestamp
active_memtable().empty() becomes true once seal_active_memtable
succeeds with _memtables->add_memtable(), not when it is able
to flush the (once active) memtable.

In contrast, min_memtable_timestamp() returns api::max_timestamp
only if there is no data in any memtable.

Fixes #10793

Backport notes:
- Introduced in f6d9d6175f (currently in
branch-5.0)
- backport requires also 0e78ad50ea

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes #10798
2022-06-14 16:13:35 +03:00
Tomasz Grabiec
beadd248e3 memtable, cache: Eagerly compact data with tombstones
When memtable receives a tombstone it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drpo tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.
2022-06-06 19:25:41 +02:00
Tomasz Grabiec
9135d1fd1f memtable: Subtract from flushed memory when cleaning
This patch prevents virtual dirty from going negative during memtable
flush in case partition version merging erases data previously
accounted by the flush reader. There is an assert in
~flush_memory_accounter which guards for this.

This will start happening after tombstones are compacted with rows on
partition version merging.

This problem is prevented by the patch by having the cleaner notify
the memtable layer via callback about the amount of dirty memory released
during merging, so that the memtable layer can adjust its accounting.
2022-06-06 19:25:41 +02:00
Tomasz Grabiec
0e78ad50ea test: memtable: Make failed_flush_prevents_writes() immune to background merging
Before the change, the test artificiallu set the soft pressure
condition hoping that the background flusher will flush the
memtable. It won't happen if by the time the background flusher runs
the LSA region is updated and soft pressure (which is not really
there) is lifted. Once apply() becomes preemptibe, backgroun partition
version merging can lift the soft pressure, making the memtable flush
not occur and making the test fail.

Fix by triggering soft pressure on retries.
2022-06-06 19:23:37 +02:00
Michael Livshin
029508b77c flat_mutation_reader ist tot
Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-31 23:42:34 +03:00
Michael Livshin
3cc2343775 tests: trivial flat_reader_assertions{,_v2} conversions
(Which entails temporary cut-and-pasting some utility functions)

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-05-10 22:10:40 +03:00
Mikołaj Sielużycki
1d84a254c0 flat_mutation_reader: Split readers by file and remove unnecessary includes.
The flat_mutation_reader files were conflated and contained multiple
readers, which were not strictly necessary. Splitting optimizes both
iterative compilation times, as touching rarely used readers doesn't
recompile large chunks of codebase. Total compilation times are also
improved, as the size of flat_mutation_reader.hh and
flat_mutation_reader_v2.hh have been reduced and those files are
included by many file in the codebase.

With changes

real	29m14.051s
user	168m39.071s
sys	5m13.443s

Without changes

real	30m36.203s
user	175m43.354s
sys	5m26.376s

Closes #10194
2022-03-14 13:20:25 +02:00
Michael Livshin
9bacce4359 memtable::make_flat_reader(): return flat_mutation_reader_v2
This is just a facade change.

Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>
2022-02-28 17:11:54 +02:00
Avi Kivity
cbba80914d memtable: move to replica module and namespace
Memtables are a replica-side entity, and so are moved to the
replica module and namespace.

Memtables are also used outside the replica, in two places:
 - in some virtual tables; this is also in some way inside the replica,
   (virtual readers are installed at the replica level, not the
   cooordinator), so I don't consider it a layering violation
 - in many sstable unit tests, as a convenient way to create sstables
   with known input. This is a layering violation.

We could make memtables their own module, but I think this is wrong.
Memtables are deeply tied into replica memory management, and trying
to make them a low-level primitive (at a lower level than sstables) will
be difficult. Not least because memtables use sstables. Instead, we
should have a memtable-like thing that doesn't support merging and
doesn't have all other funky memtable stuff, and instead replace
the uses of memtables in sstable tests with some kind of
make_flat_mutation_reader_from_unsorted_mutations() that does
the sorting that is the reason for the use of memtables in tests (and
live with the layering violation meanwhile).

Test: unit (dev)

Closes #10120
2022-02-23 09:05:16 +02:00
Kamil Braun
a664ac7ba5 treewide: require group0_guard when performing schema changes
`announce` now takes a `group0_guard` by value. `group0_guard` can only
be obtained through `migration_manager::start_group0_operation` and
moved, it cannot be constructed outside `migration_manager`.

The guard will be a method of ensuring linearizability for group 0
operations.
2022-01-24 15:20:35 +01:00
Kamil Braun
283ac7fefe treewide: pass mutation timestamp from call sites into migration_manager::prepare_* functions
The functions which prepare schema change mutations (such as
`prepare_new_column_family_announcement`) would use internally
generated timestamps for these mutations. When schema changes are
managed by group 0 we want to ensure that timestamps of mutations
applied through Raft are monotonic. We will generate these timestamps at
call sites and pass them into the `prepare_` functions. This commit
prepares the APIs.
2022-01-24 15:12:50 +01:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes we applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Mikołaj Sielużycki
f6d9d6175f sstables: Harden bad_alloc handling during memtable flush.
dirty_memory_manager monitors memory and triggers memtable flushing if
there is too much pressure. If bad_alloc happens during the flush, it
may break the loop and flushes won't be triggered automatically, leading
to blocked writes as memory won't be automatically released.

The solution is to add exception handling to the loop, so that the inner
part always returns a non-exceptional future (meaning the loop will
break only on node shutdown).

try/catch is used around on_internal_error instead of
on_internal_error_noexcept, as the latter doesn't have a version that
accepts an exception pointer. To get the exception message from
std::exception_ptr a rethrow is needed anyway, so this was a simpler
approach.

Fixes: #4174

Message-Id: <20220114082452.89189-1-mikolaj.sieluzycki@scylladb.com>
2022-01-14 16:09:21 +02:00
Gleb Natapov
512556914a test: move memtable_test.cc to new schema announcement api 2022-01-13 23:10:13 +02:00
Avi Kivity
bbad8f4677 replica: move ::database, ::keyspace, and ::table to replica namespace
Move replica-oriented classes to the replica namespace. The main
classes moved are ::database, ::keyspace, and ::table, but a few
ancillary classes are also moved. There are certainly classes that
should be moved but aren't (like distributed_loader) but we have
to start somewhere.

References are adjusted treewide. In many cases, it is obvious that
a call site should not access the replica (but the data_dictionary
instead), but that is left for separate work.

scylla-gdb.py is adjusted to look for both the new and old names.
2022-01-07 12:04:38 +02:00
Avi Kivity
ae3a360725 database: Move database, keyspace, table classes to replica/ directory
The database, keyspace, and table classes represent the replica-only
part of the objects after which they are named. Reading from a table
doesn't give you the full data, just the replica's view, and it is not
consistent since reconciliation is applied on the coordinator.

As a first step in acknowledging this, move the related files to
a replica/ subdirectory.
2022-01-06 17:07:30 +02:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if cluster wide repair is not
performed within gc_grace_seconds. This design pushes the job of making
the database consistency to the user. In practice, it is very hard to
guarantee repair is performed within gc_grace_seconds all the time. For
example, repair workload has the lowest priority in the system which can
be slowed down by the higher priority workload, so that there is no
guarantee when a repair can finish. A gc_grace_seconds value that is
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user. The old gc_grace_seconds based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option 'propagation_delay_in_seconds'
is added. It defines the max time a write could possibly delay before it
eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
Mikołaj Sielużycki
504efe0607 table: Prevent resurrecting data from memtable on compaction
Mutations are not guaranteed to come in the order of their timestamps.
If there is an expired tombstone in the sstable and a repair inserts old
data into memtable, the compaction would not consider memtable data and
purge the tombstone leading to data resurrection. The solution is to
disallow purging tombstones newer than min memtable timestamp.
2021-12-09 13:22:14 +01:00
Mikołaj Sielużycki
a88f7df195 memtable-sstable: Add compacting reader when flushing memtable.
When memtable contains both mutations and tombstones that delete them,
the output flushed to sstables contains both mutations. Inserting a
compacting reader results in writing smaller sstables and saves
compaction work later.

Performance tests of this change have shown a regression in a common
case where there are no deletes. A heuristic is employed to skip
compaction unless there are tombstones in the memtable to minimise
the impact of that issue.
2021-11-29 13:19:42 +01:00
Michał Radwański
cac9ac5126 test: memtable: add full_compaction in background
Add full compaction in test_memtable_with_many_versions_conforms_to_mutation_source
in background. Without it, some paths in the partition snapshot reader
weren't covered, as the tests always managed to read all range
tombstones and rows which cover a given clustering range from just a
single snapshot. Now, when full_compaction happens in process of reading
from a clustering range, we can force state refresh with non-nullopt
positions of last row and last range tombstone.

Note: this inability to test affected only the reversing reader.
2021-11-04 16:19:54 +01:00
Benny Halevy
4476800493 flat_mutation_reader: get rid of timeout parameter
Now that the timeout is taken from the reader_permit.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2021-08-24 16:30:51 +03:00
Botond Dénes
2d2b9e7b36 test/boost: migrate off the global test reader semaphore 2021-07-08 16:53:38 +03:00