scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-29 12:47:02 +00:00

Author	SHA1	Message	Date
Raphael S. Carvalho	ef8f542d75	replica: Adapt table::active_memtable() to compaction groups active_memtable() was fine to a single group, but with multiple groups, there will be one active memtable per group. Let's change the interface to reflect that. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-12-19 11:15:14 -03:00
Pavel Emelyanov	6075e01312	test/lib: Remove sstable_utils.hh from simple_schema.hh The latter is pretty popular test/lib header that disseminates the former one over whole lot of unit tests. The former, in turn, naturally includes sstables.hh thus making tons of unrelated tests depend on sstables class unused by them. However, simple removal doesn't work, because of local_shard_only bool class definition in sstable_utils.hh used in simple_schema.hh. This thing, in turn, is used in keys making helpers that don't belong to sstable utils, so these are moved into simple_schema as well. When done, this affects the mutation_source_test.hh, which needs the local_shard_only bool class (and helps spreading the sstables.hh throughout more unrelated tests) and a bunch of .cc test sources that used sstable_utils.hh to indirectly include various headers of their demand. After patching, sstables.hh touches 2x fewer tests. As a side effect the sstables_manager.hh also becomes 2x less depended on by tests. Continuation of `9bdea110a6` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12240	2022-12-08 15:37:33 +02:00
Avi Kivity	444de2831e	dirty_memory_manager: move to replica module It's a replica-side thing, so move it there. The related flush_permit and sstable_write_permit are moved alongside.	2022-12-06 22:24:17 +02:00
Avi Kivity	37c6b46d26	dirty_memory_manager: re-term "virtual dirty" to "unspooled dirty" The "virtual dirty" term is not very informative. "Virtual" means "not real", but it doesn't say in which way it isn't real. In this case, virtual dirty refers to real dirty memory, minus the portion of memtables that has been written to disk (but not yet sealed - in that case it would not be dirty in the first place). I chose to call "the portion of memtables that has been written to disk" as "spooled memory". At least the unique term will cause people to look it up and may be easier to remember. From that we have "unspooled memory". I plan to further change the accounting to account for spooled memory rather than unspooled, as that is a more natural term, but that is left for later. The documentation, config item, and metrics are adjusted. The config item is practically unused so it isn't worth keeping compatibility here.	2022-10-04 14:03:59 +03:00
Benny Halevy	0627667a06	mutation_partition: compact_for_compaction: get tombstone_gc_state And pass down to `do_compact`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:15 +03:00
Mikołaj Sielużycki	e0c6e1ef3c	table: Add test where compaction doesn't keep up with flush rate. The test simulates a situation where 2 threads issue flushes to 2 tables. Both issue small flushes, but one has injected reactor stalls. This can lead to a situation where lots of small sstables accumulate on disk, and, if compaction never has a chance to keep up, resources can be exhausted. (cherry picked from commit `b5684aa96d`) (cherry picked from commit `25407a7e41`)	2022-07-28 14:43:33 +03:00
Benny Halevy	bb9eddc67f	test: memtable_test: failed_flush_prevents_writes: notify_soft_pressure only once Now that memtable flush error handling was moved entirely to table::seal_active_memtable, we don't need to notify_soft_pressure to keep retry going. The infinite retry loop should eventually either succeed or die (by isolating the node or aborting) on its own. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 14:06:59 +03:00
Benny Halevy	b5abbb971f	test: memtable_test: failed_flush_prevents_writes: extend error injection Inject errors into all seal_active_memtable distinct error handling sites. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-07-27 14:06:59 +03:00
Avi Kivity	419fe65259	Revert "Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki" This reverts commit `aa8f135f64`, reversing changes made to `9a88bc260c`. The patch causes hangs during flush. Also reverts parts of `411231da75` that impacted the unit test. Fixes #10897.	2022-07-06 12:19:02 +03:00
Tomasz Grabiec	a6aef60b93	memtable: Fix missing range tombstones during reads under certain rare conditions There is a bug introduced in `e74c3c8` (4.6.0) which makes memtable reader skip a range tombstone for a certain pattern of deletions and under certain sequence of events. _rt_stream contains the result of deoverlapping range tombstones which had the same position, which were sipped from all the versions. The result of deoverlapping may produce a range tombstone which starts later, at the same position as a more recent tombstone which has not been sipped from the partition version yet. If we consume the old range tombstone from _rt_stream and then refresh the iterators, the refresh will skip over the newer tombstone. The fix is to drop the logic which drains _rt_stream so that _rt_stream is always merged with partition versions. For the problem to trigger, there have to be multiple MVCC versions (at least 2) which contain deletions of the following form: [a, c] @ t0 [a, b) @ t1, [b, d] @ t2 c > b The proper sequence for such versions is (assuming d > c): [a, b) @ t1, [b, d] @ t2 Due to the bug, the reader will produce: [a, b) @ t1, [b, c] @ t0 The reader also needs to be preempted right before processing [b, d] @ t2 and iterators need to get invalidated so that lsa_partition_reader::do_refresh_state() is called and it skips over [b, d] @ t2. Otherwise, the reader will emit [b, d] @ t2 later. If it does emit the proper range tombstone, it's possible that it will violate fragment order in the stream if _rt_stream accumulated remainders (possible with 3 MVCC versions). The problem goes away once MVCC versions merge. Fixes #10913 Fixes #10830 Closes #10914	2022-06-29 19:02:23 +03:00
Kamil Braun	411231da75	test/boost: memtable_test: perform schema operations on shard 0 Will be a prerequisite with Raft enabled.	2022-06-23 16:14:41 +02:00
Botond Dénes	4bd4aa2e88	Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec When memtable receives a tombstone it can happen under some workloads that it covers data which is still in the memtable. Some workloads may insert and delete data within a short time frame. We could reduce the rate of memtable flushes if we eagerly drop tombstoned data. One workload which benefits is the raft log. It stores a row for each uncommitted raft entry. When entries are committed they are deleted. So the live set is expected to be short under normal conditions. Fixes #652. Closes #10807 * github.com:scylladb/scylla: memtable: Add counters for tombstone compaction memtable, cache: Eagerly compact data with tombstones memtable: Subtract from flushed memory when cleaning mvcc: Introduce apply_resume to hold state for partition version merging test: mutation: Compare against compacted mutations compacting_reader: Drop irrelevant tombstones mutation_partition: Extract deletable_row::compact_and_expire() mvcc: Apply mutations in memtable with preemption enabled test: memtable: Make failed_flush_prevents_writes() immune to background merging	2022-06-15 18:12:42 +03:00
Tomasz Grabiec	3bec1cc19f	test: memtable: Make failed_flush_prevents_writes() immune to background merging Before the change, the test artificially set the soft pressure condition hoping that the background flusher will flush the memtable. It won't happen if by the time the background flusher runs the LSA region is updated and soft pressure (which is not really there) is lifted. Once apply() becomes preemptible, background partition version merging can lift the soft pressure, making the memtable flush not occur and making the test fail. Fix by triggering soft pressure on retries. Fixes #10801 Refs #10793 (cherry picked from commit `0e78ad50ea`) Closes #10802	2022-06-15 14:33:19 +02:00
Tomasz Grabiec	94f9109bea	memtable, cache: Eagerly compact data with tombstones When memtable receives a tombstone it can happen under some workloads that it covers data which is still in the memtable. Some workloads may insert and delete data within a short time frame. We could reduce the rate of memtable flushes if we eagerly drop tombstoned data. One workload which benefits is the raft log. It stores a row for each uncommitted raft entry. When entries are committed they are deleted. So the live set is expected to be short under normal conditions. Fixes #652.	2022-06-15 11:30:25 +02:00
Tomasz Grabiec	53026f3ba6	memtable: Subtract from flushed memory when cleaning This patch prevents virtual dirty from going negative during memtable flush in case partition version merging erases data previously accounted by the flush reader. There is an assert in ~flush_memory_accounter which guards for this. This will start happening after tombstones are compacted with rows on partition version merging. This problem is prevented by the patch by having the cleaner notify the memtable layer via callback about the amount of dirty memory released during merging, so that the memtable layer can adjust its accounting.	2022-06-15 11:30:25 +02:00
Tomasz Grabiec	c682521ac7	test: memtable: Make failed_flush_prevents_writes() immune to background merging Before the change, the test artificially set the soft pressure condition hoping that the background flusher will flush the memtable. It won't happen if by the time the background flusher runs the LSA region is updated and soft pressure (which is not really there) is lifted. Once apply() becomes preemptible, background partition version merging can lift the soft pressure, making the memtable flush not occur and making the test fail. Fix by triggering soft pressure on retries.	2022-06-15 11:29:43 +02:00
Mikołaj Sielużycki	25407a7e41	table: Add test where compaction doesn't keep up with flush rate. The test simulates a situation where 2 threads issue flushes to 2 tables. Both issue small flushes, but one has injected reactor stalls. This can lead to a situation where lots of small sstables accumulate on disk, and, if compaction never has a chance to keep up, resources can be exhausted.	2022-06-15 10:57:28 +02:00
Avi Kivity	5129280f45	Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec" This reverts commit `e0670f0bb5`, reversing changes made to `605ee74c39`. It causes failures in debug mode in database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain, though with low probability. Fixes #10780 Reopens #652.	2022-06-14 18:06:22 +03:00
Benny Halevy	5bd2e0ccce	test: memtable_test: failed_flush_prevents_writes: validate flush using min_memtable_timestamp active_memtable().empty() becomes true once seal_active_memtable succeeds with _memtables->add_memtable(), not when it is able to flush the (once active) memtable. In contrast, min_memtable_timestamp() returns api::max_timestamp only if there is no data in any memtable. Fixes #10793 Backport notes: - Introduced in `f6d9d6175f` (currently in branch-5.0) - backport requires also `0e78ad50ea` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #10798	2022-06-14 16:13:35 +03:00
Tomasz Grabiec	beadd248e3	memtable, cache: Eagerly compact data with tombstones When memtable receives a tombstone it can happen under some workloads that it covers data which is still in the memtable. Some workloads may insert and delete data within a short time frame. We could reduce the rate of memtable flushes if we eagerly drop tombstoned data. One workload which benefits is the raft log. It stores a row for each uncommitted raft entry. When entries are committed they are deleted. So the live set is expected to be short under normal conditions. Fixes #652.	2022-06-06 19:25:41 +02:00
Tomasz Grabiec	9135d1fd1f	memtable: Subtract from flushed memory when cleaning This patch prevents virtual dirty from going negative during memtable flush in case partition version merging erases data previously accounted by the flush reader. There is an assert in ~flush_memory_accounter which guards for this. This will start happening after tombstones are compacted with rows on partition version merging. This problem is prevented by the patch by having the cleaner notify the memtable layer via callback about the amount of dirty memory released during merging, so that the memtable layer can adjust its accounting.	2022-06-06 19:25:41 +02:00
Tomasz Grabiec	0e78ad50ea	test: memtable: Make failed_flush_prevents_writes() immune to background merging Before the change, the test artificially set the soft pressure condition hoping that the background flusher will flush the memtable. It won't happen if by the time the background flusher runs the LSA region is updated and soft pressure (which is not really there) is lifted. Once apply() becomes preemptible, background partition version merging can lift the soft pressure, making the memtable flush not occur and making the test fail. Fix by triggering soft pressure on retries.	2022-06-06 19:23:37 +02:00
Michael Livshin	029508b77c	flat_mutation_reader ist tot Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-31 23:42:34 +03:00
Michael Livshin	3cc2343775	tests: trivial flat_reader_assertions{,_v2} conversions (Which entails temporary cut-and-pasting some utility functions) Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-05-10 22:10:40 +03:00
Mikołaj Sielużycki	1d84a254c0	flat_mutation_reader: Split readers by file and remove unnecessary includes. The flat_mutation_reader files were conflated and contained multiple readers, which were not strictly necessary. Splitting optimizes both iterative compilation times, as touching rarely used readers doesn't recompile large chunks of codebase. Total compilation times are also improved, as the size of flat_mutation_reader.hh and flat_mutation_reader_v2.hh have been reduced and those files are included by many files in the codebase. With changes real 29m14.051s user 168m39.071s sys 5m13.443s Without changes real 30m36.203s user 175m43.354s sys 5m26.376s Closes #10194	2022-03-14 13:20:25 +02:00
Michael Livshin	9bacce4359	memtable::make_flat_reader(): return flat_mutation_reader_v2 This is just a facade change. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-02-28 17:11:54 +02:00
Avi Kivity	cbba80914d	memtable: move to replica module and namespace Memtables are a replica-side entity, and so are moved to the replica module and namespace. Memtables are also used outside the replica, in two places: - in some virtual tables; this is also in some way inside the replica, (virtual readers are installed at the replica level, not the coordinator), so I don't consider it a layering violation - in many sstable unit tests, as a convenient way to create sstables with known input. This is a layering violation. We could make memtables their own module, but I think this is wrong. Memtables are deeply tied into replica memory management, and trying to make them a low-level primitive (at a lower level than sstables) will be difficult. Not least because memtables use sstables. Instead, we should have a memtable-like thing that doesn't support merging and doesn't have all other funky memtable stuff, and instead replace the uses of memtables in sstable tests with some kind of make_flat_mutation_reader_from_unsorted_mutations() that does the sorting that is the reason for the use of memtables in tests (and live with the layering violation meanwhile). Test: unit (dev) Closes #10120	2022-02-23 09:05:16 +02:00
Kamil Braun	a664ac7ba5	treewide: require `group0_guard` when performing schema changes `announce` now takes a `group0_guard` by value. `group0_guard` can only be obtained through `migration_manager::start_group0_operation` and moved, it cannot be constructed outside `migration_manager`. The guard will be a method of ensuring linearizability for group 0 operations.	2022-01-24 15:20:35 +01:00
Kamil Braun	283ac7fefe	treewide: pass mutation timestamp from call sites into `migration_manager::prepare_*` functions The functions which prepare schema change mutations (such as `prepare_new_column_family_announcement`) would use internally generated timestamps for these mutations. When schema changes are managed by group 0 we want to ensure that timestamps of mutations applied through Raft are monotonic. We will generate these timestamps at call sites and pass them into the `prepare_` functions. This commit prepares the APIs.	2022-01-24 15:12:50 +01:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes we applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Mikołaj Sielużycki	f6d9d6175f	sstables: Harden bad_alloc handling during memtable flush. dirty_memory_manager monitors memory and triggers memtable flushing if there is too much pressure. If bad_alloc happens during the flush, it may break the loop and flushes won't be triggered automatically, leading to blocked writes as memory won't be automatically released. The solution is to add exception handling to the loop, so that the inner part always returns a non-exceptional future (meaning the loop will break only on node shutdown). try/catch is used around on_internal_error instead of on_internal_error_noexcept, as the latter doesn't have a version that accepts an exception pointer. To get the exception message from std::exception_ptr a rethrow is needed anyway, so this was a simpler approach. Fixes: #4174 Message-Id: <20220114082452.89189-1-mikolaj.sieluzycki@scylladb.com>	2022-01-14 16:09:21 +02:00
Gleb Natapov	512556914a	test: move memtable_test.cc to new schema announcement api	2022-01-13 23:10:13 +02:00
Avi Kivity	bbad8f4677	replica: move ::database, ::keyspace, and ::table to replica namespace Move replica-oriented classes to the replica namespace. The main classes moved are ::database, ::keyspace, and ::table, but a few ancillary classes are also moved. There are certainly classes that should be moved but aren't (like distributed_loader) but we have to start somewhere. References are adjusted treewide. In many cases, it is obvious that a call site should not access the replica (but the data_dictionary instead), but that is left for separate work. scylla-gdb.py is adjusted to look for both the new and old names.	2022-01-07 12:04:38 +02:00
Avi Kivity	ae3a360725	database: Move database, keyspace, table classes to replica/ directory The database, keyspace, and table classes represent the replica-only part of the objects after which they are named. Reading from a table doesn't give you the full data, just the replica's view, and it is not consistent since reconciliation is applied on the coordinator. As a first step in acknowledging this, move the related files to a replica/ subdirectory.	2022-01-06 17:07:30 +02:00
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster-wide repair is not performed within gc_grace_seconds. This design pushes the job of keeping the database consistent to the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system which can be slowed down by the higher priority workload, so that there is no guarantee when a repair can finish. A gc_grace_seconds value that used to work might not work after data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect against data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode. If no tombstone_gc option is specified by the user, the old gc_grace_seconds based gc will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly delay before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
Mikołaj Sielużycki	504efe0607	table: Prevent resurrecting data from memtable on compaction Mutations are not guaranteed to come in the order of their timestamps. If there is an expired tombstone in the sstable and a repair inserts old data into memtable, the compaction would not consider memtable data and purge the tombstone leading to data resurrection. The solution is to disallow purging tombstones newer than min memtable timestamp.	2021-12-09 13:22:14 +01:00
Mikołaj Sielużycki	a88f7df195	memtable-sstable: Add compacting reader when flushing memtable. When memtable contains both mutations and tombstones that delete them, the output flushed to sstables contains both mutations. Inserting a compacting reader results in writing smaller sstables and saves compaction work later. Performance tests of this change have shown a regression in a common case where there are no deletes. A heuristic is employed to skip compaction unless there are tombstones in the memtable to minimise the impact of that issue.	2021-11-29 13:19:42 +01:00
Michał Radwański	cac9ac5126	test: memtable: add full_compaction in background Add full compaction in test_memtable_with_many_versions_conforms_to_mutation_source in background. Without it, some paths in the partition snapshot reader weren't covered, as the tests always managed to read all range tombstones and rows which cover a given clustering range from just a single snapshot. Now, when full_compaction happens in process of reading from a clustering range, we can force state refresh with non-nullopt positions of last row and last range tombstone. Note: this inability to test affected only the reversing reader.	2021-11-04 16:19:54 +01:00
Benny Halevy	4476800493	flat_mutation_reader: get rid of timeout parameter Now that the timeout is taken from the reader_permit. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-08-24 16:30:51 +03:00
Botond Dénes	2d2b9e7b36	test/boost: migrate off the global test reader semaphore	2021-07-08 16:53:38 +03:00
Botond Dénes	0f36e5c498	memtable: migrate off the global reader concurrency semaphore Require the caller of `create_flush_reader()` to pass a permit instead.	2021-07-08 12:31:36 +03:00
Avi Kivity	a55b434a2b	treewide: extend copyright statements to present day	2021-06-06 19:18:49 +03:00
Benny Halevy	aa5289f255	test: everywhere: close flat_mutation_reader when done Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	574759bf95	memtable: flush_reader: make sure to close partition reader Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Benny Halevy	93b5d7d4c2	memtable: scanning_reader: make sure to close underlying reader Close _delegate if it's engaged both in the close() method and when ever it is currently reset by _delegate = {}. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-04-25 11:35:07 +03:00
Botond Dénes	4f5ccf82cb	mutation_fragment: s/as_mutable_clustering_row/mutate_as_clustering_row/ We will soon want to update the memory consumption of mutation fragment after each modification done to it, to do that safely we have to forbid direct access to the underlying data and instead have callers pass a lambda doing their modifications. Uses where this method was just used to move the fragment away are converted to use `as_clustering_row() &&`.	2020-09-28 10:53:56 +03:00
Rafael Ávila de Espíndola	74db08165d	tests: Convert to using memory::with_allocation_failures Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200805155143.122396-1-espindola@scylladb.com>	2020-08-10 18:37:42 +03:00
Botond Dénes	9ede82ebf8	memtable: pass a valid permit to the delegate reader All readers are soon going to require a valid permit, so make sure we have a valid permit which we can pass to the delegate reader when creating it. This means `memtable::make_flat_reader()` now also requires a permit to be passed to it. Internally the permit is stored in `scanning_reader`, which is used both for flushes and normal reads. In the former case a permit is not required.	2020-05-28 11:34:35 +03:00
Konstantin Osipov	ff3f9cb7cf	test: stop using BOOST_TEST_MESSAGE() for logging We use boost test logging primarily to generate nice XML xunit files used in Jenkins. These XML files can be bloated with messages from BOOST_TEST_MESSAGE(), hundreds of megabytes of build archives, on every build. Let's use seastar logger for test logging instead, reserving the use of boost log facilities for boost test markup information.	2020-03-05 11:38:11 +03:00
Rafael Ávila de Espíndola	dca1bc480f	everywhere: Use serialized(foo) instead of data_value(foo).serialize() This is just a simple cleanup that reduces the size of another patch I am working on and is an independent improvement. Signed-off-by: Rafael Ávila de Espíndola <espindola@scylladb.com> Message-Id: <20200114051739.370127-1-espindola@scylladb.com>	2020-01-14 12:17:12 +02:00


51 Commits