scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 11:00:35 +00:00

Author	SHA1	Message	Date
Michał Chojnowski	bed20a2e37	row_cache: use preemption_source in update() To facilitate testing the state of cache after the update is preempted at various points, pass a preemption_source& to update() instead of calling the reactor directly. In release builds, the calls to preemption_source methods should compile to the same direct reactor calls as today. Only in dev mode they should add an extra branch. (However, the `preemption_source&` argument has to be shoveled in any mode).	2024-02-07 18:31:36 +01:00
Michał Chojnowski	2d1a345068	test: mvcc_test: add a test for gentle schema upgrades	2023-05-04 03:35:15 +02:00
Michał Chojnowski	0273101890	partition_version: remove the unused "from" argument in partition_entry::upgrade() partition_entry now contains a reference to its schema, so it doesn't have to be supplied by the caller anymore.	2023-05-04 02:37:30 +02:00
Michał Chojnowski	94e4dc3d8d	partition_version: add a logalloc::region argument to partition_entry::upgrade() The argument is currently unused, but will be further propagated to add_version() in an upcoming patch.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	caaf0bd6bf	partition_version: remove _schema from partition_entry::operator<< operator<< accepts a schema& and a partition_entry&. But since the latter now contains a reference to its schema inside, the former is redundant. Remove it.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	f6e11c95e2	partition_version: remove the schema argument from partition_entry::read() partition_entry now contains a reference to its schema, so it no longer needs to be supplied by the caller.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	1d01a4a168	partition_version: add a _schema field to partition_version Currently, partition_version does not reference its schema. All partition_version reachable from a entry/snapshot have the same schema, which is referenced in memtable_entry/cache_entry/partition_snapshot. To enable gentle schema upgrades, we want to use the existing background version merging mechanism. To achieve that, we will move the schema reference into partition_version, and we will allow neighbouring MVCC versions to have different schemas, and we will merge them on-the-fly during reads and persistently during background version merges. This way, an upgrade will boil down to adding a new empty version with the new schema. This patch adds the _schema field to partition_version and propagates the schema pointer to it from the version's containers (entry/snapshot). Subsequent patches will remove the schema references from the containers, because they are now redundant.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	a70c5704df	mutation_partition: change schema_ptr to schema& in mutation_partition constructor Cosmetic change. See the preceding commit for details.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	781514acfe	mutation_partition_v2: change schema_ptr to schema& in mutation_partition_v2 constructor We don't have a convention for when to pass `schema_ptr` and and when to pass `const schema&` around. In general, IMHO the natural convention for such a situation is to pass the shared pointer if the callee might extend the lifetime of shared_ptr, and pass a reference otherwise. But we convert between them willy-nilly through shared_from_this(). While passing a reference to a function which actually expects a shared_ptr can make sense (e.g. due to the fact that smart pointers can't be passed in registers), the other way around is rather pointless. This patch takes one occurence of that and modifies the parameter to a reference. Since enable_shared_from_this makes shared pointer parameters and reference parameters interchangeable, this is a purely cosmetic change.	2023-05-04 02:37:29 +02:00
Michał Chojnowski	88a0871729	mutation_partition: remove apply_weak() apply_weak is just an alias for apply(), and most of its variants are dead code. Get rid of it.	2023-05-04 02:37:29 +02:00
Kefu Chai	3ae11de204	treewide: do not define/capture unused variables these warnings are found by Clang-17 after removing `-Wno-unused-lambda-capture` and '-Wno-unused-variable' from the list of disabled warnings in `configure.py`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2023-02-28 21:56:53 +08:00
Avi Kivity	c5e4bf51bd	Introduce mutation/ module Move mutation-related files to a new mutation/ directory. The names are kept in the global namespace to reduce churn; the names are unambiguous in any case. mutation_reader remains in the readers/ module. mutation_partition_v2.cc was missing from CMakeLists.txt; it's added in this patch. This is a step forward towards librarization or modularization of the source base. Closes #12788	2023-02-14 11:19:03 +02:00
Avi Kivity	f73e2c992f	Merge 'Keep range tombstones with rows in memtables and cache' from Tomasz Grabiec This series switches memtable and cache to use a new representation for mutation data, called `mutation_partition_v2`. In this representation, range tombstone information is stored in the same tree as rows, attached to row entries. Each entry has a new tombstone field, which represents range tombstone part which applies to the interval between this entry and the previous one. See docs/dev/mvcc.md for more details about the format. The transient mutation object still uses the old model in order to avoid work needed to adapt old code to the new model. It may also be a good idea to live with two models, since the transient mutation has different requirements and thus different trade-offs can be made. Transient mutation doesn't need to support eviction and strong exception guarantees, so its algorithms and in-memory representation can be simpler. This allows us to incrementally evict range tombstone information. Before this series, range tombstones were accumulated and evicted only when the whole partition entry was evicted. This could lead to inefficient use of cache memory. Another advantage of the new representation is that reads don't have to lookup range tombstone information in a different tree while reading. This leads to simpler and more efficient readers. There are several disadvantages too. Firstly, rows_entry is now larger by 16 bytes. Secondly, update algorithms are more complex because they need to deoverlap range tombstone information. Also, to handle preemption and provide strong exception guarantees, update algorithms may need to allocate sentinel entries, which adds complexity and reduces performance. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Remaining work: - performance optimizations to apply_monotonically() to avoid regressions - performance testing - preemption support in apply_to_incomplete (cache update from memtable) Fixes #2578 Fixes #3288 Fixes #10587 Closes #12048 * github.com:scylladb/scylladb: test: mvcc: Extend some scenarios with exhaustive consistency checks on eviction test: mvcc: Extract mvcc_container::allocate_in_region() row_cache, lru: Introduce evict_shallow() test: mvcc: Avoid copies of mutation under failure injection test: mvcc: Add missing logalloc::reclaim_lock to test_apply_is_atomic mutation_partition_v2: Avoid full scan when applying mutation to non-evictable Pass is_evictable to apply() tests: mutation_partition_v2: Introduce test_external_memory_usage_v2 mirroring the test for v1 tests: mutation: Fix test_external_memory_usage() to not measure mutation object footprint tests: mutation_partition_v2: Add test for exception safety of mutation merging tests: Add tests for the mutation_partition_v2 model mutation_partition_v2: Implement compact() cache_tracker: Extract insert(mutation_partition_v2&) mvcc, mutation_partition: Document guarantees in case merging succeeds mutation_partition_v2: Accept arbitrary preemption source in apply_monotonically() mutation_partition_v2: Simplify get_continuity() row_cache: Distinguish dummy insertion site in trace log db: Use mutation_partition_v2 in mvcc range_tombstone_change_merger: Introduce peek() readers: Extract range_tombstone_change_merger mvcc: partition_snapshot_row_cursor: Handle non-evictable snapshots mvcc: partition_snapshot_row_cursor: Support digest calculation mutation_partition_v2: Store range tombstones together with rows db: Introduce mutation_partition_v2 doc: Introduce docs/dev/mvcc.md db: cache_tracker: Introduce insert() variant which positions before existing entry in the LRU db: Print range_tombstone bounds as position_in_partition test: memtable_test: Relax test_segment_migration_during_flush test: cache_flat_mutation_reader: Avoid timestamp clash test: cache_flat_mutation_reader_test: Use monotonic timestamps when inserting rows test: mvcc: Fix sporadic failures due to compact_for_compaction() test: lib: random_mutation_generator: Produce partition tombstone less often test: lib: random_utils: Introduce with_probability() test: lib: Improve error message in has_same_continuity() test: mvcc: mvcc_container: Avoid UB in tracker() getter when there is no tracker test: mvcc: Insert entries in the tracker test: mvcc_test: Do not set dummy::no on non-clustering rows mutation_partition: Print full position in error report in append_clustered_row() db: mutation_cleaner: Extract make_region_space_guard() position_in_partition: Optimize equality check mvcc: Fix version merging state resetting mutation_partition: apply_resume: Mark operator bool() as explicit	2023-02-05 22:33:10 +02:00
Raphael S. Carvalho	3c5afb2d5c	test: Enable Scylla test command line options for boost tests We have enabled the command line options without changing a single line of code, we only had to replace old include with scylla_test_case.hh. Next step is to add x-log-compaction-groups options, which will determine the number of compaction groups to be used by all instantiations of replica::table. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2023-02-01 20:14:51 -03:00
Tomasz Grabiec	c9c476afd7	test: mvcc: Extend some scenarios with exhaustive consistency checks on eviction	2023-01-27 21:56:31 +01:00
Tomasz Grabiec	80de99cb1b	test: mvcc: Extract mvcc_container::allocate_in_region()	2023-01-27 21:56:31 +01:00
Tomasz Grabiec	f2832046e9	test: mvcc: Avoid copies of mutation under failure injection Speeds up the test a bit because we avoid the copy when converting to mutation_partition_v2 in apply().	2023-01-27 21:56:31 +01:00
Tomasz Grabiec	b8980f68f0	test: mvcc: Add missing logalloc::reclaim_lock to test_apply_is_atomic	2023-01-27 21:56:31 +01:00
Tomasz Grabiec	bc35fa7696	Pass is_evictable to apply()	2023-01-27 21:56:31 +01:00
Tomasz Grabiec	919ff433d1	tests: Add tests for the mutation_partition_v2 model	2023-01-27 21:56:31 +01:00
Tomasz Grabiec	026f8cc1e7	db: Use mutation_partition_v2 in mvcc This patch switches memtable and cache to use mutation_partition_v2, and all affected algorithms accordingly. The memtable reader was changed to use the same cursor implementation which cache uses, for improved code reuse and reducing risk of bugs due to discrepancy of algorithms which deal with MVCC. Range tombstone eviction in cache has now fine granularity, like with rows. Fixes #2578 Fixes #3288 Fixes #10587	2023-01-27 21:56:28 +01:00
Tomasz Grabiec	71057412ed	test: mvcc: Fix sporadic failures due to compact_for_compaction() compact_for_compaction() will perform cell expiration based on gc_clock::now(), which introduces sporadic mismatches due to expiry status of a row marker. Drop this, we can rely on compaction done by is_equal_to_compacted()	2023-01-27 19:15:38 +01:00
Tomasz Grabiec	08f68c5f20	test: mvcc: mvcc_container: Avoid UB in tracker() getter when there is no tracker	2023-01-27 19:15:38 +01:00
Tomasz Grabiec	5aa8cb56a8	test: mvcc: Insert entries in the tracker evictable snapshots must have all entries added to the tracker. Partition version merging assumes this. Before this was benign, but will start to trigger asserts in mutation_partition_v2.	2023-01-27 19:15:38 +01:00
Tomasz Grabiec	9d38997971	test: mvcc_test: Do not set dummy::no on non-clustering rows This will trigger an assert in apply_monotonically() later in the series, where this row would be merged with a dummy at the same position. This row must not be marked as non-dummy, there is an assumption that non-clustering positions are all dummies. There can't be two entries with the same position an a different dummy status.	2023-01-27 19:15:38 +01:00
Michał Radwański	dcab289656	boost/mvcc_test: use failure_injecting_allocation_strategy where it is meant to In test_apply_is_atomic, a basic form of exception testing is used. There is failure_injecting_allocation_strategy, which however is not used for any allocation, since for some reason, `with_allocator(r.allocator()` is used instead of `with_allocator(alloc`. Fix that. Closes #12354	2023-01-10 12:01:36 +01:00
Pavel Emelyanov	6075e01312	test/lib: Remove sstable_utils.hh from simple_schema.hh The latter is pretty popular test/lib header that disseminates the former one over whole lot of unit tests. The former, in turn, naturally includes sstables.hh thus making tons of unrelated tests depend on sstables class unused by them. However, simple removal doesn't work, becase of local_shard_only bool class definition in sstable_utils.hh used in simple_schema.hh. This thing, in turn, is used in keys making helpers that don't belong to sstable utils, so these are moved into simple_schema as well. When done, this affects the mutation_source_test.hh, which needs the local_shard_only bool class (and helps spreading the sstables.hh throughout more unrelated tests) and a bunch of .cc test sources that used sstable_utils.hh to indirectly include various headers of their demand. After patching, sstables.hh touches 2x times less tests. As a side effect the sstables_manager.hh also becomes 2x times less dependent on by tests. Continuation of `9bdea110a6` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes #12240	2022-12-08 15:37:33 +02:00
Avi Kivity	444de2831e	dirty_memory_manager: move to replica module It's a replica-side thing, so move it there. The related flush_permit and sstable_write_permit are moved alongside.	2022-12-06 22:24:17 +02:00
Michał Chojnowski	d785364375	cache: make all cache unlinks explicit Our LSA cache is implemented as an auto_unlink Boost intrusive list, meaning that elements of the list unlink themselves from the list automatically on destruction. Some parts of the code rely on that, and don't unlink them manually. However, this precludes accurate bookkeeping about the cache. Elements only have access to themselves and their neighbours, not to any bookkeeping context. Therefore, a destructor cannot update the relevant metadata. In this patch, we fix this by adding explicit unlink calls to places where it would be done by a destructor. In a following patch, we will add an assert to the destructor to check that every element is unlinked before destruction.	2022-10-17 12:07:27 +02:00
Benny Halevy	0627667a06	mutation_partition: compact_for_compaction: get tombstone_gc_state And pass down to `do_compact`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:15 +03:00
Tomasz Grabiec	05b0a62132	test: mvcc: Fix illegal use of maybe_refresh() maybe_refresh() can only be called if the cursor is pointing at a row. The code was calling it before the cursor was advanced, and was thus relying on implementation detail.	2022-08-09 02:28:56 +02:00
Tomasz Grabiec	02c92d5ea2	test: mutation: Compare against compacted mutations Memtables and cache will compact eagerly, so tests should not expect readers to produce exact mutations written, only those which are equivalant after applying copmaction.	2022-06-15 11:30:01 +02:00
Tomasz Grabiec	a4e96960b8	mvcc: Apply mutations in memtable with preemption enabled Preerequisite for eagerly applying tombstones, which we want to be preemptible. Before the patch, apply path to the memtable was not preemptible. Because merging can now be defered, we need to involve snapshots to kick-off background merging in case of preemption. This requires us to propagate region and cleaner objects, in order to create a snapshot.	2022-06-15 11:29:43 +02:00
Avi Kivity	5129280f45	Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec" This reverts commit `e0670f0bb5`, reversing changes made to `605ee74c39`. It causes failures in debug mode in database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain, though with low probability. Fixes #10780 Reopens #652.	2022-06-14 18:06:22 +03:00
Tomasz Grabiec	374234cf76	test: mutation: Compare against compacted mutations Memtables and cache will compact eagerly, so tests should not expect readers to produce exact mutations written, only those which are equivalant after applying copmaction.	2022-06-06 19:25:40 +02:00
Tomasz Grabiec	0e3c4fc641	mvcc: Apply mutations in memtable with preemption enabled Preerequisite for eagerly applying tombstones, which we want to be preemptible. Before the patch, apply path to the memtable was not preemptible. Because merging can now be defered, we need to involve snapshots to kick-off background merging in case of preemption. This requires us to propagate region and cleaner objects, in order to create a snapshot.	2022-06-06 19:23:37 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes we applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster wide repair is not performed within gc_grace_seconds. This design pushes the job of making the database consistency to the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system which can be slowed down by the higher priority workload, so that there is no guarantee when a repair can finish. A gc_grace_seconds value that is used to work might not work after data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode. If no tombstone_gc option is specified by the user. The old gc_grace_seconds based gc will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly delay before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
Tomasz Grabiec	d0c367f44f	mvcc: partition_snapshot: Support slicing range tombstones in reverse	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	757fc1275f	partition_snapshot_row_cursor: Support reverse iteration	2021-12-19 22:41:35 +01:00
Tomasz Grabiec	0d7b3f9463	tests: mvcc: Relax monotonicity check Consecutive range tombstones can have the same position. They will, in one of the test cases, after the range tombstone merger in partition_snapshot_flat_reader no longer uses range_tombstone_list to merge data form multiple versions, which deoverlaps, but rather merges the streams corresponding to each version, which interleaves range tombstones from different versions.	2021-07-26 17:27:03 +02:00
Botond Dénes	2d2b9e7b36	test/boost: migrate off the global test reader semaphore	2021-07-08 16:53:38 +03:00
Avi Kivity	a55b434a2b	treewide: extent copyright statements to present day	2021-06-06 19:18:49 +03:00
Pavel Emelyanov	4558eb3afc	partition_snapshot_row_cursor: Move cells hash creation to reader Right now call to .row() method may create hash on row's cells. It's counterintuitive to see a const method that transparently changes something it points to. Since the only caller of a row() who knows whether the hash creation is required is the cache reader, it's better to move the call to prepare_hash() into it. Other than making the .row() less surprising this also helps to get rid of the whole method by the next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:18:29 +03:00
Pavel Emelyanov	00caf5f219	partition_snapshot_row_cursor: Move read_partition into test The method in question is test-only helper, there's no need in keeping it as a part of the API. Another reason to move is that the method is O(number of rows) and doesn't preempt while looping, but cursor code users try hard not to stall the reactor. So even though this method has a meaningful semantics within the class, it will better be reinvented if needed in core code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-04-09 12:16:13 +03:00
Calle Wilund	4b65d67a1a	partition_version: Change range_tombstones() to return chunked_vector Refs #7364 The number of tombstones can be large. As a stopgap measure to just returning a source range (with keepalive), we can at least alleviate the problem by using a chunked vector. Closes #7433	2020-10-26 11:54:42 +02:00
Botond Dénes	6ca0464af5	mutation_fragment: add schema and permit We want to start tracking the memory consumption of mutation fragments. For this we need schema and permit during construction, and on each modification, so the memory consumption can be recalculated and pass to the permit. In this patch we just add the new parameters and go through the insane churn of updating all call sites. They will be used in the next patch.	2020-09-28 11:27:23 +03:00
Konstantin Osipov	ff3f9cb7cf	test: stop using BOOST_TEST_MESSAGE() for logging We use boost test logging primarily to generate nice XML xunit files used in Jenkins. These XML files can be bloated with messages from BOOST_TEST_MESSAGE(), hundreds of megabytes of build archives, on every build. Let's use seastar logger for test logging instead, reserving the use of boost log facilities for boost test markup information.	2020-03-05 11:38:11 +03:00
Avi Kivity	6728b96df7	clustering_interval_set: split to own header file clustering_interval_set is a rarely used class, but one that requires boost/icl, which is quite heavyweight. To speed up compilation, move it to its own header and sprinkle #includes where needed. Tests: unit (dev) Message-Id: <20200214190507.1137532-1-avi@scylladb.com>	2020-02-16 17:40:47 +02:00
Konstantin Osipov	1c8736f998	tests: move all test source files to their new locations 1. Move tests to test (using singular seems to be a convention in the rest of the code base) 2. Move boost tests to test/boost, other (non-boost) unit tests to test/unit, tests which are expected to be run manually to test/manual. Update configure.py and test.py with new paths to tests.	2019-12-16 17:47:42 +03:00

50 Commits