Commit Graph

241 Commits

Benny Halevy
0627667a06 mutation_partition: compact_for_compaction: get tombstone_gc_state
And pass it down to `do_compact`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-07 07:43:15 +03:00
Benny Halevy
5dd15aa3c8 tombstone_gc: introduce tombstone_gc_state
and use it to access the repair history maps.

In this introductory patch, we use a default-constructed
tombstone_gc_state to access the thread-local maps
temporarily; those use sites will be replaced
in following patches that gradually pass
the tombstone_gc_state down from the compaction_manager
to where it is used.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2022-09-06 23:02:54 +03:00
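The idea behind the commit above can be sketched as a small handle that owns access to the repair history and is passed down as a parameter instead of being reached for via thread-locals. The types below (`table_id`, `repair_history_map`, `can_gc`) are illustrative stand-ins, not Scylla's actual API:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical stand-ins for Scylla's types: a per-table repair history
// mapping a table id to the time up to which its ranges are repaired.
using table_id = std::string;
using repair_history_map = std::map<table_id, long /* repaired-up-to, seconds */>;

// Sketch of the tombstone_gc_state idea: a small handle that mediates
// access to the repair history, so callers receive it as a parameter
// instead of touching thread-local maps directly.
class tombstone_gc_state {
    const repair_history_map* _repair_history = nullptr;
public:
    tombstone_gc_state() = default;  // transitional default used at first
    explicit tombstone_gc_state(const repair_history_map& m) : _repair_history(&m) {}

    // A tombstone written at deletion_time may be GC-ed only if the
    // table's ranges were repaired after it was written.
    bool can_gc(const table_id& t, long deletion_time) const {
        if (!_repair_history) {
            return false;  // no repair info: never drop the tombstone
        }
        auto it = _repair_history->find(t);
        return it != _repair_history->end() && it->second >= deletion_time;
    }
};
```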
Tomasz Grabiec
56e5b6f095 db: mutation_partition: Drop unnecessary maybe_shadow()
It is performed inside row_tombstone::apply() invoked in the preceding line.
2022-08-17 17:39:54 +02:00
Tomasz Grabiec
9c66c9b3f0 db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone
When the row has a live row marker which shadows the shadowable
tombstone, the shadowable tombstone should not be effective. The code
assumes that _shadowable always reflects the current tombstone, so
maybe_shadow() needs to be called whenever the marker or the regular
tombstone changes. This was not ensured by row::apply(tombstone).

This causes problems in tests that use random_mutation_generator,
which generates mutations violating this invariant; as a result,
mutation commutativity would be violated.

I am not aware of problems in production code.
2022-08-17 17:34:13 +02:00
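A minimal model of the invariant described above, with illustrative names and layout rather than Scylla's actual types: the shadowable tombstone is effective only while no live row marker is newer, and `maybe_shadow()` must be re-checked whenever either the marker or the tombstone changes:

```cpp
#include <algorithm>
#include <cassert>

// Toy model of the shadowable-tombstone invariant (illustrative, not
// Scylla's actual row/tombstone types).
struct row_state {
    static constexpr long missing = -1;
    long marker_ts = missing;       // live row marker timestamp
    long shadowable_ts = missing;   // shadowable tombstone timestamp
    bool shadowable_effective = false;

    void maybe_shadow() {
        // A live marker newer than the shadowable tombstone suppresses it.
        shadowable_effective = shadowable_ts != missing
                            && shadowable_ts >= marker_ts;
    }
    void apply_marker(long ts) {
        marker_ts = std::max(marker_ts, ts);
        maybe_shadow();
    }
    // The bug: applying a tombstone without calling maybe_shadow() leaves
    // shadowable_effective stale; the fix re-checks here as well.
    void apply_tombstone(long ts) {
        shadowable_ts = std::max(shadowable_ts, ts);
        maybe_shadow();
    }
};
```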
Botond Dénes
778f5adde7 mutation_partition: row: make row marker shadowing symmetric
Currently row marker shadowing the shadowable tombstone is only checked
in `apply(row_marker)`. This means that shadowing will only be checked
if the shadowable tombstone and row marker are set in the correct order.
This at the very least can cause flakiness in tests when a mutation
produced just the right way has a shadowable tombstone that can be
eliminated when the mutation is reconstructed in a different way,
leading to artificial differences when comparing those mutations.

This patch fixes this by checking shadowing in
`apply(shadowable_tombstone)` too, making the shadowing check symmetric.

There is still one vulnerability left: `row_marker& row_marker()`, which
allows overwriting the marker without triggering the corresponding
checks. We cannot remove this overload as it is used by compaction, so we
just add a comment to it warning that `maybe_shadow()` has to be manually
invoked if it is used to mutate the marker (compaction takes care of
that). A caller which didn't do the manual check is
mutation_source_test: this patch updates it to use `apply(row_marker)`
instead.

Fixes: #9483

Tests: unit(dev)

Closes #9519
2022-08-17 17:22:13 +02:00
Avi Kivity
8d37370a71 Revert "Merge "memtable-sstable: Add compacting reader when flushing memtable." from Mikołaj"
This reverts commit bcadd8229b, reversing
changes made to cf528d7df9. Since
4bd4aa2e88 ("Merge 'memtable, cache: Eagerly
compact data with tombstones' from Tomasz Grabiec"), memtable is
self-compacting and the extra compaction step only reduces throughput.

The unit test in memtable_test.cc is not reverted as proof that the
revert does not cause a regression.

Closes #11243
2022-08-09 11:23:29 +03:00
Michał Chojnowski
a061eb9e76 mutation_fragment: pass the applied row by reference in clustering_row::apply()
Currently, clustering_row::apply() takes deletable_row by reference, but
copies it before passing it to deletable_row::apply(). This is more expensive
than passing the reference down (by about 1800 instructions for
perf_simple_query rows).
2022-06-20 15:22:17 +02:00
Tomasz Grabiec
169025d9b4 memtable: Add counters for tombstone compaction 2022-06-15 11:30:25 +02:00
Tomasz Grabiec
94f9109bea memtable, cache: Eagerly compact data with tombstones
When the memtable receives a tombstone, it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.
2022-06-15 11:30:25 +02:00
Tomasz Grabiec
cd523214a2 mvcc: Introduce apply_resume to hold state for partition version merging
Partition version merging is preemptable. It may stop in the middle
and be resumed later. Currently, all state is kept inside the versions
themselves, in the form of elements in the source version which are
yet to be moved. This will change once we add compaction (tombstones
with rows) into the merging algorithm. There, state cannot be encoded
purely within versions. Consider applying a partition tombstone over
a large number of rows.

This patch introduces an apply_resume object to hold the necessary state to
ensure forward progress in case of preemption.

No change in behavior yet.
2022-06-15 11:30:01 +02:00
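The resume-state idea can be sketched as follows (hypothetical names and a trivial "merge" standing in for per-row work): when the preemptable operation runs out of budget, the position it reached is carried in a small state object and the caller re-invokes with it to continue:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch of the apply_resume idea: the merge position cannot always be
// recovered from the versions themselves, so it lives in this object.
struct apply_resume {
    std::size_t next_row = 0;  // first row not yet processed
};

// Merge src into dst, processing at most `budget` rows per call.
// Returns true when finished; otherwise the caller re-invokes with the
// same apply_resume to continue where it left off.
bool apply_some(std::vector<int>& dst, const std::vector<int>& src,
                apply_resume& res, std::size_t budget) {
    while (res.next_row < src.size() && budget--) {
        dst.push_back(src[res.next_row++]);  // stand-in for per-row merge work
    }
    return res.next_row == src.size();
}
```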
Tomasz Grabiec
44bb9d495b mutation_partition: Extract deletable_row::compact_and_expire() 2022-06-15 11:30:01 +02:00
Avi Kivity
5129280f45 Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec"
This reverts commit e0670f0bb5, reversing
changes made to 605ee74c39. It causes failures
in debug mode in
database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain,
though with low probability.

Fixes #10780
Reopens #652.
2022-06-14 18:06:22 +03:00
Tomasz Grabiec
0bc45f9666 memtable: Add counters for tombstone compaction 2022-06-06 19:25:41 +02:00
Tomasz Grabiec
beadd248e3 memtable, cache: Eagerly compact data with tombstones
When the memtable receives a tombstone, it can happen under some workloads
that it covers data which is still in the memtable. Some workloads may
insert and delete data within a short time frame. We could reduce the
rate of memtable flushes if we eagerly drop tombstoned data.

One workload which benefits is the raft log. It stores a row for each
uncommitted raft entry. When entries are committed they are
deleted. So the live set is expected to be short under normal
conditions.

Fixes #652.
2022-06-06 19:25:41 +02:00
Tomasz Grabiec
989ef88e26 mvcc: Introduce apply_resume to hold state for partition version merging
Partition version merging is preemptable. It may stop in the middle
and be resumed later. Currently, all state is kept inside the versions
themselves, in the form of elements in the source version which are
yet to be moved. This will change once we add compaction (tombstones
with rows) into the merging algorithm. There, state cannot be encoded
purely within versions. Consider applying a partition tombstone over
a large number of rows.

This patch introduces an apply_resume object to hold the necessary state to
ensure forward progress in case of preemption.

No change in behavior yet.
2022-06-06 19:25:41 +02:00
Tomasz Grabiec
080c403d0b mutation_partition: Extract deletable_row::compact_and_expire() 2022-06-06 19:23:37 +02:00
Pavel Emelyanov
645896335d code: Convert is_same+result_of assertions into invocable concepts
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2022-02-24 19:46:10 +03:00
Avi Kivity
fcb8d040e8 treewide: use Software Package Data Exchange (SPDX) license identifiers
Instead of lengthy blurbs, switch to single-line, machine-readable
standardized (https://spdx.dev) license identifiers. The Linux kernel
switched long ago, so there is strong precedent.

Three cases are handled: AGPL-only, Apache-only, and dual licensed.
For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0),
reasoning that our changes are extensive enough to apply our license.

The changes were applied mechanically with a script, except to
licenses/README.md.

Closes #9937
2022-01-18 12:15:18 +01:00
Asias He
a8ad385ecd repair: Get rid of the gc_grace_seconds
The gc_grace_seconds is a very fragile and broken design inherited from
Cassandra. Deleted data can be resurrected if a cluster-wide repair is not
performed within gc_grace_seconds. This design pushes the job of keeping
the database consistent onto the user. In practice, it is very hard to
guarantee that repair is performed within gc_grace_seconds all the time. For
example, the repair workload has the lowest priority in the system and can
be slowed down by higher-priority workloads, so there is no
guarantee when a repair can finish. A gc_grace_seconds value that
used to work might not work after data volume grows in a cluster. Users
might want to avoid running repair during a specific period where
latency is the top priority for their business.

To solve this problem, an automatic mechanism to protect data
resurrection is proposed and implemented. The main idea is to remove the
tombstone only after the range that covers the tombstone is repaired.

In this patch, a new table option tombstone_gc is added. The option is
used to configure tombstone gc mode. For example:

1) GC a tombstone after gc_grace_seconds

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ;

This is the default mode. If no tombstone_gc option is specified by the
user, the old gc_grace_seconds-based gc will be used.

2) Never GC a tombstone

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'};

3) GC a tombstone immediately

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'};

4) GC a tombstone after repair

cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'};

In addition to the 'mode' option, another option, 'propagation_delay_in_seconds',
is added. It defines the maximum time a write could possibly be delayed
before it eventually arrives at a node.

A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc
option can only be used after the whole cluster supports the new
feature. A mixed cluster works with no problem.

Tests: compaction_test.py, ninja test

Fixes #3560

[avi: resolve conflicts vs data_dictionary]
2022-01-04 19:48:14 +02:00
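The four modes above boil down to a GC decision per tombstone. A minimal sketch of that decision, with illustrative names and plain seconds for time (not Scylla's implementation):

```cpp
#include <cassert>

// Sketch of how the four tombstone_gc modes translate into a GC decision
// (illustrative, not Scylla's code). All times are in seconds.
enum class tombstone_gc_mode { timeout, disabled, immediate, repair };

bool can_gc_tombstone(tombstone_gc_mode mode,
                      long deletion_time, long now, long gc_grace_seconds,
                      long range_repaired_at /* latest repair covering the range */) {
    switch (mode) {
    case tombstone_gc_mode::timeout:    // classic gc_grace_seconds behavior
        return now >= deletion_time + gc_grace_seconds;
    case tombstone_gc_mode::disabled:   // never GC
        return false;
    case tombstone_gc_mode::immediate:  // GC as soon as possible
        return true;
    case tombstone_gc_mode::repair:     // GC only after the range was repaired
        return range_repaired_at >= deletion_time;
    }
    return false;
}
```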
Mikołaj Sielużycki
6dd9f63f3b memtable-sstable: Track existence of tombstones in memtable.
Add flags indicating whether a memtable contains tombstones. They can be
used as a heuristic to determine whether a memtable should be compacted on
flush. It's an intermediate step until we can compact while applying
mutations to a memtable.
2021-11-29 13:06:12 +01:00
Botond Dénes
b136746040 mutation_partition: deletable_row::apply(shadowable_tombstone): remove redundant maybe_shadow()
Shadowing is already checked by the underlying row_tombstone::apply().
This redundant check was introduced by a previous fix to #9483
(6a76e12768). The rest of that patch is
good.

Refs: #9483
Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20211115091513.181233-1-bdenes@scylladb.com>
2021-11-15 17:50:41 +01:00
Botond Dénes
6a76e12768 mutation_partition: row: make row marker shadowing symmetric
Currently row marker shadowing the shadowable tombstone is only checked
in `apply(row_marker)`. This means that shadowing will only be checked
if the shadowable tombstone and row marker are set in the correct order.
This at the very least can cause flakiness in tests when a mutation
produced just the right way has a shadowable tombstone that can be
eliminated when the mutation is reconstructed in a different way,
leading to artificial differences when comparing those mutations.

This patch fixes this by checking shadowing in
`apply(shadowable_tombstone)` too, making the shadowing check symmetric.

There is still one vulnerability left: `row_marker& row_marker()`, which
allows overwriting the marker without triggering the corresponding
checks. We cannot remove this overload as it is used by compaction, so we
just add a comment to it warning that `maybe_shadow()` has to be manually
invoked if it is used to mutate the marker (compaction takes care of
that). A caller which didn't do the manual check is
mutation_source_test: this patch updates it to use `apply(row_marker)`
instead.

Fixes: #9483

Tests: unit(dev)

Closes #9519
2021-10-26 20:40:31 +02:00
Pavel Emelyanov
b3c89787be mutation_partition: Return immutable collection for range tombstones
Patch .row_tombstones() to return the range_tombstone_list
wrapped in immutable_collection<> so that callers are
guaranteed not to touch the collection itself, but can still
modify the tombstones.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
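The wrapper described above can be sketched like this (a minimal illustration, not Scylla's actual `immutable_collection<>`): the container's shape is frozen because no mutating container operations are exposed, yet iteration still yields mutable elements:

```cpp
#include <cassert>
#include <list>

// Minimal sketch of the immutable_collection<> idea: callers cannot
// insert/erase (no such members exposed), but can still mutate the
// elements through the non-const iterators.
template <typename Container>
class immutable_collection {
    Container& _c;
public:
    explicit immutable_collection(Container& c) : _c(c) {}
    auto begin() { return _c.begin(); }  // element access stays mutable
    auto end() { return _c.end(); }
    auto size() const { return _c.size(); }
    // Note: no insert(), erase(), clear(), ... -- the shape is frozen.
};
```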
Pavel Emelyanov
1bf643d4fd mutation_partition: Pin mutable access to range tombstones
Some callers of mutation_partition::row_tombstones() don't want to
(and shouldn't) modify the list itself, while they may want to
modify the tombstones. This patch explicitly locates those that
need to modify the collection, because the next patch will
return an immutable collection for the others.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
05b8cdfd24 mutation_partition: Return immutable collection for rows
Patch the .clustered_rows() method to return the btree of rows
wrapped in immutable_collection<> so that callers are
guaranteed not to touch the collection itself, but can still
modify the elements in it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
ad27bf40e6 mutation_partition: Pin mutable access to rows
Some callers of mutation_partition::clustered_rows() don't want to
(and shouldn't) modify the tree of rows, while they may want to
modify the rows themselves. This patch explicitly locates those
that need to modify the collection, because the next patch will
return an immutable collection for the others.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Pavel Emelyanov
a9b4fa9db3 mutation_partition: Shuffle declarations
The methods that provide access to the enclosed collections of rows
and range tombstones are intermixed, so group them for smoother
patching in the next commits, and mark them noexcept while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-07-27 20:06:53 +03:00
Tomasz Grabiec
7fa4e10aa0 row_cache: Use generic LRU for eviction
In preparation for tracking different kinds of objects, not just
rows_entry, in the LRU, switch to the LRU implementation from
utils/lru.hh which can hold an arbitrary element type.
2021-07-02 10:25:58 +02:00
Pavel Solodovnikov
76bea23174 treewide: reduce header interdependencies
Use forward declarations wherever possible.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>

Closes #8813
2021-06-07 15:58:35 +03:00
Avi Kivity
a55b434a2b treewide: extend copyright statements to present day 2021-06-06 19:18:49 +03:00
Pavel Solodovnikov
fff7ef1fc2 treewide: reduce boost headers usage in scylla header files
`dev-headers` target is also ensured to build successfully.

Signed-off-by: Pavel Solodovnikov <pa.solodovnikov@scylladb.com>
2021-05-20 01:33:18 +03:00
Pavel Emelyanov
64074f45ce code: Relax position_in_partition::tri_compare users
There are some pieces left doing `res <=> 0`, with res now
being a strong_ordering itself. All of these can
simply be dropped.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-04-09 18:20:39 +03:00
Pavel Emelyanov
8bbe2eae5e btree: Convert comparator to <=>
It turned out that all users of btree can already be converted
to the safer std::strong_ordering. The only meaningful change here
is in the btree code itself -- no more ints there.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210330153648.27049-1-xemul@scylladb.com>
2021-04-01 12:56:08 +03:00
Pavel Emelyanov
9baf1226dc test/memory_footprint: Print radix tree node sizes
After switching cell storage onto the compact radix tree it
becomes useful to know the tree nodes' sizes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:41:09 +03:00
Pavel Emelyanov
1bdfa355ea row: Remove old storages
Now that the 3rd storage type (radix tree) is all in, the old
storage can be safely removed.  The result is:

1. memory footprint

sizeof(class row):  112 => 16 bytes
sizeof(rows_entry): 126 => 120 bytes

the "in cache" value depends on the number of cells:

num of cells     master       patch
         1       752         656
         2       808         712
         3       864         768
         4       920         824
         5       968         936
         6      1136         992
         ...
         16     1840        1672
         17     1904        1992  (+88)
         18     1976        2048  (+72)
         19     2048        2104  (+56)
         20     2120        2160  (+40)
         21     2184        2208  (+24)
         22     2256        2264  ( +8)
         23     2328        2320
         ...
         32     2960        2808

After 32 cells the storage switches into an rbtree with a
24-bytes-per-cell overhead, and the radix tree improvement
takes off:

           64     7872        6056
          128    15040        9512
          256    29376       18568

2. perf_mutation test is enhanced by this series and the
   results differ depending on the number of columns used

                    tps value
--column-count    master   patch
          1       59.9k    57.6k  (-3.8%)
          2       59.9k    57.5k
          4       59.8k    57.6k
          8       57.6k    57.7k  <- eq
         16       56.3k    57.6k
         32       53.2k    57.4k  (+7.9%)

A note on this. Last time the 1-column test was ~5% worse, which
was explained by the inline storage of 5 cells that is present in the
current implementation and was absent in the radix tree.

An attempt to add inline storage for small radix trees
resulted in a complete loss of the memory footprint gain, but gave
a fraction of a percent to perf_mutation performance. So this
version doesn't have inline nodes.

The 1.2% improvement over v2 surprisingly came from
tree::clone_from(), which in v2 was worked around with a slow
walk+emplace sequence, while this version has an optimized
API call for cloning.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:35:06 +03:00
Pavel Emelyanov
f006acc853 row: Introduce radix tree storage type
Currently class row uses a union of a vector and a set to keep
the cells and switches between them. Add the 3rd type with the
radix tree, but never switch to it, just to show how the operations
would look like. Later on vector and set will be removed and the
whole row will be immediately switched to the radix tree storage.

NB: All the added places have their indentation deliberately broken, so
that the next patch can just remove the surrounding (old) code
and (most of) the new code will land in its place instantly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-15 20:27:00 +03:00
Pavel Emelyanov
5c0f9a8180 mutation_partition: Switch cache of rows onto B-tree
The switch is pretty straightforward, and consists of

- change less-compare into tri-compare

- rename insert/insert_check into insert_before_hint

- use tree::key_grabber in mutation_partition::apply_monotonically to
  exception-safely transfer a row from one tree to another

- explicitly erase the row from the tree in rows_entry::on_evicted; there's
  an O(1) tree::iterator method for this

- rewrite the rows_entry -> cache_entry transformation in on_evicted to
  fit the B-tree API

- include the B-tree's external memory usage into stats

That's it. The number of keys per node is set to 12 with linear search,
and a linear root extension of 20, because

- experimenting with the tree shows that 8 through 10 keys per node with linear
  search give the best performance on stress tests for inserts/finds of
  keys that are memcmp-able arrays of bytes (which is an approximation of the
  current clustering key compare). More keys work slower, but still better
  than any bigger value with any type of search, up to 64 keys per node

- having 12 keys per node is the threshold at which the memory footprint
  for the B-tree becomes smaller than for boost::intrusive::set for partitions
  with 32+ keys

- 20 keys for linear root eats the first-split peak and still performs
  well in linear search

As a result, the footprint for the B-tree is bigger than the one for the BST only
for trees filled with 21...32 keys, by 0.1...0.7 bytes per key.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Pavel Emelyanov
306c40939b rows_entry: Generalize compare
Turn the rows_entry less-comparator's calls into a template, as
they are nothing but wrappers on top of the rows_entry tri-comparator.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-02-02 09:30:30 +03:00
Botond Dénes
9c96d74b72 mutation: remove now unused query() and query_compacted() 2021-01-22 15:36:37 +02:00
Tomasz Grabiec
15b5b286d9 Merge "frozen_mutation: better diagnostics for out-of-order and duplicate rows" from Botond
Currently, frozen mutations that contain partitions with out-of-order
or duplicate rows will trigger (if anything) an assert in
`row::append_cell()`. However, this results in poor diagnostics (if
any), as the context doesn't contain enough information on what exactly
went wrong. This results in a cryptic error message and an investigation
that can only start after looking at a coredump.

This series remedies the problem by explicitly checking for
out-of-order and duplicate rows, as early as possible, when the
supposedly empty row is created. If the row already exists (a
duplicate) or is not the last row in the partition (an out-of-order row),
an exception is thrown and deserialization is aborted. To further
improve diagnostics, the partition context is also added to the
exception.

Tests: unit(release)

* botond/frozen-mutation-bad-row-diagnostics/v3:
  frozen_mutation: add partition context to errors coming from deserializing
  partition_builder: accept_row(): use append_clustering_row()
  mutation_partition: add append_clustered_row()
2021-01-10 19:30:12 +02:00
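The early check the series above describes can be sketched as follows, with toy types and the function name borrowed from the series (`append_clustered_row`), a plain int standing in for the clustering key, and `std::runtime_error` standing in for the real exception type:

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Sketch of the append-time check (toy types): rows deserialized from a
// frozen mutation must arrive in strictly increasing order, so a
// duplicate or out-of-order key is rejected with a descriptive exception
// instead of tripping an assert deep in row::append_cell().
void append_clustered_row(std::vector<int>& rows, int key) {
    if (!rows.empty() && key <= rows.back()) {
        throw std::runtime_error(
            key == rows.back() ? "duplicate row in frozen mutation"
                               : "out-of-order row in frozen mutation");
    }
    rows.push_back(key);
}
```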
Pavel Emelyanov
72c2482f73 mutation-partition: Construct rows_entry directly from clustering_row
When a rows_entry is added to the row_cache it's constructed from a
clustering_row by unpacking all its internals and putting
them into the rows_entry's deletable_row. There's a shorter
way -- the clustering_row already has the deletable_row onboard,
from which rows_entry can copy-construct its own.

This lets us keep the rows_entry and deletable_row sets of
constructors a bit shorter.

tests: unit(dev)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20201224161112.20394-1-xemul@scylladb.com>
2020-12-24 18:13:44 +02:00
Botond Dénes
63ea36e277 mutation_partition: add append_clustered_row()
A variant of `clustered_row()` which throws if the row already exists, or
if any greater row already exists.
2020-12-02 15:08:32 +02:00
Tomasz Grabiec
a22645b7dd Merge "Unfriend rows_entry, cache_tracker and mutation_partition" from Pavel Emelyanov
The classes touch each other's private data for no real
reason. Putting the interaction behind an API makes it easier
to track the usage.

* xemul/br-unfriends-in-row-cache-2:
  row cache: Unfriend classes from each other
  rows_entry: Move container/hooks types declarations
  rows_entry: Simplify LRU unlink
  mutation_partition: Define .replace_with method for rows_entry
  mutation_partition: Use rows_entry::apply_monotonically
2020-09-22 21:18:14 +02:00
Tomasz Grabiec
1f6c4f945e mutation_partition: Fix typo
drien -> driven

Message-Id: <1600103287-4948-1-git-send-email-tgrabiec@scylladb.com>
2020-09-15 10:09:15 +02:00
Pavel Emelyanov
bf4063d78e row cache: Unfriend classes from each other
Now cache_tracker, mutation_partition and rows_entry do not
need to be friends.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Pavel Emelyanov
7a1265a338 rows_entry: Move container/hooks types declarations
Define the container types near the containing elements' hook
members, so that they can be private without the need
to make the classes friends of each other.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Pavel Emelyanov
7ed1e18a13 rows_entry: Simplify LRU unlink
The cache_tracker tries to access a private member of
rows_entry to unlink it, but the lru_type is auto_unlink
and the entry can unlink itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Pavel Emelyanov
7f2c6aed50 mutation_partition: Define .replace_with method for rows_entry
This method is needed to hide the guts of rows_entry from mutation_partition.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2020-09-11 16:35:51 +03:00
Piotr Sarna
7b329f7102 digest: add null values to row digest
With the new hashing routine, null values are taken into account
when computing the row digest. The previous behavior had a regression
which stopped computing the hash after the first null value
was encountered, but the original behavior was also prone
to errors - e.g. the row [1, NULL, 2] was not distinguishable
from [1, 2, NULL], because their hashes were identical.
This hashing is not yet active - it will only be used after
the next commit introduces a proper cluster feature for it.
2020-09-10 13:16:44 +02:00
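The fix above can be illustrated with a toy digest (FNV-1a here, not Scylla's actual hashing): feeding an explicit marker for null cells keeps hashing past the first null and makes [1, NULL, 2] distinguishable from [1, 2, NULL]:

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <vector>

// Toy row digest illustrating the null-handling fix (not Scylla's code).
uint64_t row_digest(const std::vector<std::optional<uint64_t>>& cells) {
    uint64_t h = 1469598103934665603ull;  // FNV-1a offset basis
    auto feed = [&](uint64_t v) { h = (h ^ v) * 1099511628211ull; };
    for (const auto& c : cells) {
        if (c) {
            feed(1);   // "present" tag, so value bytes can't mimic a null
            feed(*c);
        } else {
            feed(0);   // explicit null marker -- hashing continues past it
        }
    }
    return h;
}
```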
Paweł Dziepak
6f46010235 appending_hash<row>: make publicly visible
The appending_hash<row> specialisation is declared and defined in a *.cc file,
which means it cannot have a dedicated unit test. This patch moves the
declaration to the corresponding *.hh file.
2020-09-10 12:20:32 +02:00