scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-07 23:43:31 +00:00

Author	SHA1	Message	Date
Raphael S. Carvalho	5d654a6b9a	compaction: don't copy owned ranges in cleanup ctor Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20220119142322.39791-1-raphaelsc@scylladb.com>	2022-01-20 14:05:58 +02:00
Avi Kivity	fcb8d040e8	treewide: use Software Package Data Exchange (SPDX) license identifiers Instead of lengthy blurbs, switch to single-line, machine-readable standardized (https://spdx.dev) license identifiers. The Linux kernel switched long ago, so there is strong precedent. Three cases are handled: AGPL-only, Apache-only, and dual licensed. For the latter case, I chose (AGPL-3.0-or-later and Apache-2.0), reasoning that our changes are extensive enough to apply our license. The changes we applied mechanically with a script, except to licenses/README.md. Closes #9937	2022-01-18 12:15:18 +01:00
Botond Dénes	a7f4ab6b14	compaction/compaction: remove v1 version of validate and scrub reader factory methods	2022-01-14 10:19:56 +02:00
Botond Dénes	d57634ad46	compaction: use v2 version of mutation_writer::segregate_by_partition()	2022-01-14 08:54:26 +02:00
Botond Dénes	b315d17c2a	compaction: migrate scrub and validate to v2 We add v2 version of external API but leave the old v1 in place to help incremental migration. The implementation is migrated to v2.	2022-01-14 08:54:26 +02:00
Botond Dénes	15d8ea983e	compaction: upgrade compaction::make_interposer_consumer() to v2 Almost all (except the scrub one) actual interposer consumers are v2.	2022-01-07 13:52:14 +02:00
Botond Dénes	aa3c943f4c	mutation_reader: remove unecessary stable_flattened_mutations_consumer Said wrapper was conceived to make unmovable `compact_mutation` because readers wanted movable consumers. But `compact_mutation` is movable for years now, as all its unmovable bits were moved into an `lw_shared_ptr<>` member. So drop this unnecessary wrapper and its unnecessary usages.	2022-01-07 13:52:07 +02:00
Botond Dénes	1ba19c2aa4	compaction/compaction_strategy: convert make_interposer_consumer() to v2 The underlying timestamp-based splitter is v2 already.	2022-01-07 13:51:59 +02:00
Botond Dénes	0601a465a2	mutation_writer: migrate shard_based_splitting_writer to v2	2022-01-07 13:48:53 +02:00
Asias He	a8ad385ecd	repair: Get rid of the gc_grace_seconds The gc_grace_seconds is a very fragile and broken design inherited from Cassandra. Deleted data can be resurrected if cluster wide repair is not performed within gc_grace_seconds. This design pushes the job of making the database consistency to the user. In practice, it is very hard to guarantee repair is performed within gc_grace_seconds all the time. For example, repair workload has the lowest priority in the system which can be slowed down by the higher priority workload, so that there is no guarantee when a repair can finish. A gc_grace_seconds value that is used to work might not work after data volume grows in a cluster. Users might want to avoid running repair during a specific period where latency is the top priority for their business. To solve this problem, an automatic mechanism to protect data resurrection is proposed and implemented. The main idea is to remove the tombstone only after the range that covers the tombstone is repaired. In this patch, a new table option tombstone_gc is added. The option is used to configure tombstone gc mode. For example: 1) GC a tombstone after gc_grace_seconds cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'timeout'} ; This is the default mode. If no tombstone_gc option is specified by the user. The old gc_grace_seconds based gc will be used. 2) Never GC a tombstone cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'disabled'}; 3) GC a tombstone immediately cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'immediate'}; 4) GC a tombstone after repair cqlsh> ALTER TABLE ks.cf WITH tombstone_gc = {'mode':'repair'}; In addition to the 'mode' option, another option 'propagation_delay_in_seconds' is added. It defines the max time a write could possibly delay before it eventually arrives at a node. A new gossip feature TOMBSTONE_GC_OPTIONS is added. The new tombstone_gc option can only be used after the whole cluster supports the new feature. A mixed cluster works with no problem. Tests: compaction_test.py, ninja test Fixes #3560 [avi: resolve conflicts vs data_dictionary]	2022-01-04 19:48:14 +02:00
Raphael S. Carvalho	e05859c3f9	compaction: kill unused code for resharding_compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211217162728.114936-2-raphaelsc@scylladb.com>	2021-12-20 18:21:31 +02:00
Raphael S. Carvalho	d1f2fd7f03	compaction: rename compacting_sstable_writer to compacted_fragments_writer the name compacting_sstable_writer is misleading as it doesn't perform any compaction. let's rename it to a name that reflects more what it does. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211217162728.114936-1-raphaelsc@scylladb.com>	2021-12-20 18:21:31 +02:00
Benny Halevy	c89876c975	compaction: scrub_validate_mode_validate_reader: throw compaction_stopped_exception if stop is requested Currently when scrub/validate is stopped (e.g. via the api), scrub_validate_mode_validate_reader co_return:s without closing the reader passed to it - causing a crash due to internal error check, see #9766. Throwing a compaction_stopped_exception rather than co_return:ing an exception will be handled as any other exception, including closing the reader. Fixes #9766 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20211213125528.2422745-1-bhalevy@scylladb.com>	2021-12-14 11:15:23 +02:00
Raphael S. Carvalho	9b8aa1e9ae	compaction: Move mutation compaction into producer for TWCS If interposer is enabled, like the timestamp-based one for TWCS, data from different buckets (e.g. windows) cannot be compacted together because mutation compaction happens inside each consumer, where each consumer will belong to a different bucket. To remove this limitation, let's move the mutation compactor from consumer into producer, such that compacted data will be fed into the interposer, before it segregates data. We're short-circuiting this logic if TWCS isn't in use as compacting reader adds overhead to compaction, given that this reader will pop fragments from combined sstable reader, compact them using mutation_compactor and finally push them out to the underlying reader. without compacting reader (e.g. STCS + no interposer): 228255.92 +- 1519.53 partitions / sec (50 runs, 1 concurrent ops) 224636.13 +- 1165.05 partitions / sec (100 runs, 1 concurrent ops) 224582.38 +- 1050.71 partitions / sec (100 runs, 1 concurrent ops) with compacting reader (e.g. TWCS + interposer): 221376.19 +- 1282.11 partitions / sec (50 runs, 1 concurrent ops) 216611.65 +- 985.44 partitions / sec (100 runs, 1 concurrent ops) 215975.51 +- 930.79 partitions / sec (100 runs, 1 concurrent ops) So the cost of compacting data across buckets is ~3.5%, which happens only with interposer enabled and GC writer disabled. Fixes #9662. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-10 17:14:44 -03:00
Raphael S. Carvalho	484269cd8f	compaction: make enable_garbage_collected_sstable_writer() more precise we only want to enable GC writer if incremental compaction is required. let's make it more precise by checking that size limit for sstable isn't disabled, so GC writer will only be enabled for compaction strategies that really need it. So strategies that don't need it won't pay the penalty. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-10 15:22:08 -03:00
Mikołaj Sielużycki	504efe0607	table: Prevent resurrecting data from memtable on compaction Mutations are not guaranteed to come in the order of their timestamps. If there is an expired tombstone in the sstable and a repair inserts old data into memtable, the compaction would not consider memtable data and purge the tombstone leading to data resurrection. The solution is to disallow purging tombstones newer than min memtable timestamp.	2021-12-09 13:22:14 +01:00
Botond Dénes	2e5440bdf2	Merge 'Convert compaction to flat_mutation_reader_v2' from Raphael Carvalho Since sstable reader was already converted to flat_mutation_reader_v2, compaction layer can naturally be converted too. There are many dependencies that use v1. Those strictly needed like readers in sstable set, which links compaction to sstable reader, were converted to v2 in this series. For those that aren't essential we're relying on V1<-->V2 adaptors, and conversion work on them will be postponed. Those being postponed are: scrub specialized reader (needs a validator for mutation_fragment_v2), interposer consumer, combined reader which is used by incremental selector. incremental selector itself was converted to v2. tests: unit(debug). Closes #9725 * github.com:scylladb/scylla: compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2 sstable_set: update make_crawling_reader() to flat_mutation_reader_v2 sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2 sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2 sstable_set: update incremental_reader_selector to flat_mutation_reader_v2	2021-12-07 15:17:38 +02:00
Raphael S. Carvalho	2435bd14c6	compaction: update compaction::make_sstable_reader() to flat_mutation_reader_v2 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-07 09:37:57 -03:00
Raphael S. Carvalho	c6399005a3	sstable_set: update make_crawling_reader() to flat_mutation_reader_v2 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-07 09:37:55 -03:00
Raphael S. Carvalho	aebbe68239	sstable_set: update make_range_sstable_reader() to flat_mutation_reader_v2 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-07 09:37:53 -03:00
Raphael S. Carvalho	c3c070a5ca	sstable_set: update make_local_shard_sstable_reader() to flat_mutation_reader_v2 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-07 09:37:51 -03:00
Avi Kivity	395b30bca8	mutation_reader: update make_filtering_reader() to flat_mutation_reader_v2 As part of the drive to move over to flat_mutation_reader_v2, update make_filtering_reader(). Since it doesn't examine range tombstones (only the partition_start, to filter the key) the entire patch is just glue code upgrading and downgrading users in the pipeline (or removing a conversion, in one case). Test: unit (dev) Closes #9723	2021-12-07 12:18:07 +02:00
Benny Halevy	cc122984d6	compaction: scrub: add quarantine_mode option Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:29:04 +02:00
Benny Halevy	bbe275f37d	compaction: scrub_sstables_validate_mode: quarantine invalid sstables When invalid sstables are detected, move them to the quarantine subdirectory so they won't be selected for regular compaction. Refs #7658 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:14:16 +02:00
Benny Halevy	13e7b00f2e	sstables: add is_quarantined Quarantined sstables will reside in a "quarantine" subdirectory and are also not eligible for regular compaction. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:00:44 +02:00
Benny Halevy	07c5ddf182	sstables: add is_eligible_for_compaction Currently compaction_manager tracks sstables based on !requires_view_building() and similarly, table::in_strategy_sstables picks up only sstables that are not in staging. is_eligible_for_compaction() generalizes this condition in preparation for adding a quarantine subdirectory for invalid sstables that should not be compacted as well. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-12-05 18:00:44 +02:00
Raphael S. Carvalho	0e3d388ebb	compaction: Log skip of fully expired sstables Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-03 12:25:48 -03:00
Raphael S. Carvalho	7a7a2467fa	compaction: kill useless on_skipped_expired_sstable() It was introduced by commit `5206a97915` because fully expired sstable wouldn't be registered and therefore could be never removed from backlog tracker. This is no longer possible as table is now responsible for removing all input sstables. So let's kill on_skipped_expired_sstable() as it's now only boilerplate we don't need. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-03 12:19:29 -03:00
Raphael S. Carvalho	32c2534e91	compaction: merge _total_input_sstables and _ancestors Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-12-03 12:19:23 -03:00
Raphael S. Carvalho	3006394312	compaction: Allow incremental compaction with interposer consumer Until commit `c94e6f8567`, interposer consumer wouldn't work with our GC writer, needed for incremental compaction correctness. Now that the technical debt is gone, let's allow incremental compaction with interposer consumer. The only change needed is serialization of replacer as two consumers cannot step on each toe, like when we have concurrent bucket writers with TWCS. sstable_compaction_test.test_bug_6472 passes with this change, which was added when #6472 was fixed by not allowing incremental compaction with interposer consumer. Refs #6472. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211126191000.43292-1-raphaelsc@scylladb.com>	2021-11-30 15:24:17 +02:00
Raphael S. Carvalho	06405729ce	compaction: stop including database.hh after switching to table_state, compaction code can finally stop including database.hh Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-19 22:06:03 -03:00
Raphael S. Carvalho	69ab5c9dff	compaction: switch to table_state in get_fully_expired_sstables() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-19 22:06:02 -03:00
Raphael S. Carvalho	d89edad9fb	compaction: switch to table_state Make compaction procedure switch to table_state. Only function in compaction.cc still directly using table is get_fully_expired_sstables(T,...), but subsequently we'll make it switch to table_state and then we can finally stop including database.hh in the compaction code. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-19 22:06:01 -03:00
Raphael S. Carvalho	c94e6f8567	compaction: Merge GC writer into regular compaction writer Turns out most of regular writer can be reused by GC writer, so let's merge the latter into the former. We gain a lot of simplification, lots of duplication is removed, and additionally, GC writer can now be enabled with interposer as it can be created on demand by each interposer consumer (will be done in a later patch). Refs #6472. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211119120841.164317-1-raphaelsc@scylladb.com>	2021-11-19 14:19:50 +02:00
Raphael S. Carvalho	4b1bb26d5a	compaction: Make maybe_replace_exhausted_sstables_by_sst() more robust Make it more robust by tracking both partial and sealed sstables. This way, maybe_r__e__s__by_sst() won't pick partial sstables as part of incremental compaction. It works today because interposer consumer isn't enabled with incremental compaction, so there's a single consumer which will have sealed the sstable before the function for early replacement is called, but the story is different if both is enabled. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211117135817.16274-1-raphaelsc@scylladb.com>	2021-11-17 17:21:53 +02:00
Avi Kivity	bc75e2c1d1	treewide: wrap runtime formats with fmt::runtime for fmt 8 fmt 8 checks format strings at compile time, and requires that non-compile-time format strings be wrapped with fmt::runtime(). Do that, and to allow coexistence with fmt 7, supply our own do-nothing version of fmt::runtime() if needed. Strictly speaking we shouldn't be introducing names into the fmt namespace, but this is transitional only. Closes #9640	2021-11-17 15:21:36 +02:00
Raphael S. Carvalho	29df862f57	compaction: make table param of get_fully_expired_sstables() const Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-09 10:41:54 -03:00
Raphael S. Carvalho	5af9a690c1	compaction: move incremental_owned_ranges_checker into cleanup_compaction let's move checker into cleanup as it's not needed elsewhere. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 09:49:44 -03:00
Raphael S. Carvalho	04ef2124c6	compaction: make owned ranges const in cleanup_compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 09:47:12 -03:00
Raphael S. Carvalho	d86c2491d4	compaction: replace outdated comment in regular_compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 09:45:34 -03:00
Raphael S. Carvalho	b344db1696	compaction: give a more descriptive name to compaction_data info is no longer descriptive, as compaction now works with compaction_data instead of compaction_info. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-11-04 09:43:08 -03:00
Raphael S. Carvalho	ab0217e30e	compaction: Improve overall efficiency by not diluting it with relatively inefficient jobs Compaction efficiency can be defined as how much backlog is reduced per byte read or written. We know a few facts about efficiency: 1) the more files are compacted together (the fan-in) the higher the efficiency will be, however... 2) the bigger the size difference of input files the worse the efficiency, i.e. higher write amplification. so compactions with similar-sized files are the most efficient ones, and its efficiency increases with a higher number of files. However, in order to not have bad read amplification, number of files cannot grow out of bounds. So we have to allow parallel compaction on different tiers, but to avoid "dilution" of overall efficiency, we will only allow a compaction to proceed if its efficiency is greater than or equal to the efficiency of ongoing compactions. By the time being, we'll assume that strategies don't pick candidates with wildly different sizes, so efficiency is only calculated as a function of compaction fan-in. Now when system is under heavy load, then fan-in threshold will automatically grow to guarantee that overall efficiency remains stable. Please note that fan-in is defined in number of runs. LCS compaction on higher levels will have a fan-in of 2. Under heavy load, it may happen that LCS will temporarily switch to size-tiered mode for compaction to keep up with amount of data being produced. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Message-Id: <20211103215110.135633-2-raphaelsc@scylladb.com>	2021-11-03 20:03:23 +02:00
Botond Dénes	6ad0a2989c	compaction/scrub: segregate input only in segregate mode scrub_compaction assumes that `make_interposer_consumer()` is called only when `use_interposer_consumer()` returns true. This is false, so in effect scrub always ends up using the segregating interposer. Fix this by short-circuiting the former method when the latter returns true, returning the passed-in consumer unchanged. Tests: unit(dev) Fixes #9541 Closes #9564	2021-11-02 15:25:22 +02:00
Botond Dénes	eaf4454ac8	compaction: scrub_compaction: add bucket count to finish message It is useful to know how many buckets (output sstables) scrub produced in total. The end compaction message will only report those still open when the scrub finished, but will omit those that were closed in the middle.	2021-11-02 12:24:37 +02:00
Botond Dénes	f2f529855d	compaction,test: use the new in-memory segregator for scrub	2021-11-02 09:00:44 +02:00
Benny Halevy	5483269dfb	compaction_manager: pass owned_ranges via cleanup/upgrade options So they can be easily computed using an async task before constructing the compaction object in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2021-10-13 14:17:46 +03:00
Botond Dénes	cc65c9d0da	compaction: scrub/segregate: adjust partition-estimate as buckets accumulate Scrub compaction in segregate mode can split the input sstable into as many as hundreds or even thousands of output sstables in the extreme case. But even at a few dozen output sstables, most of these will only have a few partitions with a few rows. These sstables however will still have their bloom filter allocated according to the original partition-count estimate, causing memory bloat or even OOM in the extreme case. This patch solves this by aggressively adjusting the partition count downwards after the second bucket has been created. Each subsequent bucket will halve the partition estimate, which will quickly reach 1. Fixes: #9463 Closes #9464	2021-10-12 12:44:42 +03:00
Avi Kivity	1bac93e075	Merge "simplifications and layer violation fix for compaction manager" from Raphael "This series removes layer violation in compaction, and also simplifies compaction manager and how it interacts with compaction procedure." * 'compaction_manager_layer_violation_fix/v4' of github.com:raphaelsc/scylla: compaction: split compaction info and data for control compaction_manager: use task when stopping a given compaction type compaction: remove start_size and end_size from compaction_info compaction_manager: introduce helpers for task compaction_manager: introduce explicit ctor for task compaction: kill sstables field in compaction_info compaction: kill table pointer in compaction_info compaction: simplify procedure to stop ongoing compactions compaction: move management of compaction_info to compaction_manager compaction: move output run id from compaction_info into task	2021-10-04 13:09:31 +03:00
Botond Dénes	61e7d3de90	Merge 'Cleanup compaction_stop_exception' from Benny Halevy The gist of this series is splitting `compaction_abort_exception` from `compaction_stop_exception` and their respective error messages to differentiate between compaction being stopped due to e.g. shutdown or api event vs. compaction aborting due to scrub validation error. While at it, cleanup the existing retry logic related to `compaction_stop_exception`. Test: unit(dev) Dtest: nodetool_additional_test.py:TestNodetool.{{scrub,validate}_sstable_with_invalid_fragment_test,{scrub,validate}_ks_sstable_with_invalid_fragment_test,{scrub,validate}_with_one_node_expect_data_loss_test} (dev, w/ https://github.com/scylladb/scylla-dtest/pull/2267) Closes #9321 * github.com:scylladb/scylla: compaction: split compaction_aborted_exception from compaction_stopped_exception compaction_manager: maybe_stop_on_error: rely on retry=false default compaction_manager: maybe_stop_on_error: sync return value with error message. compaction: drop retry parameter from compaction_stop_exception compaction_manager: move errors stats accounting to maybe_stop_on_error	2021-10-04 07:27:11 +03:00
Raphael S. Carvalho	9067a13eac	compaction: split compaction info and data for control compaction_info must only contain info data to be exported to the outside world, whereas compaction_data will contain data for controlling compaction behavior and stats which change as compaction progresses. This separation makes the interface clearer, also allowing for future improvements like removing direct references to table in compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-30 13:16:57 -03:00

96 Commits