scylladb

Author	SHA1	Message	Date
Raphael S. Carvalho	d61b4f9dfb	compaction_manager: Delete compaction_state's move constructor compaction_state shouldn't be moved once emplaced. moving it could theoretically cause task's gate holder to have a dangling pointer to compaction_state's gate, but turns out gate's move ctor will actually fail under this assertion: assert(!_count && "gate reassigned with outstanding requests"); Cannot happen today, but let's make it more future proof. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #12167	2022-12-02 20:56:57 +03:00
Avi Kivity	f565db75ce	compaction: don't compare signed and unsigned compaction counts gcc warns as this can lead to incorrect results. Cast the threshold to an unsigned type (we know it's positive at this point) to avoid the warning.	2022-11-28 21:41:56 +02:00
Benny Halevy	8b81635d95	compaction: refactor dht::subtract_ranges out of get_ranges_for_invalidation The algorithm is generic and can be used elsewhere. Add a unit test for the function before it gets optimized in the following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-21 15:48:26 +02:00
Benny Halevy	7c6f60ae72	compaction_manager: needs_cleanup: get first/last tokens from sstable decorated keys Currently, the function is inefficient in two ways: 1. unnecessary copy of first/last keys to automatic variables 2. redecorating the partition keys with the schema passed to needs_cleanup. We can just use the tokens from the sstable first/last decorated keys. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-21 15:44:32 +02:00
Avi Kivity	994603171b	Merge 'Add validator to the mutation compactor' from Botond Dénes Fragment reordering and fragment dropping bugs have been plaguing us since forever. To fight them we added a validator to the sstable write path to prevent really messed up sstables from being written. This series adds validation to the mutation compactor. This will cover reads and compaction among others, hopefully ridding us of such bugs on the read path too. This series fixes some benign looking issues found by unit tests after the validator was added -- although how benign a producer emitting two partition-ends depends entirely on how the consumer reacts to it, so no such bug is actually benign. Fixes: https://github.com/scylladb/scylladb/issues/11174 Closes #11532 * github.com:scylladb/scylladb: mutation_compactor: add validator mutation_fragment_stream_validator: add a 'none' validation level test/boost/mutation_query_test: test_partition_limit: sort input data querier: consume_page(): use partition_start as the sentinel value treewide: use ::for_partition_end() instead of ::end_of_partition_tag_t{} treewide: use ::for_partition_start() instead of ::partition_start_tag_t{} position_in_partition: add for_partition_{start,end}()	2022-11-20 20:33:26 +02:00
Aleksandra Martyniuk	7ead1a7857	compaction: request abort only once in compaction_data::stop compaction_manager::task (and thus compaction_data) can be stopped because of many different reasons. Thus, abort can be requested more than once on compaction_data abort source causing a crash. To prevent this before each request_abort() we check whether an abort was requested before. Closes #12004	2022-11-17 12:44:59 +02:00
Raphael S. Carvalho	b88acffd66	replica: Allow one compaction_backlog_tracker for each compaction_group Today, compaction_backlog_tracker is managed in each compaction_strategy implementation. So every compaction strategy is managing its own tracker and providing a reference to it through get_backlog_tracker(). But this prevents each group from having its own tracker, because there's only a single compaction_strategy instance per table. To remove this limitation, compaction_strategy impl will no longer manage trackers but will instead provide an interface for trackers to be created, such that each compaction group will be allowed to have its own tracker, which will be managed by compaction manager. On compaction strategy change, table will update each group with the new tracker, which is created using the previously introduced compaction_group_sstable_set_updater. Now table's backlog will be the sum of all compaction_group backlogs. The normalization factor is applied on the sum, so we don't have to adjust each individual backlog to any factor. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:22:51 -03:00
Raphael S. Carvalho	d862dd815c	compaction: Make compaction_state available for compaction tasks being stopped compaction_backlog_tracker will be managed by compaction_manager, in the per table state. As compaction tasks can access the tracker throughout its lifetime, remove() can only deregister the state once we're done stopping all tasks which map to that state. remove() extracted the state upfront, then performed the stop, to prevent new tasks from being registered and left behind. But we can avoid the leak of new tasks by only closing the gate, which waits for all tasks (which are stopped a step earlier) and once closed, prevents new tasks from being registered. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:22:51 -03:00
Raphael S. Carvalho	0a152a2670	compaction: Implement move assignment for compaction_backlog_tracker That's needed for std::optional to work on its behalf. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:22:49 -03:00
Raphael S. Carvalho	fe305cefd0	compaction: Fix compaction_backlog_tracker move ctor Luckily it's not used anywhere. Default move ctor was picked but it won't clear _manager of old object, meaning that its destructor will incorrectly deregister the tracker from compaction_backlog_manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:37 -03:00
Raphael S. Carvalho	8e1e30842d	compaction: Use table_state's backlog tracker in compaction_read_monitor_generator A step closer towards a separate backlog tracker for each compaction group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:37 -03:00
Raphael S. Carvalho	fedafd76eb	compaction: kill undefined get_unimplemented_backlog_tracker() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:37 -03:00
Raphael S. Carvalho	244efddb22	Fix exception safety when transferring ongoing charges to new backlog tracker When setting a new strategy, the charges of the old tracker are transferred to the new one. The problem is that we're not reverting changes if an exception is triggered before the new strategy is successfully set. To fix this exception safety issue, let's copy the charges instead of moving them. If an exception is triggered, the old tracker is still the one used and remains intact. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:37 -03:00
Raphael S. Carvalho	1ec0ef18a5	compaction/table_state: Introduce get_backlog_tracker() This interface will be helpful for allowing replica::table, unit tests and sstables::compaction to access the compaction group's tracker which will be managed by the compaction manager, once we complete the decoupling work. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-11-11 09:17:36 -03:00
Botond Dénes	f1a039fc2b	treewide: use ::for_partition_start() instead of ::partition_start_tag_t{} We just added a convenience static factory method for partition start, change the present users of the clunky constructor+tag to use it instead.	2022-11-11 09:58:18 +02:00
Botond Dénes	3aff59f189	Merge 'staging sstables: filter tokens for view update generation' from Benny Halevy This mini-series introduces dht::tokens_filter and uses it for consuming staging sstable in the view_update_generator. The tokens_filter uses the token ranges owned by the current node, as retrieved by get_keyspace_local_ranges. Refs #9559 Closes #11932 * github.com:scylladb/scylladb: db: view_update_generator: always clean up staging sstables compaction: extract incremental_owned_ranges_checker out to dht	2022-11-10 07:00:51 +02:00
Benny Halevy	fd3e66b0cc	compaction: extract incremental_owned_ranges_checker out to dht It is currently used by cleanup_compaction partition filter. Factor it out so it can be used to filter staging sstables in the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-11-09 07:32:56 +02:00
Raphael S. Carvalho	a57724e711	Make off-strategy compaction wait for view building completion Prior to off-strategy compaction, streaming / repair would place staging files into main sstable set, and wait for view building completion before they could be selected for regular compaction. The reason for that is that view building relies on table providing a mutation source without data in staging files. Had regular compaction mixed staging data with non-staging one, table would have a hard time providing the required mutation source. After off-strategy compaction, staging files can be compacted in parallel to view building. If off-strategy completes first, it will place the output into the main sstable set. So a parallel view building (on sstables used for off-strategy) may potentially get a mutation source containing staging data from the off-strategy output. That will mislead view builder as it won't be able to detect changes to data in main directory. To fix it, we'll do what we did before. Filter out staging files from compaction, and trigger the operation only after we're done with view building. We're piggybacking on off-strategy timer for still allowing the off-strategy to only run at the end of the node operation, to reduce the amount of compaction rounds on the data introduced by repair / streaming. Fixes #11882. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11919	2022-11-08 08:53:58 +02:00
Pavel Emelyanov	907fd2d355	system_keyspace: De-static compaction history update Compaction manager now has the weak reference on the system keyspace object and can use it to update its stats. It only needs to take care and keep the shared pointer until the respective future resolves. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 16:20:59 +03:00
Pavel Emelyanov	3e0b61d707	compaction_manager: Relax history paths There's a virtual method on table_state to update the entry in system keyspace. It's an overkill to facilitate tests that don't want this. With new system_keyspace weak referencing it can be made simpler by moving the updating call to the compaction_manager itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 16:20:59 +03:00
Pavel Emelyanov	f9b57df471	database: Plug/unplug system_keyspace There's a circular dependency between system_keyspace and database. The former needs the latter because it needs to execute local requests via query_processor. The latter needs the former via compaction manager and large data handler, database depends on both and these too need to insert their entries into system keyspace. To cut this loop the compaction manager and large data handler both get a weak reference on the system keyspace. Once system keyspace starts it activates this reference via the database call. When system keyspace is shut down on stop, it deactivates the reference. Technically the weak reference is implemented by marking the system_k.s. object as async_sharded_service, and the "reference" in question is the shared_from_this() pointer. When compaction manager or large data handler need to update a system keyspace's table, they both hold an extra reference on the system keyspace until the entry is committed, thus making sure that sys._k.s. doesn't stop from under their feet. At the same time, unplugging the reference on shutdown makes sure that no new entries update will appear and the system_k.s. will eventually be released. It's not a C++ classical reference, because system_keyspace starts after and stops before database. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-10 16:20:59 +03:00
Raphael S. Carvalho	14d6459efc	compaction: Make compaction_manager stop more robust Commit `aba475fe1d` accidentally fixed a race, which happens in the following sequence of events: 1) storage service starts drain() via API for example 2) main's abort source is triggered, calling compaction_manager's do_stop() via subscription. 2.1) do_stop() initiates the stop but doesn't wait for it. 2.2) compaction_manager's state is set to stopped, such that compaction_manager::stop() called in defer_verbose_shutdown() will wait for the stop and not start a new one. 3) drain() calls compaction_manager::drain() changing the state from stopped to disabled. 4) main calls compaction_manager::stop() (as described in 2.2) and incorrectly tries to stop the manager again, because the state was changed in step 3. `aba475fe1d` accidentally fixed this problem because drain() will no longer take place if it detects the shutdown process was initiated (it does so by ignoring drain request if abort source's subscription was unlinked). This shows us that looking at the state to determine if stop should be performed is fragile, because once the state changes from A to B, manager doesn't know the state was A. To make it robust, we can instead check if the future that stores stop's promise is engaged, meaning that the stop was already initiated and we don't have to start a new one. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11711	2022-10-06 13:49:26 +02:00
Pavel Emelyanov	d22b130af1	compaction_manager: Swallow ENOSPCs in ::stop() When being stopped compaction manager may step on ENOSPC. This is not a reason to fail stopping process with abort, better to warn this fact in logs and proceed as if nothing happened refs: #11245 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-03 18:54:48 +03:00
Taras Borodin	c155ae1182	add utf8:validate to operator<< partition_key with_schema.	2022-09-22 16:42:31 +03:00
Botond Dénes	05ef13a627	Merge 'Add support to split large partitions across SSTables' from Raphael "Raph" Carvalho Introduces support to split large partitions during compaction. Today, compaction can only split input data at partition boundary, so a large partition is stored in a single file. But that can cause many problems, like memory pressure (e.g.: https://github.com/scylladb/scylladb/issues/4217), and incremental compaction can also not fulfill its promise as the file storing the large partition can only be released once exhausted. The first step was to add clustering range metadata for first and last partition keys (retrieved from promoted index), which is crucial to determine disjointness at clustering level, and also the order at which the disjoint files should be opened for incremental reading. The second step was to extend sstable_run to look at clustering dimension, so a set of files storing disjoint ranges for the same partition can live in the same sstable run. The final step was to introduce the option for compaction to split large partition being written if it has exceeded the size threshold. What's next? Following this series, a reader will be implemented for sstable_run that will incrementally open the readers. It can be safely built on the assumption of the disjoint invariant after the second step aforementioned. Closes #11233 * github.com:scylladb/scylladb: test: Add test for large partition splitting on compaction compaction: Add support to split large partitions sstable: Extend sstable_run to allow disjointness on the clustering level sstables: simplify will_introduce_overlapping() test: move sstable_run_disjoint_invariant_test into sstable_datafile_test test: lib: Fix inefficient merging of mutations in make_sstable_containing() sstables: Keep track of first partition's first pos and last partition's last pos sstables: Rename min/max position_range to a descriptive name sstables_manager: Add sstable metadata reader concurrency semaphore sstables: Add ability to find first or last position in a partition	2022-09-15 16:08:56 +03:00
Raphael S. Carvalho	a04047f390	compaction: Properly handle stop request for off-strategy If user stops off-strategy via API, compaction manager can decide to give up on it completely, so data will sit unreshaped in maintenance set, preventing it from being compacted with data in the main set. That's problematic because it will probably lead to a significant increase in read and space amplification until off-strategy is triggered again, which cannot happen anytime soon. Let's handle it by moving data in maintenance set into main one, even if unreshaped. Then regular compaction will be able to continue from where off-strategy left off. Fixes #11543. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #11545	2022-09-15 09:21:22 +03:00
Raphael S. Carvalho	e2ccafbe38	compaction: Add support to split large partitions Adds support for splitting large partitions during compaction. Large partitions introduce many problems, like memory overhead and breaking the incremental compaction promise. We want to split large partitions across fixed-size fragments. We'll allow a partition to exceed size limit by 10%, as we don't want to unnecessarily split partitions that just crossed the limit boundary. To avoid having to open a minimum of 2 fragments in a read, partition tombstone will be replicated to every fragment storing the partition. The splitting isn't enabled by default, and can be used by strategies that are run aware like ICS. LCS still cannot support it as it's still using physical level metadata, not run id. An incremental reader for sstable runs will follow soon. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:23:16 -03:00
Nadav Har'El	8ece63c433	Merge 'Safemode - Introduce TimeWindowCompactionStrategy Guardrails' This series introduces two configurable options when working with TWCS tables: - `restrict_twcs_default_ttl` - a LiveUpdate-able tri_mode_restriction which defaults to WARN and will notify the user whenever a TWCS table is created without a `default_time_to_live` setting - `twcs_max_window_count` - Which forbids the user from creating TWCS tables whose window count (buckets) are past a certain threshold. We default to 50, which should be enough for most use cases, and a setting of 0 effectively disables the check. Refs: #6923 Fixes: #9029 Closes #11445 * github.com:scylladb/scylladb: tests: cql_query_test: add mixed tests for verifying TWCS guard rails tests: cql_query_test: add test for TWCS window size tests: cql_query_test: add test for TWCS tables with no TTL defined cql: add configurable restriction of default_time_to_live when for TimeWindowCompactionStrategy tables cql: add max window restriction for TimeWindowCompactionStrategy time_window_compaction_strategy: reject invalid window_sizes cql3 - create/alter_table_statement: Make check_restricted_table_properties accept a schema_ptr	2022-09-12 23:55:51 +03:00
Felipe Mendes	f1ffb501f0	time_window_compaction_strategy: reject invalid window_sizes Scylla mistakenly allows a user to configure an invalid TWCS window_size <= 0, which effectively breaks the notion of compaction windows. Interestingly enough, a <= 0 window size should be considered undefined behavior, as either we would create a new window every 0 duration (?) or the table would behave as STCS; the reader is encouraged to figure out which one of these is true. :-) Cassandra, on the other hand, will properly throw a ConfigurationException when receiving such invalid window sizes, and we now match Cassandra's behavior. Refs: #2336	2022-09-11 16:40:03 -03:00
Raphael S. Carvalho	44913ebbd0	compaction_manager: restore indentation Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Raphael S. Carvalho	888660fa44	compaction_manager: Make remove() and stop_ongoing_compactions() noexcept stop_ongoing_compactions() is made noexcept too as it's called from remove() and we want to make the latter noexcept, to allow compaction group to qualify its stop function as noexcept too. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Benny Halevy	e9cfe9e572	tombstone_gc: deglobalize repair_history_maps Move the thread-local instances of the per-table repair history maps into compaction_manager. Fixes #11208 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:15 +03:00
Benny Halevy	d86810d22c	mutation_partition: compact_for_compaction_v2: get tombstone_gc_state To be passed down to compact_mutation_state in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:15 +03:00
Benny Halevy	7e4612d3aa	mutation_readers: pass tombstone_gc_state to compacting_reader To be passed further down to `compact_mutation_state` in a following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:14 +03:00
Benny Halevy	572d534d0d	sstables: get_gc_before_: get tombstone_gc_state from caller Pass the tombstone_gc_state from the compaction_strategy to sstables get_gc_before_ functions using the table state to get to the tombstone_gc_state. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:05:39 +03:00
Benny Halevy	2cd3fc2f36	compaction: table_state: add virtual get_tombstone_gc_state method and override it in table::table_state to get the tombstone_gc_state from the table's compaction_manager. It is going to be used in the next patches to pass the gc state from the compaction_strategy down to sstables and compaction. table_state_for_test was modified to just keep a null tombstone_gc_state. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:05:39 +03:00
Benny Halevy	8b841e1207	compaction_manager: add tombstone_gc_state Add a tombstone_gc_state member and methods to get it. Currently the tombstone_gc_state is default constructed, but a following patch will move the thread-local repair history maps into the compaction_manager as a member and then the _tombstone_gc_state member will be initialized from that member. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-06 23:02:54 +03:00
Botond Dénes	b89b84ad3c	compaction: scrub/abort: be more verbose Currently abort-mode scrub exits with a message which basically says "some problem was found", with no details on what problem it found. Add a detailed error report on the found problem before aborting the scrub. Closes #11418	2022-09-06 11:42:34 +03:00
Benny Halevy	7747b8fa33	sstables: define run_identifier as a strong tagged_uuid type Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11321	2022-08-18 19:03:10 +03:00
Benny Halevy	14faa3b6f4	compaction_manager: perform_cleanup, perform_sstable_upgrade: use a lw_shared_ptr for owned token ranges And completely get rid of the dependency on replica::database. Also, add respective rest_api tests. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 08:08:11 +03:00
Benny Halevy	e1fe598760	compaction: cleanup, upgrade: use a lw_shared_ptr for owned token ranges Currently they are copied for the get_sstables function so this change reduces copies. Also, it will allow further decoupling of compaction_manager from replica::database, by letting the caller of perform_cleanup and perform_sstable_upgrade get the owned token ranges from db and pass it to the perform_* functions in the following patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:57:41 +03:00
Benny Halevy	e4e92d44ae	main: start compaction_manager as a sharded service And pass a reference to it to the database rather than having the database construct its own compaction_manager. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:50:15 +03:00
Benny Halevy	7f70949693	compaction_manager: keep config as member Rather than keeping separate, duplicated members. And define helpers to get those members. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:48:01 +03:00
Benny Halevy	450ecd60c6	backlog_controller: scheduling_group: define default member initializers To prepare for the next patch, implement default initialization of the scheduling_group and io_priority_class, to the default values. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:38:40 +03:00
Aleksandra Martyniuk	6ea5bc96d7	scrub compaction: return status indicating aborted operations over the rest api Performing compaction scrub user did not know whether an operation was aborted. If compaction scrub is aborted, return status the user gets over rest api is set to 1.	2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk	f1980f8dc6	scrub compaction: count validation errors and return status over the rest api When performing compaction scrub, the user did not know whether any validation errors were encountered. The number of validation errors per given compaction scrub is gathered and summed from each shard. Based on that value, the return status over the rest api is set to 3 if any validation errors were encountered.	2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk	7d457cffb8	scrub compaction: count validation errors for specific scrub task The number of validation errors per given compaction scrub on given shard is passed up to perform_task() function.	2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk	3a805a9d9b	compaction: extract statistics in compaction_result Statistics from compaction_result are extracted to new struct compaction_stats and stored as a field of compaction_result.	2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk	a80c187b20	scrub compaction: register validation errors in metrics The number of validation errors is registered in metrics. Metrics provide common counters for all scrub operation within a compaction manager, though. Thus, to check the exact number of validation errors, the comparison of counters before and after scrub operation needs to be done.	2022-07-29 09:35:20 +02:00
Aleksandra Martyniuk	ab85dab05d	scrub compaction: count validation errors The number of validation errors encountered during scrub compaction is counted.	2022-07-29 09:35:20 +02:00


458 Commits