scylladb

Author	SHA1	Message	Date
Avi Kivity	5e764d1de2	Merge 'Drop v2 and flat from reader and related names' from Botond Dénes Following a number of similar code cleanup PR, this one aims to be the last one, definitely dropping flat from all reader and related names. Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant). The changes in this PR are entirely mechanical, mostly just search-and-replace. Code cleanup, no backport required. Closes scylladb/scylladb#24087 * github.com:scylladb/scylladb: test/boost/mutation_reader_another_test: drop v2 from reader and related names test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/ test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/ test/boost/mutation_test: s/consumer_v2/consumer/ test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/ readers/mutation_readers: s/generating_reader_v2/generating_reader/ readers/mutation_readers: s/delegating_reader_v2/delegating_reader/ readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/ readers/mutation_source: s/make_reader_v2/make_mutation_reader/ readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/ readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/ mutation/mutation_compactor: drop v2 from compactor and related names replica/table: s/make_reader_v2/make_mutation_reader/ mutation_writer: s/bucket_writer_v2/bucket_writer/ readers/queue: drop v2 from reader and related names readers/multishard: drop v2 from reader and related names readers/evictable: drop v2 from reader and related names readers/multi_range: remove flat from name	2025-05-11 22:22:35 +03:00
Botond Dénes	674d41e3e6	readers/mutation_source: s/make_reader_v2/make_mutation_reader/	2025-05-09 07:53:29 -04:00
Botond Dénes	b5170e27d0	replica/table: s/make_reader_v2/make_mutation_reader/	2025-05-09 07:53:29 -04:00
Botond Dénes	7ba3c3fec3	readers/multi_range: remove flat from name	2025-05-09 07:53:25 -04:00
Raphael S. Carvalho	28056344ba	replica: Fix take_storage_snapshot() running concurrently to merge completion Some background: When merge happens, a background fiber wakes up to merge compaction groups of sibling tablets into main one. It cannot happen when rebuilding the storage group list, since token metadata update is not preemptable. So a storage group, post merge, has the main compaction group and two other groups to be merged into the main. When the merge happens, those two groups are empty and will be freed. Consider this scenario: 1) merge happens, from 2 to 1 tablet 2) produces a single storage group, containing main and two other compaction groups to be merged into main. 3) take_storage_snapshot(), triggered by migration post merge, gets a list of pointer to all compaction groups. 4) t__s__s() iterates first on main group, yields. 5) background fiber wakes up, moves the data into main and frees the two groups 6) t__s__s() advances to other groups that are now freed, since step 5. 7) segmentation fault In addition to memory corruption, there's also a potential for data to escape the iteration in take_storage_snapshot(), since data can be moved across compaction groups in background, all belonging to the same storage group. That could result in data loss. Readers should all operate on storage group level since it can provide a view on all the data owned by a tablet replica. The movement of sstable from group A to B is atomic, but iteration first on A, then later on B, might miss data that was moved from B to A, before the iteration reached B. By switching to storage group in the interface that retrieves groups by token range, we guarantee that all data of a given replica can be found regardless of which compaction group they sit on. Fixes #23162. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24058	2025-05-09 14:07:06 +03:00
Raphael S. Carvalho	434c2c4649	replica: Fix use-after-free with concurrent schema change and sstable set update When schema is changed, sstable set is updated according to the compaction strategy of the new schema (no changes to set are actually made, just the underlying set type is updated), but the problem is that it happens without a lock, causing a use-after-free when running concurrently to another set update. Example: 1) A: sstable set is being updated on compaction completion 2) B: schema change updates the set (it's non deferring, so it happens in one go) and frees the set used by A. 3) when A resumes, system will likely crash since the set is freed already. ASAN screams about it: SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ... Fix is about deferring update of the set on schema change to compaction, which is triggered after new schema is set. Only strategy state and backlog tracker are updated immediately, which is fine since strategy doesn't depend on any particular implementation of sstable set, since patch "sstables: Implement sstable_set_impl::all_sstable_runs()". Fixes #22040. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-05-06 10:06:55 -03:00
Raphael S. Carvalho	c77f710a0c	sstables: Fix quadratic space complexity in partitioned_sstable_set Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was memtable flush storm, which creates many small "L0" sstables that spans roughly the entire token range. Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried. And for range scans, there can be an increase in memory usage, but not significant because the sstables span an wide range and would have been selected in the combined reader if the range of scan overlaps with them. Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly. Fixes #23634. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	21d1e78457	compaction: Wire table_state into make_sstable_set() This will be useful for feeding token range owned by compaction group into sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	59dad2121f	compaction: Introduce token_range() to table_state This provides a way for compaction layer to know compaction group's token range. It will be important for sstable set impl to know the token range of underlying group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Wojciech Mitros	d7bd86591e	view_info: don't re-set base_info after construction In the previous commits we made sure that the base info is not dependent on the base schema version, and the info dependent on the base schema version is calculated when it's needed. In this patch we remove the unnecessary re-setting of the base_info. The set_base_info method isn't removed completely, because it also has a secondary function - zeroing the view_info fields other than base_info. Because of this, in this patch we rename it accordingly and limit its use to the updates caused by a base schema change.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	ea462efa3d	base_info: remove base_info snapshot semantics The base info in view schemas no longer changes on base schema updates, so saving the base info with a view schema from a specific point in time doesn't provide any additional benefits. In this patch we remove the code using the base_and_view snapshots as it's no longer useful.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	05fce91945	schema_registry: store base info instead of base schema for view entries In the following patch we plan to remove the base schema from the base_info to make the base_info immutable. To do that, we first prepare the schema registry for the change; we need to be able to create view schemas from frozen schemas there and frozen schemas have no information about the base table. Unless we do this change, after base schemas are removed from the base info, we'll no longer be able to load a view schema to the schema registry without looking up the base schema in the database. This change also required some updates to schema building: * we add a method for unfreezing a view schema with base info instead of a base schema * we make it possible to use schema_builder with a base info instead of a base schema * we add a method for creating a view schema from mutations with a base info instead of a base schema * we add a view_info constructor withat base info instead of a base schema * we update the naming in schema_registry to reflect the usage of base info instead of base schema	2025-04-24 01:08:39 +02:00
Botond Dénes	6172ff501f	readers: mv reversing_v2.hh reversing.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	a9d75c4f9d	readers: mv empty_v2.hh empty.hh Completely mechanical change.	2025-04-16 04:32:56 -04:00
Benny Halevy	e1fe82ed33	utils: phased_barrier, pluggable: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	7342a57cbb	replica: table: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	52e1ce7f0d	replica: compaction_group, storage_group: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Tomasz Grabiec	06b49bdf69	Merge 'row_cache: don't garbage-collect tombstones which cover data in memtables' from Botond Dénes The row cache can garbage-collect tombstones in two places: 1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it; 2) During reads - reads now compact data including garbage collection; In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables. This PR includes fixes for (2), which were not handled at all currently. (1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included. Fixes: https://github.com/scylladb/scylladb/issues/23291 Fixes: https://github.com/scylladb/scylladb/issues/23252 The fix will need backport to all live release. Closes scylladb/scylladb#23255 * github.com:scylladb/scylladb: test/boost/row_cache_test: add memtable overlap check tests replica/table: add error injection to memtable post-flush phase utils/error_injection: add a way to set parameters from error injection points test/cluster: add test_data_resurrection_in_memtable.py test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts replica/mutation_dump: don't assume cells are live replica/database: do_apply() add error injection point replica: improve memtable overlap checks for the cache replica/memtable: add is_merging_to_cache() db/row_cache: add overlap-check for cache tombstone garbage collection mutation/mutation_compactor: copy key passed-in to consume_new_partition()	2025-04-08 17:26:58 +02:00
Raphael S. Carvalho	0f59deffaa	replica: Fix truncate and drop table after tablet migration happens When running those operations after a tablet replica is migrated away from a shard, an assert can fail resulting in a crash. Status quo (around the assert in truncate procedure): 1) Highest RP seen by table is saved in low_mark, and the current time in low_mark_at. 2) Then compaction is disabled in order to not mix data written before truncate, and data written later. 3) Then memtable is flushed in order for the data written before truncate to be available in sstables and then removed. 4) Now, current time is saved in truncated_at, which is supposedly the time of truncate to decide which sstables to remove. Note: truncated_at is likely above low_mark_at due to steps 2 and 3. The interesting part of the assert is: (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp) Note: RP in the assert above is the highest RP among all sstables generated before truncated_at. RP is retrieved by table::discard_sstables(). If truncated_at > low_mark_at, maybe newer data was written during steps 2 and 3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with RP > low_mark. So assert's 2nd condition is there to defend against the scenario above. truncated_at and low_mark_at uses millisecond granularity, so even if truncated_at == low_mark_at, data could have been written in steps 2 and 3 (during same MS window), failing the assert. This is fragile. Reproducer: To reproduce the problem, truncated_at must be > low_mark_at, which can easily happen with both drop table and truncate due to steps 2 and 3. If a shard has 2 or more tablets, the table's highest RP refer to just one tablet in that shard. If the tablet with the highest RP is migrated away, then the sstables in that shard will have lower RP than the recorded highest RP (it's a table wide state, which makes sense since CL is shared among tablets). So when either drop table or truncate runs, low_mark will be potentially bigger than highest RP retrieved from sstables. Proposed solution: The current assert is hacked to not fail if writes sneak in, during steps 2 and 3, but it's still fragile and seems not to serve its real purpose, since it's allowing for RP > low_mark. We should be able to say that low_mark >= RP, as a way of asserting we're not leaving data targeted by truncate behind (or that we're not removing the wrong data). But the problem is that we're saving low_mark in step 1, before preparation steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying all data written so far is targeted for removal. But as of today, low_mark refers to all data written up to step 1. So low_mark is now only one set before issuing flush, and also accounts for all potentially flushed data. Fixes #18059. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23560	2025-04-08 07:32:58 +03:00
Botond Dénes	6c1f6427b3	replica/table: add error injection to memtable post-flush phase After the memtable was flushed to disk, but before it is merged to cache. The injection point will only active for the table specified in the "table_name" injection parameter.	2025-04-08 00:11:36 -04:00
Botond Dénes	d126ea09ba	replica: improve memtable overlap checks for the cache The current memtable overlap check that is used by the cache -- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only checks the active memtable, so memtables which are either being flushed or are already flushed and also have active reads against them do not participate in the overlap check. This can result in temporary data resurrection, where a cache read can garbage-collect a tombstone which still covers data in a flushing or flushed memtable, which still have active read against it. To prevent this, extend the overlap check to also consider all of the memtable list. Furthermore, memtable_list::erase() now places the removed (flushed) memtable in an intrusive list. These entries are alive only as long as there are readers still keeping an `lw_shared_ptr<memtable>` alive. This list is now also consulted on overlap checks.	2025-04-08 00:11:35 -04:00
Botond Dénes	6b5b563ef7	db/row_cache: add overlap-check for cache tombstone garbage collection The cache should not garbage-collect tombstone which cover data in the memtable. Add overlap checks (get_max_purgeable) to garbage collection to detect tombstones which cover data in the memtable and to prevent their garbage collection.	2025-04-08 00:11:35 -04:00
Lakshmi Narayanan Sreethar	750f4baf44	replica/table::do_apply : do not check for async gate's closure The `table::do_apply()` method verifies if the compaction group's async gate is open to determine if the compaction group is active. Closing this async gate prevents any new operations but waits for existing holders to exit, allowing their operations to complete. When holding a gate, holders will observe the gate as closed when it is being closed, but this is irrelevant as they are already inside the gate and are allowed to complete. All the callers of `table::do_apply()` already enter the gate before calling the method. So, the async gate check inside `table::do_apply()` will erroneously throw an exception when the compaction group is closing despite holding the gate. This commit removes the check to prevent this from happening. Fixes #23348 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#23579	2025-04-07 13:27:22 +03:00
Botond Dénes	a0d8102a1f	replica/memtable: s/make_flat_reader/make_mutation_reader/ Following the recent refactoring of removing "flat" and "v2" from reader names, replacing all the fully qualified names with simply "mutation_reader". Closes scylladb/scylladb#23346	2025-04-01 17:58:13 +03:00
Michał Chojnowski	48c06c7e4b	database: add take_sstable_set_snapshot() We want a method that will allow us to take a stable snapshot of SSTables, to asynchronously compute some stats on them. But `take_storage_snapshot` is overly invasive for that, because it flushes memtables on each call. (If `take_storage_snapshot` was, for example, called repetitively, it could create a ton of small memtables and lead to trouble). This commit adds a weaker version which only takes a snapshot of existing SSTables, and doesn't flush memtables by itself. This will be useful for dictionary training, which doesn't care about the semantics of SSTables, only their rough statistical properties.	2025-04-01 00:07:28 +02:00
Michał Chojnowski	e23fdc0799	table: fix a race in table::take_storage_snapshot() `safe_foreach_sstable` doesn't do its job correctly. It iterates over an sstable set under the sstable deletion lock in an attempt to ensure that SSTables aren't deleted during the iteration. The thing is, it takes the deletion lock after the SSTable set is already obtained, so SSTables might get unlinked before we take the lock. Remove this function and fix its usages to obtain the set and iterate over it under the lock. Closes scylladb/scylladb#23397	2025-03-31 09:40:32 +03:00
Gleb Natapov	cb2b874942	table: use host id based get_endpoint_state_ptr and skip id->ip translation	2025-03-11 12:09:20 +02:00
Raphael S. Carvalho	fedd838b9d	replica: Fix race of some operations like cleanup with snapshot There are two semaphores in table for synchronizing changes to sstable list: sstable_set_mutation_sem: used to serialize two concurrent operations updating the list, to prevent them from racing with each other. sstable_deletion_sem: A deletion guard, used to serialize deletion and iteration over the list, to prevent iteration from finding deleted files on disk. they're always taken in this order to avoid deadlocks: sstable_set_mutation_sem -> sstable_deletion_sem. problem: A = tablet cleanup B = take_snapshot() 1) A acquires sstable_set_mutation_sem for updating list 2) A acquires sstable_deletion_sem, then delete sstable before updating list 3) A releases sstable_deletion_sem, then yield 4) B acquires sstable_deletion_sem 5) B iterates through list and bumps sstable deleted in step 2 6) B fails since it cannot find the file on disk Initial reaction is to say that no procedure must delete sstable before updating the list, that's true. But we want a iteration, running concurrently to cleanup, to not find sstables being removed from the system. Otherwise, e.g. snapshot works with sstables of a tablet that was just cleaned up. That's achieved by serializing iteration with list update. Since sstable_deletion_sem is used within the scope of deletion only, it's useless for achieving this. Cleanup could acquire the deletion sem when preparing list updates, and then pass the "permit" to deletion function, but then sstable_deletion_sem would essentially become sstable_set_mutation_sem, which was created exactly to protect the list update. That being said, it makes sense to merge both semaphores. Also things become easier to reason about, and we don't have to worry about deadlocks anymore. The deletion goes through sstable_list_builder, which holds a permit throughout its lifetime, which guarantees that list updates and deletion are atomic to other concurrent operations. The interface becomes less error prone with that. It allowed us to find discard_sstables() was doing deletion without any permit, meaning another race could happen between truncate and snapshot. So we're fixing race of (truncate\|cleanup) with take_snapshot, as far as we know. It's possible another unknown races are fixed as well. Fixes #23049. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23117	2025-03-06 11:00:48 +02:00
Pavel Emelyanov	86b3e9b50b	code: Move checked-file-impl.hh to util/ fixes: #22100 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23123	2025-03-06 10:22:05 +02:00
Kefu Chai	5be39740a8	tree: migrate from boost::find to std::ranges algorithms Replace boost::find() calls with std::ranges::find() and std::ranges::contains() to leverage modern C++ standard library features. This change reduces external dependencies and modernizes the codebase. The following changes were made: - Replaced boost::find() with std::ranges::find() where index/iterator is needed - Used std::ranges::contains() for simple element presence checks Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22920	2025-02-20 09:28:57 +03:00
Kefu Chai	a18069fad7	service: migrate from boost::range::remove_if() to std::ranges::remove_if Replace boost::range::remove_if() with the standard library's std::ranges::remove_if() to reduce external dependencies and simplify the codebase. This change eliminates the requirement for boost::range and makes the implementation more maintainable. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-02-11 09:15:14 +08:00
Avi Kivity	6b85c03221	Merge 'split: run set_split_mode() on all storage groups during all_storage_groups_split()' from Ferenc Szili `tablet_storage_group_manager::all_storage_groups_split()` calls `set_split_mode()` for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using `std::ranges::all_of()` which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (`set_split_mode()`) returning false. `set_split_mode()` creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups. The missing split compaction groups are later created in `tablet_storage_group_manager::split_all_storage_groups()` which also calls `set_split_mode()`, and that is the reason why split completes successfully. The problem is that `tablet_storage_group_manager::all_storage_groups_split()` runs under a group0 guard, but `tablet_storage_group_manager::split_all_storage_groups()` does not. This can cause problems with operations which should exclude with compaction group creation. i.e. DROP TABLE/DROP KEYSPACE Fixes #22431 This is a bugfix and should be back ported to versions with tablets: 6.1 6.2 and 2025.1 Closes scylladb/scylladb#22330 * github.com:scylladb/scylladb: test: add reproducer and test for fix to split ready CG creation table: run set_split_mode() on all storage groups during all_storage_groups_split()	2025-01-27 13:13:42 +01:00
Benny Halevy	23284f038f	table: flush: synchronize with stop() When the table is stopped, all compaction groups are stopped, and as part of that, they are flushing their memtables. To synchronize with stop-induced flush operation, move _pending_flushes_phaser.stop() later in table::stop(), after all compaction groups are flushed and stopped. This way, in table::flush, if we see that the phaser is already closed, we know that there is nothing to flush, otherwise we start a flush operation that would be waited on by a parallel table::stop(). Fixes #22243 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#22339	2025-01-22 09:23:09 +02:00
Ferenc Szili	8bff7786a8	test: add reproducer and test for fix to split ready CG creation This adds a reproducer for #22431 In cases where a tablet storage group manager had more than one storage group, it was possible to create compaction groups outside the group0 guard, which could create problems with operations which should exclude with compaction group creation.	2025-01-21 18:43:10 +01:00
Ferenc Szili	24e8d2a55c	table: run set_split_mode() on all storage groups during all_storage_groups_split() tablet_storage_group_manager::all_storage_groups_split() calls set_split_mode() for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using std::ranges::all_of() which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurance of the predicate (set_split_mode()) returning false. set_split_mode() creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups. The missing split compaction groups are later created in tablet_storage_group_manager::split_all_storage_groups() which also calls set_split_mode(), and that is the reason why split completes successfully. The problem is that tablet_storage_group_manager::all_storage_groups_split() runs under a group0 guard, and tablet_storage_group_manager::split_all_storage_groups() does not. This can cause problems with operations which should exclude with compaction group creation. i.e. DROP TABLE/DROP KEYSPACE	2025-01-21 18:42:53 +01:00
Tomasz Grabiec	8059090a29	Merge 'Cache base info for view schemas in the schema registry' from Wojciech Mitros Currently, when we load a frozen schema into the registry, we lose the base info if the schema was of a view. Because of that, in various places we need to set the base info again, and in some codepaths we may miss it completely, which may make us unable to process some requests (for example, when executing reverse queries on views). Even after setting the base info, we may still lose it if the schema entry gets deactivated due to all `schema_ptr`s temporarily dying. To fix this, this patch adds the base schema to the registry, alongside the view schema. We store just the frozen base schema, so that we can transfer it across shards. With the base schema, we can now set the base info when returning the schema from the registry. As a result, we can now assume that all view schemas returned by the registry have base_info set. In this series we also make sure that the view schemas in the registry are kept up-to-date in regards to base schema changes. Fixes https://github.com/scylladb/scylladb/issues/21354 This issue is a bug, so adding backport labels 6.1 and 6.2 Closes scylladb/scylladb#21862 * github.com:scylladb/scylladb: test: add test for schema registry maintaining base info for views schema_registry: avoid setting base info when getting the schema from registry schema_registry: update cached base schemas when updating a view schema_registry: cache base schemas for views db: set base info before adding schema to registry	2025-01-21 00:17:54 +01:00
Botond Dénes	47989b1503	Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk In this change, tablet_virtual_task starts supporting tablet resize (i.e. split and merge). Users can see running resize tasks - finished tasks are not presented with the task manager API. A new task state "suspended" is added. If a resize was revoked, it will appear to users as suspended. We assume that the resize was revoked when the tablet number didn't change. Fixes: #21366. Fixes: #21367. No backport, new feature Closes scylladb/scylladb#21891 * github.com:scylladb/scylladb: test: boost: check resize_task_info in tablet_test.cc test: add tests to check revoked resize virtual tasks test: add tests to check the list of resize virtual tasks test: add tests to check spilt and merge virtual tasks status test: test_tablet_tasks: generalize functions replica: service: add split virtual task's children replica: service: pass parent info down to storage_group::split tasks: children of virtual tasks aren't internal by default tasks: initialize shard in task_info ctor service: extend tablet_virtual_task::abort service: retrun status_helper struct from tablet_virtual_task::get_status_helper service: extend tablet_virtual_task::wait tasks: add suspended task state service: extend tablet_virtual_task::get_status service: extend tablet_virtual_task::contains service: extend tablet_virtual_task::get_stats service: add service::task_manager_module::get_nodes tasks: add task_manager::get_nodes tasks: drop noexcept from module::get_nodes replica: service: add resize_task_info static column to system.tablets locator: extend tablet_task_info to cover resize tasks	2025-01-17 14:24:07 +02:00
Botond Dénes	55963f8f79	replica: remove noexcept from token -> tablet resolution path The methods to resolve a key/token/range to a table are all noexcept. Yet the method below all of these, `storage_group_for_id()` can throw. This means that if due to any mistake a tablet without local replica is attempted to be looked up, it will result in a crash, as the exception bubbles up into the noexcept methods. There is no value in pretending that looking up the tablet replica is noexcept, remove the noexcept specifiers so that any bad lookup only fails the operation at hand and doesn't crash the node. This is especially relevant to replace, which still has a window where writes can arrive for tablets that don't (yet) have a local replica. Currently, this results in a crash. After this patch, this will only fail the writes and the replace can move on. Fixes: #21480 Closes scylladb/scylladb#22251	2025-01-17 11:24:09 +03:00
Kefu Chai	7215d4bfe9	utils: do not include unused headers these unused includes were identifier by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. please note, because quite a few source files relied on `utils/to_string.hh` to pull in the specialization of `fmt::formatter<std::optional<T>>`, after removing `#include <fmt/std.h>` from `utils/to_string.hh`, we have to include `fmt/std.h` directly. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2025-01-14 07:56:39 -05:00
Aleksandra Martyniuk	062f155fd6	replica: service: add split virtual task's children offstrategy_compaction_task_executor and split_compaction_task_executor running as a part of the split become children of a split virtual task.	2025-01-10 10:03:08 +01:00
Aleksandra Martyniuk	7ef6900837	replica: service: pass parent info down to storage_group::split Pass task_info down to storage_group::split. In the following patches, it will be used to set the parent of offstrategy_compaction_task_executor and split_compaction_task_executor running as a part of the split. The task_info param will contain task info of a split virtual task.	2025-01-10 10:03:08 +01:00
Tomasz Grabiec	4c89e62470	Merge 'Phased barrier improvements' from Benny Halevy - utils: phased_barrier: advance_and_await: allocate new gate only when needed - utils: phased_barrier: add close() method - and use in existing services * Improvement. No backport needed Closes scylladb/scylladb#22018 * github.com:scylladb/scylladb: utils: phased_barrier: add close() method utils: phased_barrier: advance_and_await: allocate new gate only when needed	2025-01-03 18:51:23 +01:00
Avi Kivity	4905b1bf76	Merge 'table: make update_effective_replication_map sync again' from Benny Halevy Commit `f2ff701489` introduced a yield in update_effective_replication_map that might cause the storage_group manager to be inconsistent with the new effective_replication_map (e.g. if yielding right before calling `handle_tablet_split_completion`. Also, yielding inside storage_service::replicate_to_all_cores update loop means that base tables and their views aren't updated atomically, that caused scylladb/scylladb#17786 This change essentially reverts `f2ff701489` and makes handle_tablet_split_completion synchronous too. The stopped compaction groups future is kept as a member and storage_group_manager::stop() consumes this future during table::stop(). - storage_service: replicate_to_all_cores: update base and view tables atomically Currently, the loop updating all tables (including views) with the new effective_replication_map may yield, and therefore expose a state where the base and view tables effective_replication_map and topology are out of sync (as seen in scylladb/scylladb#17786) To prevent that, loop over all base tables and for each table update the base table and all views atomically, without yielding, and so allow yielding only between base tables. * Regression was introduced in `f2ff701489`, so backport is required to 6.x, 2024.2 Closes scylladb/scylladb#21781 * github.com:scylladb/scylladb: storage_service: replicate_to_all_cores: clear_gently pending erms test_mv_topology_change: drop delay_after_erm_update injection case storage_service: replicate_to_all_cores: update base and view tables atomically table: make update_effective_replication_map sync again	2024-12-30 23:42:06 +02:00
Wojciech Mitros	82f2e1b44c	schema_registry: update cached base schemas when updating a view The schema registry now holds base schemas for view schemas. The base schema may change without changing the view schema, so to preserve the change in the schema registry, we also update the base schema in the registry when updating the base info in the view schema.	2024-12-30 14:56:18 +01:00
Wojciech Mitros	6f11edbf3f	db: set base info before adding schema to registry In the following patches, we'll assure that view schemas returned by the schema registry always have base info set. To prepare for that, make sure that the base info is always set before inserting it into schema registry,	2024-12-30 14:56:17 +01:00
Kefu Chai	6acc5294a4	treewide: migrate from boost::copy_range to std::ranges::to now that we are allowed to use C++23. we now have the luxury of using `std::ranges::to`. in this change, we: - replace `boost::copy_range` to `std::ranges::to` - remove unused `#include` of boost headers Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21880	2024-12-26 11:46:26 +02:00
Benny Halevy	a25c3eaa1c	utils: phased_barrier: add close() method When services are stopped we generally want to call advance_and_await(), but we should also prevent starting new operations, so close() would do that be closing the phased_barrier active gate (which implicitly also awaits past operations similar to advance_and_await()). Add unit tests for that and use in existing services. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-26 06:54:07 +02:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Raphael S. Carvalho	013e0d53ff	replica: Fix use-after-free due to a race between split and cleanup There is an assumption that every destroyed compaction_group will be stopped first. Otherwise, the group is still referenced by compaction manager and can use it after freed. That's what happened in issue #21867 in the context of merge. The issue is pre-existing but was made more likely with merge. One problem is a race between split and cleanup, where if split is emitted while cleanup is stopping groups, it can happen split preparation adds new groups that will never be closed, since cleanup is already past the group stopping step. Another problem found is that split completion handler is not accounting for possible existence of merging groups, if split happens right after merge. Split completion handler should stop all empty groups that previously had data split from them. The problems will be fixed by guaranteeing that new groups will not be added for a tablet being migrated away, and that empty groups are properly closed when handling split completion. A reproducer was added. Fixes #21867. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#21920	2024-12-16 13:19:26 +02:00
Benny Halevy	10c4cf930c	table: make update_effective_replication_map sync again Commit `f2ff701489` introduced a yield in update_effective_replication_map that might cause the storage_group manager to be inconsistent with the new effective_replication_map (e.g. if yielding right before calling `handle_tablet_split_completion`. Also, yielding inside storage_service::replicate_to_all_cores update loop means that base tables and their views aren't updated atomically, that caused scylladb/scylladb#17786 This change essentially reverts `f2ff701489` and makes handle_tablet_split_completion synchronous too. The stopped compaction groups future is kept as a memebr and storage_group_manager::stop() consumes this future during table::stop(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-15 11:45:08 +02:00

1 2 3 4 5 ...

613 Commits