scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-29 04:37:00 +00:00

Author	SHA1	Message	Date
Raphael S. Carvalho	28056344ba	replica: Fix take_storage_snapshot() running concurrently to merge completion Some background: When merge happens, a background fiber wakes up to merge compaction groups of sibling tablets into main one. It cannot happen when rebuilding the storage group list, since token metadata update is not preemptable. So a storage group, post merge, has the main compaction group and two other groups to be merged into the main. When the merge happens, those two groups are empty and will be freed. Consider this scenario: 1) merge happens, from 2 to 1 tablet 2) produces a single storage group, containing main and two other compaction groups to be merged into main. 3) take_storage_snapshot(), triggered by migration post merge, gets a list of pointer to all compaction groups. 4) t__s__s() iterates first on main group, yields. 5) background fiber wakes up, moves the data into main and frees the two groups 6) t__s__s() advances to other groups that are now freed, since step 5. 7) segmentation fault In addition to memory corruption, there's also a potential for data to escape the iteration in take_storage_snapshot(), since data can be moved across compaction groups in background, all belonging to the same storage group. That could result in data loss. Readers should all operate on storage group level since it can provide a view on all the data owned by a tablet replica. The movement of sstable from group A to B is atomic, but iteration first on A, then later on B, might miss data that was moved from B to A, before the iteration reached B. By switching to storage group in the interface that retrieves groups by token range, we guarantee that all data of a given replica can be found regardless of which compaction group they sit on. Fixes #23162. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24058	2025-05-09 14:07:06 +03:00
Raphael S. Carvalho	21d1e78457	compaction: Wire table_state into make_sstable_set() This will be useful for feeding token range owned by compaction group into sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Benny Halevy	52e1ce7f0d	replica: compaction_group, storage_group: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Raphael S. Carvalho	fedd838b9d	replica: Fix race of some operations like cleanup with snapshot There are two semaphores in table for synchronizing changes to sstable list: sstable_set_mutation_sem: used to serialize two concurrent operations updating the list, to prevent them from racing with each other. sstable_deletion_sem: A deletion guard, used to serialize deletion and iteration over the list, to prevent iteration from finding deleted files on disk. they're always taken in this order to avoid deadlocks: sstable_set_mutation_sem -> sstable_deletion_sem. problem: A = tablet cleanup B = take_snapshot() 1) A acquires sstable_set_mutation_sem for updating list 2) A acquires sstable_deletion_sem, then delete sstable before updating list 3) A releases sstable_deletion_sem, then yield 4) B acquires sstable_deletion_sem 5) B iterates through list and bumps sstable deleted in step 2 6) B fails since it cannot find the file on disk Initial reaction is to say that no procedure must delete sstable before updating the list, that's true. But we want a iteration, running concurrently to cleanup, to not find sstables being removed from the system. Otherwise, e.g. snapshot works with sstables of a tablet that was just cleaned up. That's achieved by serializing iteration with list update. Since sstable_deletion_sem is used within the scope of deletion only, it's useless for achieving this. Cleanup could acquire the deletion sem when preparing list updates, and then pass the "permit" to deletion function, but then sstable_deletion_sem would essentially become sstable_set_mutation_sem, which was created exactly to protect the list update. That being said, it makes sense to merge both semaphores. Also things become easier to reason about, and we don't have to worry about deadlocks anymore. The deletion goes through sstable_list_builder, which holds a permit throughout its lifetime, which guarantees that list updates and deletion are atomic to other concurrent operations. The interface becomes less error prone with that. It allowed us to find discard_sstables() was doing deletion without any permit, meaning another race could happen between truncate and snapshot. So we're fixing race of (truncate\|cleanup) with take_snapshot, as far as we know. It's possible another unknown races are fixed as well. Fixes #23049. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23117	2025-03-06 11:00:48 +02:00
Kefu Chai	57b14220ce	tree: remove unused "#include"s these unused includes were identified by clang-include-cleaner. after auditing these source files, all of the reports have been confirmed. in which, instead of using `seastarx.hh`, `readers/mutation_reader.hh`, use `using seastar::future` to include `future` in the global namespace, this makes `readers/mutation_reader.hh` a header exposing `future<>`, but this is not a good practice, because, unlike `seastarx.hh` or `seastar/core/future.hh`, `reader/mutation_reader.hh` is not responsible for exposing seastar declarations. so, we trade the using statement for `#include "seastarx.hh"` in that file to decouple the source files including it from this header because of this statement. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22439	2025-01-28 14:12:06 +03:00
Botond Dénes	47989b1503	Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk In this change, tablet_virtual_task starts supporting tablet resize (i.e. split and merge). Users can see running resize tasks - finished tasks are not presented with the task manager API. A new task state "suspended" is added. If a resize was revoked, it will appear to users as suspended. We assume that the resize was revoked when the tablet number didn't change. Fixes: #21366. Fixes: #21367. No backport, new feature Closes scylladb/scylladb#21891 * github.com:scylladb/scylladb: test: boost: check resize_task_info in tablet_test.cc test: add tests to check revoked resize virtual tasks test: add tests to check the list of resize virtual tasks test: add tests to check spilt and merge virtual tasks status test: test_tablet_tasks: generalize functions replica: service: add split virtual task's children replica: service: pass parent info down to storage_group::split tasks: children of virtual tasks aren't internal by default tasks: initialize shard in task_info ctor service: extend tablet_virtual_task::abort service: retrun status_helper struct from tablet_virtual_task::get_status_helper service: extend tablet_virtual_task::wait tasks: add suspended task state service: extend tablet_virtual_task::get_status service: extend tablet_virtual_task::contains service: extend tablet_virtual_task::get_stats service: add service::task_manager_module::get_nodes tasks: add task_manager::get_nodes tasks: drop noexcept from module::get_nodes replica: service: add resize_task_info static column to system.tablets locator: extend tablet_task_info to cover resize tasks	2025-01-17 14:24:07 +02:00
Botond Dénes	55963f8f79	replica: remove noexcept from token -> tablet resolution path The methods to resolve a key/token/range to a table are all noexcept. Yet the method below all of these, `storage_group_for_id()` can throw. This means that if due to any mistake a tablet without local replica is attempted to be looked up, it will result in a crash, as the exception bubbles up into the noexcept methods. There is no value in pretending that looking up the tablet replica is noexcept, remove the noexcept specifiers so that any bad lookup only fails the operation at hand and doesn't crash the node. This is especially relevant to replace, which still has a window where writes can arrive for tablets that don't (yet) have a local replica. Currently, this results in a crash. After this patch, this will only fail the writes and the replace can move on. Fixes: #21480 Closes scylladb/scylladb#22251	2025-01-17 11:24:09 +03:00
Aleksandra Martyniuk	7ef6900837	replica: service: pass parent info down to storage_group::split Pass task_info down to storage_group::split. In the following patches, it will be used to set the parent of offstrategy_compaction_task_executor and split_compaction_task_executor running as a part of the split. The task_info param will contain task info of a split virtual task.	2025-01-10 10:03:08 +01:00
Avi Kivity	4905b1bf76	Merge 'table: make update_effective_replication_map sync again' from Benny Halevy Commit `f2ff701489` introduced a yield in update_effective_replication_map that might cause the storage_group manager to be inconsistent with the new effective_replication_map (e.g. if yielding right before calling `handle_tablet_split_completion`. Also, yielding inside storage_service::replicate_to_all_cores update loop means that base tables and their views aren't updated atomically, that caused scylladb/scylladb#17786 This change essentially reverts `f2ff701489` and makes handle_tablet_split_completion synchronous too. The stopped compaction groups future is kept as a member and storage_group_manager::stop() consumes this future during table::stop(). - storage_service: replicate_to_all_cores: update base and view tables atomically Currently, the loop updating all tables (including views) with the new effective_replication_map may yield, and therefore expose a state where the base and view tables effective_replication_map and topology are out of sync (as seen in scylladb/scylladb#17786) To prevent that, loop over all base tables and for each table update the base table and all views atomically, without yielding, and so allow yielding only between base tables. * Regression was introduced in `f2ff701489`, so backport is required to 6.x, 2024.2 Closes scylladb/scylladb#21781 * github.com:scylladb/scylladb: storage_service: replicate_to_all_cores: clear_gently pending erms test_mv_topology_change: drop delay_after_erm_update injection case storage_service: replicate_to_all_cores: update base and view tables atomically table: make update_effective_replication_map sync again	2024-12-30 23:42:06 +02:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Kefu Chai	e65fc35b5e	replica: do not include unused headers these unused includes are identified by clang-include-cleaner. after auditing the source files, all of the reports have been confirmed. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21836	2024-12-18 13:52:57 +02:00
Raphael S. Carvalho	013e0d53ff	replica: Fix use-after-free due to a race between split and cleanup There is an assumption that every destroyed compaction_group will be stopped first. Otherwise, the group is still referenced by compaction manager and can use it after freed. That's what happened in issue #21867 in the context of merge. The issue is pre-existing but was made more likely with merge. One problem is a race between split and cleanup, where if split is emitted while cleanup is stopping groups, it can happen split preparation adds new groups that will never be closed, since cleanup is already past the group stopping step. Another problem found is that split completion handler is not accounting for possible existence of merging groups, if split happens right after merge. Split completion handler should stop all empty groups that previously had data split from them. The problems will be fixed by guaranteeing that new groups will not be added for a tablet being migrated away, and that empty groups are properly closed when handling split completion. A reproducer was added. Fixes #21867. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#21920	2024-12-16 13:19:26 +02:00
Benny Halevy	10c4cf930c	table: make update_effective_replication_map sync again Commit `f2ff701489` introduced a yield in update_effective_replication_map that might cause the storage_group manager to be inconsistent with the new effective_replication_map (e.g. if yielding right before calling `handle_tablet_split_completion`. Also, yielding inside storage_service::replicate_to_all_cores update loop means that base tables and their views aren't updated atomically, that caused scylladb/scylladb#17786 This change essentially reverts `f2ff701489` and makes handle_tablet_split_completion synchronous too. The stopped compaction groups future is kept as a memebr and storage_group_manager::stop() consumes this future during table::stop(). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-12-15 11:45:08 +02:00
Raphael S. Carvalho	70b3963b8d	replica: Implement merging of compaction groups on merge completion When handling merge completion, compaction groups that belonged to sibling tablets are placed into the same storage group, since those tablets become one after merge. In order to merge two groups, the source group needs its memtable to be flushed first, such that all the data can be moved into the destination. The handling happens in update_effective_replication_map() which cannot afford to wait for I/O, so the group merge will happen in background. There's a fiber that will wake up on merge completion and will iterate through the new set of storage groups (after merge), and will work on merging additional compaction groups into the main one. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-12-03 21:47:01 -03:00
Raphael S. Carvalho	907739f3d1	replica: Handle tablet merge completion Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-12-03 21:47:01 -03:00
Lakshmi Narayanan Sreethar	91148e7747	compaction: remove unused `update_sstable_lists_on_off_strategy_completion` Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-11-28 11:25:11 +05:30
Lakshmi Narayanan Sreethar	4f7b1c0bcd	compaction_group: rename `update_main_sstable_list_on_compaction_completion` Rename `update_main_sstable_list_on_compaction_completion` to `update_sstable_sets_on_compaction_completion` as the method updates both main and maintenance sstable sets now. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-11-28 11:25:11 +05:30
Lakshmi Narayanan Sreethar	c4db4abcae	replica/table: implement `get_token_range_after_split()` wrappers Expose the functionality of `tablet_map::get_token_range_after_split()` via the replica::table class. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-11-11 12:24:00 +05:30
Lakshmi Narayanan Sreethar	6a357b55e3	compaction_group: track maximum timestamp across all sstables This will be used in a following patch to decide if the compacting reader has to check the memtables before purging a tombstone. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-10-18 19:19:11 +05:30
Ferenc Szili	e472844a78	database::table: add tombstone_gc_enabled(locator::tablet_id) This change adds the flag tombstone_gc_enabled to compaction_group. The value of this flag will be set in tablet_storage_group_manager::update_effective_replication_map().	2024-10-02 16:24:45 +02:00
Raphael S. Carvalho	cf58674029	replica: Fix schema change during migration cleanup During migration cleanup, there's a small window in which the storage group was stopped but not yet removed from the list. So concurrent operations traversing the list could work with stopped groups. During a test which emitted schema changes during migrations, a failure happened when updating the compaction strategy of a table, but since the group was stopped, the compaction manager was unable to find the state for that group. In order to fix it, we'll skip stopped groups when traversing the list since they're unused at this stage of migration and going away soon. Fixes #20699. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#20798	2024-09-30 17:30:38 +03:00
Botond Dénes	ea29fe579b	Merge 'replica: ignore cleanup of deallocated storage group' from Aleksandra Martyniuk Cleanup of a deallocated tablet throws an exception. Since failed cleanup is retried, we end up in an infinite loop. Ignore cleanup of deallocated storage groups. Fixes: #19752. Needs to be backported to all branches with tablets (6.0 and later) Closes scylladb/scylladb#20584 * github.com:scylladb/scylladb: test: check if cleanup of deallocated sg is ignored replica: ignore cleanup of deallocated storage group	2024-09-16 09:22:56 +03:00
Aleksandra Martyniuk	20d6cf55f2	replica: ignore cleanup of deallocated storage group Currently, attempt to cleanup deallocated storage group throws an exception. Failed tablet cleanup is retried, stucking in an endless loop. Ignore cleanup of deallocated storage group.	2024-09-13 13:00:53 +02:00
Benny Halevy	6f202cf48b	compaction_group, storage_group, table_state: add extended timestamp stats getters To return the minimum live timestamp and live row-marker timestamp across a compaction_group, storage_group, or table_state. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-09-10 19:05:57 +03:00
Aleksandra Martyniuk	5005e19de7	compaction: fix compaction group stop reason compaction_manager::remove passes "table removal" as a reason of stopping ongoing compactions, but currently remove method is also called when a tablet is migrated or split. Pass the actual reason of compaction stop, so that logs aren't misleading.	2024-08-21 12:42:09 +02:00
Raphael S. Carvalho	74612ad358	tablets: Fix race between repair and split Consider the following: T 0 split prepare starts 1 repair starts 2 split prepare finishes 3 repair adds unsplit sstables 4 repair ends 5 split executes If repair produces sstable after split prepare phase, the replica will not split that sstable later, as prepare phase is considered completed already. That causes split execution to fail as replicas weren't really prepared. This also can be triggered with load-and-stream which shares the same write (consumer) path. The approach to fix this is the same employed to prevent a race between split and migration. If migration happens during prepare phase, it can happen source misses the split request, but the tablet will still be split on the destination (if needed). Similarly, the repair writer becomes responsible for splitting the data if underlying table is in split mode. That's implemented in replica::table for correctness, so if node crashes, the new sstable missing split is still split before added to the set. Fixes #19378. Fixes #19416. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-08-12 17:28:51 -03:00
Raphael S. Carvalho	518677d7f9	replica: Fix comment about compaction group there's not a 1:1 relationship between compaction group count and tablet count. a tablet replica has a storage group instance, which may map to multiple compaction groups during split mode. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-07-12 16:24:51 -03:00
Raphael S. Carvalho	f139aa1df6	replica: remove unused compaction_group_vector Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-07-12 16:16:47 -03:00
Raphael S. Carvalho	c539b7c861	replica: remove rwlock for protecting iteration over storage group map rwlock was added to protect iterations against concurrent updates to the map. the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup). the rwlock is very problematic because it can result in topology changes blocked, as updating token metadata takes the exclusive lock, which is serialized with table wide ops like split / major / explicit flush (and those can take a long time). to get rid of the lock, we can copy the storage group map and guard individual groups with a gate (not a problem since map is expected to have a maximum of ~100 elements). so cleanup can close that gate (carefully closed after stopping individual groups such that migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered by nodetool flush) can skip a group that was closed, as such a group is being migrated out. Check documentation added to compaction_group.hh to understand how concurrent iterations and updates to the map work without the rwlock. Yielding variants that iterate over groups are no longer returning group id since id stability can no longer be guaranteed without serializing split finalization and iteration. Fixes #18821. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-07-09 16:59:24 -03:00
Raphael S. Carvalho	ad5c5bca5f	replica: get rid of fragile compaction group intrusive list It was added to make integration of storage groups easier, but it's complicated since it's another source of truth and we could have problems if it becomes inconsistent with the group map. Fixes #18506. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-07-09 16:53:35 -03:00
Kefu Chai	e87b64b7bb	compaction: not include unused headers these unused includes were identified by clangd. see https://clangd.llvm.org/guides/include-cleaner#unused-include-warning for more details on the "Unused include" warning. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-07-02 14:06:42 +08:00
Avi Kivity	9ecf4ada49	compaction: compaction_group: define destructor out-of-line Define compaction_group::~compaction_group() out-of-line to prevent problems instantiating compaction_group::_table_state, which is an std::unique_ptr. In C++23, std::unique_ptr is constexpr, which means its methods (in this case the destructor) require seeing the definition of the class at the point of instantiation.	2024-06-27 17:54:12 +03:00
Raphael S. Carvalho	6214dda506	replica: return storage_group by reference on storage_group_for*() those functions cannot return nullptr, will throw when group is not found, so better return ref instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-06-12 11:53:06 -03:00
Raphael S. Carvalho	9c1d3bcc02	replica: devirtualize storage_group_of() can be made private to tablet_storage_group_manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-06-12 11:29:49 -03:00
Raphael S. Carvalho	7b41630299	replica: Refresh mutation source when allocating tablet replicas Consider the following: 1) table A has N tablets and views 2) migration starts for a tablet of A from node 1 to 2. 3) migration is at write_both_read_old stage 4) coordinator will push writes to both nodes (pending and leaving) 5) A has view, so writes to it will also result in reads (table::push_view_replica_updates()) 6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in) 7) so read on step 5 is not being able to find sstable set for tablet migrating in Causes the following error: "tablets - SSTable set wasn't found for tablet 21 of table mview.users" which means loss of write on pending replica. The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot. It's not a problem to refresh the cache snapshot as long as the logical state of the data hasn't changed, which is true when allocating new tablet replicas. That's also done in the context of compactions for example. Fixes #19052. Fixes #19033. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#19099	2024-06-11 06:59:04 +03:00
Raphael S. Carvalho	abcc68dbe7	storage_service: Fix race between tablet split and stats retrieval If tablet split is finalized while retrieving stats, the saved erm, used by all shards, will be invalidated. It can either cause incorrect behavior or crash if id is not available. It's worked by feeding local tablet map into the "coordinator" collecting stats from all shards. We will also no longer have a snapshot of erm shared between shards to help intra-node migration. This is simplified by serializing token metadata changes and the retrieval of the stats (latter should complete pretty fast, so it shouldn't block the former for any significant time). Fixes #18085. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-22 09:25:29 -03:00
Raphael S. Carvalho	715ae689c0	Implement fast streaming for intra-node migration With intra-node migration, all the movement is local, so we can make streaming faster by just cloning the sstable set of leaving replica and loading it into the pending one. This cloning is underlying storage specific, but s3 doesn't support snapshot() yet (th sstables::storage procedure which clone is built upon). It's only supported by file system, with help of hard links. A new generation is picked for new cloned sstable, and it will live in the same directory as the original. A challenge I bumped into was to understand why table refused to load the sstable at pending replica, as it considered them foreign. Later I realized that sharder (for reads) at this stage of migration will point only to leaving replica. It didn't fail with mutation based streaming, because the sstable writer considers the shard -- that the sstable was written into -- as its owner, regardless of what sharder says. That was fixed by mimicking this behavior during loading at pending. test: ./test.py --mode=dev intranode --repeat=100 passes. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-16 00:28:47 +02:00
Botond Dénes	afa870a387	Merge 'Some sstable set related improvements' from Raphael "Raph" Carvalho Closes scylladb/scylladb#18616 * github.com:scylladb/scylladb: replica: Make it explicit table's sstable set is immutable replica: avoid reallocations in tablet_sstable_set replica: Avoid compound set if only one sstable set is filled	2024-05-13 14:17:24 +03:00
Raphael S. Carvalho	7faba69f28	replica: Make it explicit table's sstable set is immutable Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-10 11:58:08 -03:00
Raphael S. Carvalho	35a0d47408	replica: Avoid compound set if only one sstable set is filled Most of the time only main set is filled, so we can avoid one layer of indirection (= compound set) when maintenance set is empty. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-10 10:44:34 -03:00
Aleksandra Martyniuk	b4371a0ea0	replica: allocate storage groups dynamically Currently empty storage_groups are allocated for tablets that are not on this shard. Allocate storage groups dynamically, i.e.: - on table creation allocate only storage groups that are on this shard; - allocate a storage group for tablet that is moved to this shard; - deallocate storage group for tablet that is cleaned up. Stop compaction group before it's deallocated. Add a flag to table::cleanup_tablet deciding whether to deallocate sgs and use it in commitlog tests.	2024-05-10 15:08:21 +02:00
Aleksandra Martyniuk	c283746b32	replica: add rwlock to storage_group_manager Add rwlock which prevents storage groups from being added/deleted while some other layers itereates over them (or their compaction groups). Add methods to iterate over storage groups with the lock held.	2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk	8505389963	replica: drop single_compaction_group_if_available Drop single_compaction_group_if_available as it's unused.	2024-05-10 14:56:38 +02:00
Raphael S. Carvalho	9f93dd9fa3	replica: Use flat_hash_map for tablet storage The reason that we want to switch to flat_hash_map is that only a small subset of tablets will be allocated on any given shard, therefore it's wasteful to use a sparse array, and iterations are slow. Also, the map gives greater development flexibility as one doesn't have to worry about empty entries. perf result: -- reads scylla_with_chunked_vector-read-no-tablets.txt median 73223.28 tps ( 62.3 allocs/op, 13.3 tasks/op, 41932 insns/op, 0 errors) median 74952.87 tps ( 62.3 allocs/op, 13.3 tasks/op, 41969 insns/op, 0 errors) median 73016.37 tps ( 62.3 allocs/op, 13.3 tasks/op, 41934 insns/op, 0 errors) median 74078.14 tps ( 62.3 allocs/op, 13.3 tasks/op, 41938 insns/op, 0 errors) median 75323.07 tps ( 62.3 allocs/op, 13.3 tasks/op, 41944 insns/op, 0 errors) scylla_with_hash_map-read-no-tablets.txt median 74963.30 tps ( 62.3 allocs/op, 13.3 tasks/op, 41926 insns/op, 0 errors) median 74032.09 tps ( 62.3 allocs/op, 13.3 tasks/op, 41918 insns/op, 0 errors) median 74850.09 tps ( 62.3 allocs/op, 13.3 tasks/op, 41937 insns/op, 0 errors) median 74239.37 tps ( 62.3 allocs/op, 13.3 tasks/op, 41921 insns/op, 0 errors) median 74798.14 tps ( 62.3 allocs/op, 13.3 tasks/op, 41925 insns/op, 0 errors) scylla_with_chunked_vector-read-tablets-1.txt median 74234.27 tps ( 62.1 allocs/op, 13.3 tasks/op, 41903 insns/op, 0 errors) median 75775.98 tps ( 62.1 allocs/op, 13.3 tasks/op, 41910 insns/op, 0 errors) median 76481.56 tps ( 62.1 allocs/op, 13.2 tasks/op, 41874 insns/op, 0 errors) median 74056.67 tps ( 62.1 allocs/op, 13.3 tasks/op, 41894 insns/op, 0 errors) median 75287.68 tps ( 62.1 allocs/op, 13.3 tasks/op, 41894 insns/op, 0 errors) scylla_with_hash_map-read-tablets-1.txt median 75613.63 tps ( 62.1 allocs/op, 13.2 tasks/op, 41990 insns/op, 0 errors) median 74819.51 tps ( 62.1 allocs/op, 13.2 tasks/op, 41973 insns/op, 0 errors) median 75648.41 tps ( 62.1 allocs/op, 13.3 tasks/op, 42025 insns/op, 0 errors) median 74170.89 tps ( 62.1 allocs/op, 13.2 tasks/op, 42002 insns/op, 0 errors) median 75447.72 tps ( 62.1 allocs/op, 13.3 tasks/op, 41952 insns/op, 0 errors) scylla_with_chunked_vector-read-tablets-128.txt median 73788.57 tps ( 62.1 allocs/op, 13.2 tasks/op, 41956 insns/op, 0 errors) median 76563.63 tps ( 62.1 allocs/op, 13.3 tasks/op, 42006 insns/op, 0 errors) median 75536.12 tps ( 62.1 allocs/op, 13.2 tasks/op, 42005 insns/op, 0 errors) median 74679.17 tps ( 62.1 allocs/op, 13.3 tasks/op, 41958 insns/op, 0 errors) median 75380.95 tps ( 62.1 allocs/op, 13.2 tasks/op, 41946 insns/op, 0 errors) scylla_with_hash_map-read-tablets-128.txt median 75459.99 tps ( 62.1 allocs/op, 13.3 tasks/op, 42055 insns/op, 0 errors) median 74280.11 tps ( 62.1 allocs/op, 13.3 tasks/op, 42085 insns/op, 0 errors) median 74502.61 tps ( 62.1 allocs/op, 13.3 tasks/op, 42063 insns/op, 0 errors) median 74692.27 tps ( 62.1 allocs/op, 13.3 tasks/op, 41994 insns/op, 0 errors) median 75402.64 tps ( 62.1 allocs/op, 13.3 tasks/op, 42015 insns/op, 0 errors) -- writes scylla_with_chunked_vector-write-no-tablets.txt median 68635.17 tps ( 58.4 allocs/op, 13.3 tasks/op, 52709 insns/op, 0 errors) median 68716.36 tps ( 58.4 allocs/op, 13.3 tasks/op, 52691 insns/op, 0 errors) median 68512.76 tps ( 58.4 allocs/op, 13.3 tasks/op, 52721 insns/op, 0 errors) median 68606.14 tps ( 58.4 allocs/op, 13.3 tasks/op, 52696 insns/op, 0 errors) median 68619.25 tps ( 58.4 allocs/op, 13.3 tasks/op, 52697 insns/op, 0 errors) scylla_with_hash_map-write-no-tablets.txt median 67678.10 tps ( 58.4 allocs/op, 13.3 tasks/op, 52723 insns/op, 0 errors) median 67966.06 tps ( 58.4 allocs/op, 13.3 tasks/op, 52736 insns/op, 0 errors) median 67881.47 tps ( 58.4 allocs/op, 13.3 tasks/op, 52743 insns/op, 0 errors) median 67856.81 tps ( 58.4 allocs/op, 13.3 tasks/op, 52730 insns/op, 0 errors) median 67812.58 tps ( 58.4 allocs/op, 13.3 tasks/op, 52740 insns/op, 0 errors) scylla_with_chunked_vector-write-tablets-1.txt median 67741.83 tps ( 58.4 allocs/op, 13.3 tasks/op, 53425 insns/op, 0 errors) median 68014.20 tps ( 58.4 allocs/op, 13.3 tasks/op, 53455 insns/op, 0 errors) median 68228.48 tps ( 58.4 allocs/op, 13.3 tasks/op, 53447 insns/op, 0 errors) median 67950.96 tps ( 58.4 allocs/op, 13.3 tasks/op, 53443 insns/op, 0 errors) median 67832.69 tps ( 58.4 allocs/op, 13.3 tasks/op, 53462 insns/op, 0 errors) scylla_with_hash_map-write-tablets-1.txt median 66873.70 tps ( 58.4 allocs/op, 13.3 tasks/op, 53548 insns/op, 0 errors) median 67568.23 tps ( 58.4 allocs/op, 13.3 tasks/op, 53547 insns/op, 0 errors) median 67653.70 tps ( 58.4 allocs/op, 13.3 tasks/op, 53525 insns/op, 0 errors) median 67389.21 tps ( 58.4 allocs/op, 13.3 tasks/op, 53536 insns/op, 0 errors) median 67437.91 tps ( 58.4 allocs/op, 13.3 tasks/op, 53537 insns/op, 0 errors) scylla_with_chunked_vector-write-tablets-128.txt median 67115.41 tps ( 58.3 allocs/op, 13.3 tasks/op, 53341 insns/op, 0 errors) median 66836.07 tps ( 58.3 allocs/op, 13.3 tasks/op, 53342 insns/op, 0 errors) median 67214.07 tps ( 58.3 allocs/op, 13.3 tasks/op, 53303 insns/op, 0 errors) median 67198.25 tps ( 58.3 allocs/op, 13.3 tasks/op, 53347 insns/op, 0 errors) median 67368.78 tps ( 58.3 allocs/op, 13.3 tasks/op, 53374 insns/op, 0 errors) scylla_with_hash_map-write-tablets-128.txt median 66273.50 tps ( 58.3 allocs/op, 13.3 tasks/op, 53400 insns/op, 0 errors) median 66564.89 tps ( 58.3 allocs/op, 13.3 tasks/op, 53432 insns/op, 0 errors) median 66568.52 tps ( 58.3 allocs/op, 13.3 tasks/op, 53408 insns/op, 0 errors) median 66368.00 tps ( 58.3 allocs/op, 13.3 tasks/op, 53441 insns/op, 0 errors) median 66293.55 tps ( 58.3 allocs/op, 13.3 tasks/op, 53408 insns/op, 0 errors) Fixes #18010. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18093	2024-04-04 16:25:48 +03:00
Raphael S. Carvalho	29f9f7594f	replica: Kill table::storage_group_id_for_token() storage_group_id_for_token() was only needed from within tablet_storage_group_manager, so we can kill table::storage_group_id_for_token(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18134	2024-04-02 09:23:23 +03:00
Raphael S. Carvalho	2c9b13d2d1	compaction: Check for key presence in memtable when calculating max purgeable timestamp It was observed that some use cases might append old data constantly to memtable, blocking GC of expired tombstones. That's because timestamp of memtable is unconditionally used for calculating max purgeable, even when the memtable doesn't contain the key of the tombstone we're trying to GC. The idea is to treat memtable as we treat L0 sstables, i.e. it will only prevent GC if it contains data that is possibly shadowed by the expired tombstone (after checking for key presence and timestamp). Memtable will usually have a small subset of keys in largest tier, so after this change, a large fraction of keys containing expired tombstones can be GCed when memtable contains old data. Fixes #17599. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17835	2024-03-18 13:37:44 +02:00
Benny Halevy	0745865914	storage_group_manager: add make_sstable_set Move the responsibility for preparing the table_set covering all sstables in the table to the storage_group_manager so it can specialize the sstable_set. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-03-06 10:35:36 +02:00
Benny Halevy	7f203f0551	table: move compaction_group_list and storage_group_vector to storage_group_manager So the storage_group_manager can be used later by table_sstable_set. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-03-06 10:35:33 +02:00
Benny Halevy	bfe13daed4	compaction_group, table: make_compound_sstable_set: declare as const It does not modify the compaction_group/table respectively. This is required by the next patch. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-03-06 10:15:34 +02:00
Benny Halevy	d7b1851449	tablet_storage_group_manager: precalculate my_host_id and _tablet_map The node host_id never changes, so get it once, when the object is constructed. A pointer to the tablet_map is taken when constructed using the effective_replication_map and it is updated whenever the e_r_m changes, using a newly added `update_effective_replication_map` method. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-03-06 10:15:34 +02:00

1 2

88 Commits