scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 15:52:13 +00:00

Author	SHA1	Message	Date
Wojciech Mitros	5920647617	mv: remove queue length limit from the view update read concurrency semaphore Each view update is correlated to a write that generates it (aside from view building which is throttled separately). These writes are limited by a throttling mechanism, which effectively works by performing the writes with CL=ALL if ongoing writes exceed some memory usage limit When writes generate view updates, they usually also need to perform a read. This read goes through a read concurrency semaphore where it can get delayed or killed. The semaphore allows up to 100 concurrent reads and puts all remaining reads in a queue. If the number of queued reads exceeds a specific limit, the view update will fail on the replica, causing inconsistencies. This limit is not necessary. When a read gets queued on the semaphore, the write that's causing the view update is paused, so the write takes part in the regular write throttling. If too many writes get stuck on view update reads, they will get throttled, so their number is limited and the number of queued reads is also limited to the same amount. In this patch we remove the specified queue length limit for the view update read concurrency semaphore. Instead of this limit, the queue will be now limited indirectly, by the base write throttling mechanism. This may allow the queue grow longer than with the previous limit, but it shouldn't ever cause issues - we only perform up to 100 actual reads at once, and the remaining ones that get queued use a tiny amount of memory, less than the writes that generated them and which are getting limited directly. Fixes https://github.com/scylladb/scylladb/issues/23319 Closes scylladb/scylladb#24112	2025-05-14 18:29:30 +03:00
Pavel Emelyanov	ed3ce0f6af	distributed_loader: Don't create owned ranges if skip-cleanup is true In order to make reshard compaction task run cleanup, the owner-ranges pointer is passed to it. If it's nullptr, the cleanup is not performed. So to do the skip-cleanup, the easiest (but not the most apparent) way is not to initialize the pointer and keep it nullptr. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-13 16:52:15 +03:00
Pavel Emelyanov	4ab049ac8d	code: Push bool skip_cleanup flag around Just put the boolean into the callstack between API and distributed loader to reduce the churn in the next patches. No functional changes, flag is false and unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-05-13 16:51:21 +03:00
Avi Kivity	5e764d1de2	Merge 'Drop v2 and flat from reader and related names' from Botond Dénes Following a number of similar code cleanup PR, this one aims to be the last one, definitely dropping flat from all reader and related names. Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant). The changes in this PR are entirely mechanical, mostly just search-and-replace. Code cleanup, no backport required. Closes scylladb/scylladb#24087 * github.com:scylladb/scylladb: test/boost/mutation_reader_another_test: drop v2 from reader and related names test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/ test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/ test/boost/mutation_test: s/consumer_v2/consumer/ test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/ readers/mutation_readers: s/generating_reader_v2/generating_reader/ readers/mutation_readers: s/delegating_reader_v2/delegating_reader/ readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/ readers/mutation_source: s/make_reader_v2/make_mutation_reader/ readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/ readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/ mutation/mutation_compactor: drop v2 from compactor and related names replica/table: s/make_reader_v2/make_mutation_reader/ mutation_writer: s/bucket_writer_v2/bucket_writer/ readers/queue: drop v2 from reader and related names readers/multishard: drop v2 from reader and related names readers/evictable: drop v2 from reader and related names readers/multi_range: remove flat from name	2025-05-11 22:22:35 +03:00
Botond Dénes	674d41e3e6	readers/mutation_source: s/make_reader_v2/make_mutation_reader/	2025-05-09 07:53:29 -04:00
Botond Dénes	7af0690762	mutation/mutation_compactor: drop v2 from compactor and related names	2025-05-09 07:53:29 -04:00
Botond Dénes	b5170e27d0	replica/table: s/make_reader_v2/make_mutation_reader/	2025-05-09 07:53:29 -04:00
Botond Dénes	ca7f557e86	readers/multishard: drop v2 from reader and related names	2025-05-09 07:53:29 -04:00
Botond Dénes	7ba3c3fec3	readers/multi_range: remove flat from name	2025-05-09 07:53:25 -04:00
Raphael S. Carvalho	28056344ba	replica: Fix take_storage_snapshot() running concurrently to merge completion Some background: When merge happens, a background fiber wakes up to merge compaction groups of sibling tablets into main one. It cannot happen when rebuilding the storage group list, since token metadata update is not preemptable. So a storage group, post merge, has the main compaction group and two other groups to be merged into the main. When the merge happens, those two groups are empty and will be freed. Consider this scenario: 1) merge happens, from 2 to 1 tablet 2) produces a single storage group, containing main and two other compaction groups to be merged into main. 3) take_storage_snapshot(), triggered by migration post merge, gets a list of pointer to all compaction groups. 4) t__s__s() iterates first on main group, yields. 5) background fiber wakes up, moves the data into main and frees the two groups 6) t__s__s() advances to other groups that are now freed, since step 5. 7) segmentation fault In addition to memory corruption, there's also a potential for data to escape the iteration in take_storage_snapshot(), since data can be moved across compaction groups in background, all belonging to the same storage group. That could result in data loss. Readers should all operate on storage group level since it can provide a view on all the data owned by a tablet replica. The movement of sstable from group A to B is atomic, but iteration first on A, then later on B, might miss data that was moved from B to A, before the iteration reached B. By switching to storage group in the interface that retrieves groups by token range, we guarantee that all data of a given replica can be found regardless of which compaction group they sit on. Fixes #23162. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24058	2025-05-09 14:07:06 +03:00
Botond Dénes	e5d944f986	Merge 'replica: Fix use-after-free with concurrent schema change and sstable set update' from Raphael Raph Carvalho When schema is changed, sstable set is updated according to the compaction strategy of the new schema (no changes to set are actually made, just the underlying set type is updated), but the problem is that it happens without a lock, causing a use-after-free when running concurrently to another set update. Example: 1) A: sstable set is being updated on compaction completion 2) B: schema change updates the set (it's non deferring, so it happens in one go) and frees the set used by A. 3) when A resumes, system will likely crash since the set is freed already. ASAN screams about it: SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ... Fix is about deferring update of the set on schema change to compaction, which is triggered after new schema is set. Only strategy state and backlog tracker are updated immediately, which is fine since strategy doesn't depend on any particular implementation of sstable set. Fixes #22040. Closes scylladb/scylladb#23680 * github.com:scylladb/scylladb: replica: Fix use-after-free with concurrent schema change and sstable set update sstables: Implement sstable_set_impl::all_sstable_runs()	2025-05-08 06:56:16 +03:00
Botond Dénes	0a9ca52cfd	replica/database: memtable_list: save ref to memtable_table_shared_data This is passed by reference to the constructor, but a copy is saved into the _table_shared_data member. A reference to this member is passed down to all memtable readers. Because of the copy, the memtable readers save a reference to the memtable_list's member, which goes away together with the memtable_list when the storage_group is destroyed. This causes use-after-free when a storage group is destroyed while a memtable read is still ongoing. The memtable reader keeps the memtable alive, but its reference to the memtable_table_shared_data becomes stale. Fix by saving a reference in the memtable_list too, so memtable readers receive a reference pointing to the original replica::table member, which is stable accross tablet migrations and merges. The copy was introduced by `2a76065e3d`. There was a copy even before this commit, but in the previous vnode-only world this was fine -- there was one memtable_list per table and it was around until the table itself was. In the tablet world, this is no longer given, but the above commit didn't account for this. A test is included, which reproduces the use-after-free on memtable migration. The test is somewhat artificial in that the use-after-free would be prevented by holding on to an ERM, but this is done intentionaly to keep the test simple. Migration -- unlike merge where this use-after-free was originally observed -- is easy to trigger from unit tests. Fixes: #23762 Closes scylladb/scylladb#23984	2025-05-06 22:13:17 +03:00
Raphael S. Carvalho	434c2c4649	replica: Fix use-after-free with concurrent schema change and sstable set update When schema is changed, sstable set is updated according to the compaction strategy of the new schema (no changes to set are actually made, just the underlying set type is updated), but the problem is that it happens without a lock, causing a use-after-free when running concurrently to another set update. Example: 1) A: sstable set is being updated on compaction completion 2) B: schema change updates the set (it's non deferring, so it happens in one go) and frees the set used by A. 3) when A resumes, system will likely crash since the set is freed already. ASAN screams about it: SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ... Fix is about deferring update of the set on schema change to compaction, which is triggered after new schema is set. Only strategy state and backlog tracker are updated immediately, which is fine since strategy doesn't depend on any particular implementation of sstable set, since patch "sstables: Implement sstable_set_impl::all_sstable_runs()". Fixes #22040. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-05-06 10:06:55 -03:00
Raphael S. Carvalho	c77f710a0c	sstables: Fix quadratic space complexity in partitioned_sstable_set Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was memtable flush storm, which creates many small "L0" sstables that spans roughly the entire token range. Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried. And for range scans, there can be an increase in memory usage, but not significant because the sstables span an wide range and would have been selected in the combined reader if the range of scan overlaps with them. Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly. Fixes #23634. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	21d1e78457	compaction: Wire table_state into make_sstable_set() This will be useful for feeding token range owned by compaction group into sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	59dad2121f	compaction: Introduce token_range() to table_state This provides a way for compaction layer to know compaction group's token range. It will be important for sstable set impl to know the token range of underlying group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Nadav Har'El	262530f27c	Merge 'mv: make base_info in view schemas immutable' from Wojciech Mitros Currently, the base_info may or may not be set in view schemas. Even when it's set, it may be modified. This necessitates extra checks when handling view schemas, as we'll as potentially causing errors when we forget to set it at some point. Instead, we want to make the base info an immutable member of view schemas (inside view_info). To achieve this, in this series we remove all base_info members that can change due to a base schema update, and we calculate the remaining values during view update generation, using the most up-to-date base schema version. To calculate the values that depend on the base schema version, we need to iterate over the view primary key and find the corresponding columns, which adds extra overhead for each batch of view updates. However, this overhead should be relatively small, as when creating a view update, we need to prepare each of its columns anyway. And if we need to read the old value of the base row, the relative overhead is even lower. After this change, the base info in view schemas stays the same for all base schema updates, so we'll no longer get issues with base_info being incompatible with a base schema version. Additionally, it's a step towards making the schema objects immutable, which we sometimes incorrectly assumed in the past (they're still not completely immutable yet, as some other fields in view_info other than base_info are initialized lazily and may depend on the base schema version). Fixes https://github.com/scylladb/scylladb/issues/9059 Fixes https://github.com/scylladb/scylladb/issues/21292 Fixes https://github.com/scylladb/scylladb/issues/22194 Fixes https://github.com/scylladb/scylladb/issues/22410 Closes scylladb/scylladb#23337 * github.com:scylladb/scylladb: test: remove flakiness from test_schema_is_recovered_after_dying mv: add a test for dropping an index while it's building base_info: remove the lw_shared_ptr variant view_info: don't re-set base_info after construction base_info: remove base_info snapshot semantics base_info: remove base schema from the base_info schema_registry: store base info instead of base schema for view entries base_info: make members non-const view_info: move the base info to a separate header view_info: move computation of view pk columns not in base pk to view_updates view_info: move base-dependent variables into base_info view_info: set base info on construction	2025-04-27 19:12:12 +03:00
Wojciech Mitros	d7bd86591e	view_info: don't re-set base_info after construction In the previous commits we made sure that the base info is not dependent on the base schema version, and the info dependent on the base schema version is calculated when it's needed. In this patch we remove the unnecessary re-setting of the base_info. The set_base_info method isn't removed completely, because it also has a secondary function - zeroing the view_info fields other than base_info. Because of this, in this patch we rename it accordingly and limit its use to the updates caused by a base schema change.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	ea462efa3d	base_info: remove base_info snapshot semantics The base info in view schemas no longer changes on base schema updates, so saving the base info with a view schema from a specific point in time doesn't provide any additional benefits. In this patch we remove the code using the base_and_view snapshots as it's no longer useful.	2025-04-24 01:08:40 +02:00
Wojciech Mitros	05fce91945	schema_registry: store base info instead of base schema for view entries In the following patch we plan to remove the base schema from the base_info to make the base_info immutable. To do that, we first prepare the schema registry for the change; we need to be able to create view schemas from frozen schemas there and frozen schemas have no information about the base table. Unless we do this change, after base schemas are removed from the base info, we'll no longer be able to load a view schema to the schema registry without looking up the base schema in the database. This change also required some updates to schema building: * we add a method for unfreezing a view schema with base info instead of a base schema * we make it possible to use schema_builder with a base info instead of a base schema * we add a method for creating a view schema from mutations with a base info instead of a base schema * we add a view_info constructor withat base info instead of a base schema * we update the naming in schema_registry to reflect the usage of base info instead of base schema	2025-04-24 01:08:39 +02:00
Aleksandra Martyniuk	c1618c7de5	test: test table drop during flush	2025-04-23 14:29:28 +02:00
Aleksandra Martyniuk	91b57e79f3	replica: skip flush of dropped table	2025-04-23 14:29:28 +02:00
Nadav Har'El	6db666a1c1	replica: fix 10-second pause during shutdown As noticed in issue #23687, if we shut down Scylla while a paged read is in progress - or even a paged read that the client had no intention of ever resume it - the shutdown pauses for 10 seconds. The problem was the stop() order - we must stop the "querier cache" before we can close sstables - the "querier cache" is what holds paged readers alive waiting for clients to resume those reads, and while a reader is alive it holds on to sstables so they can't be closed. The querier cache's querier_cache::default_entry_ttl is set to 10 seconds, which is why the shutdown was un-paused after 10 seconds. This fix in this patch is obvious: We need to stop the querier cache (and have it release all the readers it was holding) before we close the sstables. Fixes #23687 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23770	2025-04-16 20:35:44 +03:00
Avi Kivity	0206da5232	Merge 'readers: strip "flat" and "v2" from names' from Botond Dénes Continue the effort of normalizing reader names, stripping legacy qualifying terms like "flat" and "v2". Flat and v2 readers are the default now, we only need to add qualifying terms to readers which are different than the normal. One such reader remains: `make_generating_reader_v1()`. This PR contains mostly mechanical changes, done with a sed script. Commits which only contain such mechanical renames are marked as such in the commitlog. Code cleanup, no backport needed. Closes scylladb/scylladb#23767 * github.com:scylladb/scylladb: readers: mv reversing_v2.hh reversing.hh readers: mv generating_v2.hh generating.hh tree: s/make_generating_reader_v2/make_generating_reader/ readers: mv from_mutations_v2.hh from_mutations.hh tree: s/make_mutation_reader_from_mutations_v2/make_mutation_reader_from_mutations/s readers: mv from_fragments_v2.hh from_fragments.hh readers: mv forwardable_v2.hh forwardable.hh readers: mv empty_v2.hh empty.hh tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/ readers/empty_v2.hh: replace forward declarations with include of fwd header readers/mutation_reader_fwd.hh: forward declare reader_permit readers: mv delegating_v2.hh delegating.hh readers/delegating_v2.hh: move reader definition to _impl.hh file	2025-04-16 20:21:51 +03:00
Botond Dénes	6172ff501f	readers: mv reversing_v2.hh reversing.hh Completely mechanical change.	2025-04-16 04:46:08 -04:00
Botond Dénes	f1bd2553ed	readers: mv forwardable_v2.hh forwardable.hh Completely mechanical change.	2025-04-16 04:33:50 -04:00
Botond Dénes	a9d75c4f9d	readers: mv empty_v2.hh empty.hh Completely mechanical change.	2025-04-16 04:32:56 -04:00
Botond Dénes	05829f98f3	tree: s/make_empty_flat_reader_v2/make_empty_mutation_reader/ Completely mechanical change.	2025-04-16 04:32:56 -04:00
Botond Dénes	f5125ffa18	Merge 'Ensure raft group0 RPCs use the gossip scheduling group.' from Sergey Zolotukhin Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore. Fixes scylladb/scylladb#21637 Backport: 6.2 and 6.1 Closes scylladb/scylladb#22779 * github.com:scylladb/scylladb: Ensure raft group0 RPCs use the gossip scheduling group Move RAFT operations verbs to GOSSIP group.	2025-04-16 09:11:29 +03:00
Sergey Zolotukhin	e05c082002	Ensure raft group0 RPCs use the gossip scheduling group Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group. For Raft group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This commit adds a check to ensure that the raft group0 RPCs are executed with the `gossiper` scheduling group.	2025-04-14 17:10:46 +02:00
Benny Halevy	e1fe82ed33	utils: phased_barrier, pluggable: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:47:00 +03:00
Benny Halevy	7342a57cbb	replica: table: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Benny Halevy	52e1ce7f0d	replica: compaction_group, storage_group: use named gate Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-12 11:28:49 +03:00
Pavel Emelyanov	d9853efa7c	Merge '[Out-of-space prevention] db: backup: prioritize sstables that were deleted from the table' from Benny Halevy The motivation behind this change to free up disk space as early as possible. The reason is that snapshot locks the space of all SSTables in the snapshot, and deleting form the table, for example, by compaction, or tablet migration, won't free-up their capacity until they are uploaded to object storage and deleted from the snapshot. This series adds prioritization of deleted sstables in two cases: First, after the snapshot dir is processed, the list of SSTable generation is cross-referenced with the list of SSTables presently in the table and any generation that is not in the table is prioritized to be uploaded earlier. In addition, a subscription mechanism was added to sstables_manager and it is used in backup to prioritize SSTables that get deleted from the table directory during backup. This is particularly important when backup happens during high disk utilization (e.g. 90%). Without it, even if the cluster is scaled up and tablets are migrated away from the full nodes to new nodes, tablet cleanup might not free any space if all the tablet sstables are hardlinked to the snapshot taken for backup. * Enhancement, no backport needed Closes scylladb/scylladb#23241 * github.com:scylladb/scylladb: db: snapshot: backup_task: prioritize sstables deleted during upload sstables_manager: add subscriptions db: snapshot: backup_task: limit concurrency sstables: directory_semaphore: expose get_units db: snapshot: backup_task: add sharded sstables_manager database: expose get_sstables_manager(schema) db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table db: snapshot-ctl: pass table_id to backup_task db: snapshot-ctl: expose sharded db() getter db: snapshot: backup_task: do_backup: organize components by sstable generation db: snapshot: coroutinize backup_task db: snapshot: backup_task: refactor backup_file out of uploads_worker db: snapshot: backup_task: refactor uploads_worker out of do_backup db: snapshot: backup_task: process_snapshot_dir: initialize total progress utils/s3: upload_progress: init members to 0 db: snapshot: backup_task: do_backup: refactor process_snapshot_dir db: snapshot: backup_task: keep expection as member	2025-04-09 15:32:11 +03:00
Benny Halevy	b270d552fb	database: expose get_sstables_manager(schema) Return either the system or use sstables manager. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-04-09 08:54:07 +03:00
Tomasz Grabiec	06b49bdf69	Merge 'row_cache: don't garbage-collect tombstones which cover data in memtables' from Botond Dénes The row cache can garbage-collect tombstones in two places: 1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it; 2) During reads - reads now compact data including garbage collection; In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables. This PR includes fixes for (2), which were not handled at all currently. (1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included. Fixes: https://github.com/scylladb/scylladb/issues/23291 Fixes: https://github.com/scylladb/scylladb/issues/23252 The fix will need backport to all live release. Closes scylladb/scylladb#23255 * github.com:scylladb/scylladb: test/boost/row_cache_test: add memtable overlap check tests replica/table: add error injection to memtable post-flush phase utils/error_injection: add a way to set parameters from error injection points test/cluster: add test_data_resurrection_in_memtable.py test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts replica/mutation_dump: don't assume cells are live replica/database: do_apply() add error injection point replica: improve memtable overlap checks for the cache replica/memtable: add is_merging_to_cache() db/row_cache: add overlap-check for cache tombstone garbage collection mutation/mutation_compactor: copy key passed-in to consume_new_partition()	2025-04-08 17:26:58 +02:00
Raphael S. Carvalho	0f59deffaa	replica: Fix truncate and drop table after tablet migration happens When running those operations after a tablet replica is migrated away from a shard, an assert can fail resulting in a crash. Status quo (around the assert in truncate procedure): 1) Highest RP seen by table is saved in low_mark, and the current time in low_mark_at. 2) Then compaction is disabled in order to not mix data written before truncate, and data written later. 3) Then memtable is flushed in order for the data written before truncate to be available in sstables and then removed. 4) Now, current time is saved in truncated_at, which is supposedly the time of truncate to decide which sstables to remove. Note: truncated_at is likely above low_mark_at due to steps 2 and 3. The interesting part of the assert is: (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp) Note: RP in the assert above is the highest RP among all sstables generated before truncated_at. RP is retrieved by table::discard_sstables(). If truncated_at > low_mark_at, maybe newer data was written during steps 2 and 3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with RP > low_mark. So assert's 2nd condition is there to defend against the scenario above. truncated_at and low_mark_at uses millisecond granularity, so even if truncated_at == low_mark_at, data could have been written in steps 2 and 3 (during same MS window), failing the assert. This is fragile. Reproducer: To reproduce the problem, truncated_at must be > low_mark_at, which can easily happen with both drop table and truncate due to steps 2 and 3. If a shard has 2 or more tablets, the table's highest RP refer to just one tablet in that shard. If the tablet with the highest RP is migrated away, then the sstables in that shard will have lower RP than the recorded highest RP (it's a table wide state, which makes sense since CL is shared among tablets). So when either drop table or truncate runs, low_mark will be potentially bigger than highest RP retrieved from sstables. Proposed solution: The current assert is hacked to not fail if writes sneak in, during steps 2 and 3, but it's still fragile and seems not to serve its real purpose, since it's allowing for RP > low_mark. We should be able to say that low_mark >= RP, as a way of asserting we're not leaving data targeted by truncate behind (or that we're not removing the wrong data). But the problem is that we're saving low_mark in step 1, before preparation steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying all data written so far is targeted for removal. But as of today, low_mark refers to all data written up to step 1. So low_mark is now only one set before issuing flush, and also accounts for all potentially flushed data. Fixes #18059. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23560	2025-04-08 07:32:58 +03:00
Botond Dénes	6c1f6427b3	replica/table: add error injection to memtable post-flush phase After the memtable was flushed to disk, but before it is merged to cache. The injection point will only active for the table specified in the "table_name" injection parameter.	2025-04-08 00:11:36 -04:00
Botond Dénes	df09b3f970	replica/mutation_dump: don't assume cells are live Currently the dumper unconditionally extracts the value of atomic cells, assuming they are live. This doesn't always hold of course and attempting to get the value of a dead cell will lead to marshalling errors. Fix by checking is_live() before attempting to get the cell value. Fix for both regular and collection cells.	2025-04-08 00:11:36 -04:00
Botond Dénes	cb76cafb60	replica/database: do_apply() add error injection point So writes (to user tables) can be failed on a replica, via error injection. Should simplify tests which want to create differences in what writes different replicas receive.	2025-04-08 00:11:35 -04:00
Botond Dénes	d126ea09ba	replica: improve memtable overlap checks for the cache The current memtable overlap check that is used by the cache -- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only checks the active memtable, so memtables which are either being flushed or are already flushed and also have active reads against them do not participate in the overlap check. This can result in temporary data resurrection, where a cache read can garbage-collect a tombstone which still covers data in a flushing or flushed memtable, which still have active read against it. To prevent this, extend the overlap check to also consider all of the memtable list. Furthermore, memtable_list::erase() now places the removed (flushed) memtable in an intrusive list. These entries are alive only as long as there are readers still keeping an `lw_shared_ptr<memtable>` alive. This list is now also consulted on overlap checks.	2025-04-08 00:11:35 -04:00
Botond Dénes	7e600a0747	replica/memtable: add is_merging_to_cache() And set it when the memtable is merged to cache.	2025-04-08 00:11:35 -04:00
Botond Dénes	6b5b563ef7	db/row_cache: add overlap-check for cache tombstone garbage collection The cache should not garbage-collect tombstone which cover data in the memtable. Add overlap checks (get_max_purgeable) to garbage collection to detect tombstones which cover data in the memtable and to prevent their garbage collection.	2025-04-08 00:11:35 -04:00
Lakshmi Narayanan Sreethar	750f4baf44	replica/table::do_apply : do not check for async gate's closure The `table::do_apply()` method verifies if the compaction group's async gate is open to determine if the compaction group is active. Closing this async gate prevents any new operations but waits for existing holders to exit, allowing their operations to complete. When holding a gate, holders will observe the gate as closed when it is being closed, but this is irrelevant as they are already inside the gate and are allowed to complete. All the callers of `table::do_apply()` already enter the gate before calling the method. So, the async gate check inside `table::do_apply()` will erroneously throw an exception when the compaction group is closing despite holding the gate. This commit removes the check to prevent this from happening. Fixes #23348 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#23579	2025-04-07 13:27:22 +03:00
Pavel Emelyanov	10376b5b85	db: Re-use database::snapshot_table_on_all_shards() There are two snapshot-on-all-shards methods on the database -- the one that snapshots a keyspace and the one that snapshots a vector of tables. The latter snapshots a single table with a neat helper, while the former has the helper open-coded. Re-using the helper in keyspace snapshot is worth it, but needs to patch the helper to work on uuid, rather than ks:cf pair of strings. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23532	2025-04-07 11:55:43 +02:00
Botond Dénes	a0d8102a1f	replica/memtable: s/make_flat_reader/make_mutation_reader/ Following the recent refactoring of removing "flat" and "v2" from reader names, replacing all the fully qualified names with simply "mutation_reader". Closes scylladb/scylladb#23346	2025-04-01 17:58:13 +03:00
Avi Kivity	69684e16d8	Merge 'sstables: add SSTable compression with shared dictionaries ' from Michał Chojnowski This PR extends Scylla's SSTable compression with the ability to use compression dictionaries shared across compression chunks. This involves several changes: - We refactor `compression_parameters` and friends (`compressor`, `sstables::local_compression`, `sstables::compression`) to prepare for making the construction of `compressor`s asynchronous, to enable sharing pieces of compressors (the dictionaries) across shards. - We introduce the notion of "hidden compression options" which are written to `CompressionInfo.db` and used to construct decompressors, like regular options, but don't appear in the schema. (We later stuff the SSTable's dictionary into `CompressionInfo.db` using a sequence of such options). - We add a cluster feature which guards the creation of dictionary-compressed SSTables. - We introduce a central "compressor factory" (one instance shared by all shards), which from this point onward is used to construct all `compressor` objects (one per SSTable) used to process the SSTables. When constructing a compressor for writing, it uses the "current"/"recommended" dictionary (which is passed to the factory from the actively-observed contents of the group0-managed `system.dicts`). When constructing a compressor for reading, it uses the dictionary written in the hidden compression options in CompressionInfo.db. And it keeps dictionaries deduplicated, so that each unique live dictionary blob has only one instance in memory, shared across shards. - We teach the relevant `lz4` and `zstd` compressor wrappers about the dictionaries. - We add a HTTP API call which samples pieces of the given table (i.e. the Data.db files) from across the cluster, trains a dictionary on it, and publishes it via `system.dicts` as the new current dictionary for that table. (And we add some RPC verbs to support that). - We add a HTTP API call which estimates the impact of various available compression configurations on the compression ratio. - We add an autotrainer fiber which periodically retrains dicts for dict-aware tables and publishes them if they seem to be a significant improvement. Known imperfections: - The factory currently keeps one dictionary instance on the entire node, but we probably want one copy per NUMA node. I didn't do that because exposing NUMA knowledge to Scylla seems to require some changes in Seastar first. New feature, no backporting involved. Closes scylladb/scylladb#23025 * github.com:scylladb/scylladb: docs: add user-facing documentation for SSTable compression with shared dicts docs/dev: add sstable-compression-dicts.md test: add test_sstable_compression_dictionaries_autotrain.py test: add test_sstable_compression_dictionaries_basic.py test/pylib/rest_client: add `keyspace_upgrade_sstables` helper main: run a sstable_dict_autotrainer api: add the estimate_compression_ratios API call dict_autotrainer: introduce sstable_dict_autotrainer db/system_keyspace: add query_dict_timestamp compress: add ZstdWithDictsCompressor and LZ4WithDictsCompressor main: clean up sstable compression dicts after table drops sstables/compress: discard hidden compression options after the decompressor is created compress: change compressor_ptr from shared_ptr to unique_ptr api: add the retrain_dict API call storage_service: add some dict-related routines main: in compression_dict_updated_callback, recognize and use SSTable compression dicts storage_service: add do_sample_sstables() messaging_service: add SAMPLE_SSTABLES and ESTIMATE_SSTABLE_VOLUME verbs db/system_keyspace: let `system.dicts` helpers be used for dicts other than the RPC compression dict raft/group0_state_machine: on `system.dicts` mutations, pass the affected partitition keys to the callback database: add sample_data_files() database: add take_sstable_set_snapshot() compress: teach `lz4_processor` about dictionaries compress: teach `zstd_processor` about dictionaries sstables: delegate compressor creation to the compressor factory sstables: plug an `sstable_compressor_factory` into `sstables_manager` sstables: introduce sstable_compressor_factory utils/hashers: add get_sha256() gms/feature_service: add the SSTABLE_COMPRESSION_DICTS cluster feature compress: add hidden dictionary options compress: remove `compression_parameters::get_compressor()` sstables/compress: remove get_sstable_compressor() sstables/compress: move ownership of `compressor` to `sstable::compression` compress: remove compressor::option_names() compress: clean up the constructor of zstd_processor compress: squash zstd.cc into compress.cc sstables/compress: break the dependency of `compression_parameters` on `compressor` compress.hh: switch compressor::name() from an instance member to a virtual call bytes: adapt fmt_hex to std::span<const std::byte>	2025-04-01 12:47:34 +03:00
Pavel Emelyanov	b5a124f60c	sstable_directory: Move highest_generation_seen() to distributed_loader.cc This method is only used by the loader code (and tests). Also, There's the highest_version_seen() peer that sits in the loader code either. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#23324	2025-04-01 09:15:14 +03:00
Michał Chojnowski	d920ab5366	database: add sample_data_files() Add a helper for sampling the Data files for a given table. We will use it to take samples for dictionary training.	2025-04-01 00:07:29 +02:00
Michał Chojnowski	48c06c7e4b	database: add take_sstable_set_snapshot() We want a method that will allow us to take a stable snapshot of SSTables, to asynchronously compute some stats on them. But `take_storage_snapshot` is overly invasive for that, because it flushes memtables on each call. (If `take_storage_snapshot` was, for example, called repetitively, it could create a ton of small memtables and lead to trouble). This commit adds a weaker version which only takes a snapshot of existing SSTables, and doesn't flush memtables by itself. This will be useful for dictionary training, which doesn't care about the semantics of SSTables, only their rough statistical properties.	2025-04-01 00:07:28 +02:00

1 2 3 4 5 ...

1523 Commits