scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-12 19:02:12 +00:00

Author	SHA1	Message	Date
Aleksandra Martyniuk	5297084bd1	replica: database: change type of tables_metadata::_ks_cf_to_uuid If there are a lot of tables, a node reports an oversized allocation in _ks_cf_to_uuid of type flat_hash_map. Change the type to std::unordered_map to prevent oversized allocations. Fixes: https://github.com/scylladb/scylladb/issues/26787. Closes scylladb/scylladb#27165 (cherry picked from commit `19a7d8e248`) Closes scylladb/scylladb#27192	2025-12-03 12:19:12 +03:00
Calle Wilund	9631beeafd	system_keyspace: Prune dropped tables from truncation on start/drop Fixes #25683 Once a table drop is complete, there should be no reason to retain truncation records for it, as any replay should skip mutations anyway (no CF), and iff we somehow resurrect a dropped table, this replay-resurrected data is the least problem anyway. Adds a prune phase to the startup drop_truncation_rp_records run, which ignores updating, and instead deletes records for non-existent tables (which should patch any existing servers with lingering data as well). Also does an explicit delete of records on actual table DROP, to ensure we don't grow this table more than needed even in long uptime nodes. Small unit test included. Closes scylladb/scylladb#25699 (cherry picked from commit `bc20861afb`) Closes scylladb/scylladb#25811	2025-09-04 08:41:30 +03:00
Dawid Mędrek	8ba65e4014	main: Log RF-rack-invalid keyspaces at startup When the configuration option `rf_rack_valid_keyspaces` is enabled and there is an RF-rack-invalid keyspace, starting a node fails. However, when the configuration option is disabled, but there still is a keyspace that violates the condition, we'd like Scylla to print a warning informing the user about the fact. That's what happens in this commit. We provide a validation test. (cherry picked from commit `837d267cbf`)	2025-08-25 18:45:06 +02:00
Ferenc Szili	75c1ed0e86	truncate: change check for write during truncate into a log warning TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs the keyspace/table names, the truncated_at timepoint, and the offending replay positions which caused the check to fail. Fixes: #25173 Fixes: #25013 (cherry picked from commit `268ec72dc9`)	2025-08-06 00:51:06 +00:00
Benny Halevy	a3025520d2	replica: database: get and expose a mutable locator::shared_token_metadata Prepare for next patch, that will use this shared_token_metadata to make mutable_token_metadata_ptr:s Signed-off-by: Benny Halevy <bhalevy@scylladb.com> (cherry picked from commit `493a2303da`)	2025-07-21 09:59:17 +03:00
Raphael S. Carvalho	f926083fbd	replica: Fix truncate assert failure Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed a preexisting fragility which I missed. 1) truncate gets RP mark X, truncated_at = second T 2) new sstable written during snapshot or later, also at second T (difference of MS) 3) discard_sstables() gets RP Y > saved RP X, since creation time of sstable with RP Y is equal to truncated_at = second T. So the problem is that truncate is using a clock of second granularity for filtering out sstables written later, and after we got low mark and truncate time, it can happen that a sstable is flushed later within the same second, but at a different millisecond. By switching to a millisecond clock (db_clock), we allow sstables written later within the same second to be filtered out. It's not perfect, but it's extremely unlikely that a new write lands and gets flushed in the same millisecond we recorded the truncated_at timepoint. In practice, truncate will not be used concurrently to writes, so this should be enough for our tests performing such concurrent actions. We're moving away from gc_clock which is our cheap lowres_clock, but time is only retrieved when creating sstable objects, whose frequency of creation is low enough not to have significant consequences, and also db_clock should be cheap enough since it's usually syscall-less. Fixes #23771. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24426 (cherry picked from commit `2d716f3ffe`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24875	2025-07-09 17:39:19 +03:00
Botond Dénes	00402cb4c5	sstables: add corrupt_data_handler to sstables::sstables Similar to how large_data_handler is handled, propagate through sstables::sstables_manager and store its owner: replica::database. Tests and tools are also patched. Mostly mechanical changes, updating constructors and patching callers. (cherry picked from commit `ebd9420687`)	2025-07-02 14:04:09 +03:00
Wojciech Mitros	70b21012cd	view_info: don't re-set base_info after construction In the previous commits we made sure that the base info is not dependent on the base schema version, and the info dependent on the base schema version is calculated when it's needed. In this patch we remove the unnecessary re-setting of the base_info. The set_base_info method isn't removed completely, because it also has a secondary function - zeroing the view_info fields other than base_info. Because of this, in this patch we rename it accordingly and limit its use to the updates caused by a base schema change. (cherry picked from commit `d7bd86591e`)	2025-05-27 21:40:23 +02:00
Wojciech Mitros	cadc3eeae8	mv: remove queue length limit from the view update read concurrency semaphore Each view update is correlated to a write that generates it (aside from view building which is throttled separately). These writes are limited by a throttling mechanism, which effectively works by performing the writes with CL=ALL if ongoing writes exceed some memory usage limit. When writes generate view updates, they usually also need to perform a read. This read goes through a read concurrency semaphore where it can get delayed or killed. The semaphore allows up to 100 concurrent reads and puts all remaining reads in a queue. If the number of queued reads exceeds a specific limit, the view update will fail on the replica, causing inconsistencies. This limit is not necessary. When a read gets queued on the semaphore, the write that's causing the view update is paused, so the write takes part in the regular write throttling. If too many writes get stuck on view update reads, they will get throttled, so their number is limited and the number of queued reads is also limited to the same amount. In this patch we remove the specified queue length limit for the view update read concurrency semaphore. Instead of this limit, the queue will be now limited indirectly, by the base write throttling mechanism. This may allow the queue to grow longer than with the previous limit, but it shouldn't ever cause issues - we only perform up to 100 actual reads at once, and the remaining ones that get queued use a tiny amount of memory, less than the writes that generated them and which are getting limited directly. Fixes https://github.com/scylladb/scylladb/issues/23319 Closes scylladb/scylladb#24112 (cherry picked from commit `5920647617`) Closes scylladb/scylladb#24168	2025-05-16 11:46:15 +03:00
Aleksandra Martyniuk	bfdf7c944b	test: test table drop during flush (cherry picked from commit `c1618c7de5`)	2025-05-06 09:52:42 +02:00
Aleksandra Martyniuk	238cf27471	replica: skip flush of dropped table (cherry picked from commit `91b57e79f3`)	2025-05-06 09:52:30 +02:00
Raphael S. Carvalho	75cd8e9492	replica: Fix truncate and drop table after tablet migration happens When running those operations after a tablet replica is migrated away from a shard, an assert can fail resulting in a crash. Status quo (around the assert in truncate procedure): 1) Highest RP seen by table is saved in low_mark, and the current time in low_mark_at. 2) Then compaction is disabled in order to not mix data written before truncate, and data written later. 3) Then memtable is flushed in order for the data written before truncate to be available in sstables and then removed. 4) Now, current time is saved in truncated_at, which is supposedly the time of truncate to decide which sstables to remove. Note: truncated_at is likely above low_mark_at due to steps 2 and 3. The interesting part of the assert is: (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp) Note: RP in the assert above is the highest RP among all sstables generated before truncated_at. RP is retrieved by table::discard_sstables(). If truncated_at > low_mark_at, maybe newer data was written during steps 2 and 3, and memtable's RP becomes greater than low_mark, resulting in an SSTable with RP > low_mark. So assert's 2nd condition is there to defend against the scenario above. truncated_at and low_mark_at use millisecond granularity, so even if truncated_at == low_mark_at, data could have been written in steps 2 and 3 (during same MS window), failing the assert. This is fragile. Reproducer: To reproduce the problem, truncated_at must be > low_mark_at, which can easily happen with both drop table and truncate due to steps 2 and 3. If a shard has 2 or more tablets, the table's highest RP refers to just one tablet in that shard. If the tablet with the highest RP is migrated away, then the sstables in that shard will have lower RP than the recorded highest RP (it's a table wide state, which makes sense since CL is shared among tablets). So when either drop table or truncate runs, low_mark will be potentially bigger than highest RP retrieved from sstables. Proposed solution: The current assert is hacked to not fail if writes sneak in, during steps 2 and 3, but it's still fragile and seems not to serve its real purpose, since it's allowing for RP > low_mark. We should be able to say that low_mark >= RP, as a way of asserting we're not leaving data targeted by truncate behind (or that we're not removing the wrong data). But the problem is that we're saving low_mark in step 1, before preparation steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying all data written so far is targeted for removal. But as of today, low_mark refers to all data written up to step 1. So low_mark is now only one set before issuing flush, and also accounts for all potentially flushed data. Fixes #18059. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23560 (cherry picked from commit `0f59deffaa`) (cherry picked from commit 7554d4bbe09967f9b7a55575b5dfdde4f6616862) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23649	2025-04-11 10:52:37 +03:00
Avi Kivity	3335557075	Merge '[Backport 2025.1] row_cache: don't garbage-collect tombstones which cover data in memtables' from Scylladb[bot] The row cache can garbage-collect tombstones in two places: 1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it; 2) During reads - reads now compact data including garbage collection; In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables. This PR includes fixes for (2), which were not handled at all currently. (1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included. Fixes: https://github.com/scylladb/scylladb/issues/23291 Fixes: https://github.com/scylladb/scylladb/issues/23252 The fix will need backport to all live releases. - (cherry picked from commit `c2518cdf1a`) - (cherry picked from commit `6b5b563ef7`) - (cherry picked from commit `7e600a0747`) - (cherry picked from commit `d126ea09ba`) - (cherry picked from commit `cb76cafb60`) - (cherry picked from commit `df09b3f970`) - (cherry picked from commit `e5afd9b5fb`) - (cherry picked from commit `34b18d7ef4`) - (cherry picked from commit `f7938e3f8b`) - (cherry picked from commit `6c1f6427b3`) - (cherry picked from commit `0d39091df2`) Parent PR: #23255 Closes scylladb/scylladb#23673 * github.com:scylladb/scylladb: test/boost/row_cache_test: add memtable overlap check tests replica/table: add error injection to memtable post-flush phase utils/error_injection: add a way to set parameters from error injection points test/cluster: add test_data_resurrection_in_memtable.py test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts replica/mutation_dump: don't assume cells are live replica/database: do_apply() add error injection point replica: improve memtable overlap checks for the cache replica/memtable: add is_merging_to_cache() db/row_cache: add overlap-check for cache tombstone garbage collection mutation/mutation_compactor: copy key passed-in to consume_new_partition()	2025-04-10 21:42:28 +03:00
Botond Dénes	39ca3463b3	replica/database: do_apply() add error injection point So writes (to user tables) can be failed on a replica, via error injection. Should simplify tests which want to create differences in what writes different replicas receive. (cherry picked from commit `cb76cafb60`)	2025-04-10 03:17:27 -04:00
Botond Dénes	1c7a6ba140	replica: improve memtable overlap checks for the cache The current memtable overlap check that is used by the cache -- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only checks the active memtable, so memtables which are either being flushed or are already flushed and also have active reads against them do not participate in the overlap check. This can result in temporary data resurrection, where a cache read can garbage-collect a tombstone which still covers data in a flushing or flushed memtable, which still have active read against it. To prevent this, extend the overlap check to also consider all of the memtable list. Furthermore, memtable_list::erase() now places the removed (flushed) memtable in an intrusive list. These entries are alive only as long as there are readers still keeping an `lw_shared_ptr<memtable>` alive. This list is now also consulted on overlap checks. (cherry picked from commit `d126ea09ba`)	2025-04-10 03:17:27 -04:00
Botond Dénes	02d89435a9	Merge '[Backport 2025.1] Ignore wrapped exceptions `gate_closed_exception` and `rpc::closed_error` when node shuts down.' from Scylladb[bot] Normally, when a node is shutting down, `gate_closed_exception` and `rpc::closed_error` in `send_to_live_endpoints` should be ignored. However, if these exceptions are wrapped in a `nested_exception`, an error message is printed, causing tests to fail. This commit adds handling for nested exceptions in this case to prevent unnecessary error messages. Fixes scylladb/scylladb#23325 Fixes scylladb/scylladb#23305 Fixes scylladb/scylladb#21815 Backport: looks like this is quite a frequent issue, therefore backport to 2025.1. - (cherry picked from commit `6abfed9817`) - (cherry picked from commit `b1e89246d4`) - (cherry picked from commit `0d9d0fe60e`) - (cherry picked from commit `d448f3de77`) Parent PR: #23336 Closes scylladb/scylladb#23470 * github.com:scylladb/scylladb: database: Pass schema_ptr as const ref in `wrap_commitlog_add_error` database: Unify exception handling in `do_apply` and `apply_with_commitlog` storage_proxy: Ignore wrapped `gate_closed_exception` and `rpc::closed_error` when node shuts down. exceptions: Add `try_catch_nested` to universally handle nested exceptions of the same type.	2025-04-10 10:04:50 +03:00
Botond Dénes	c44362451c	replica/database: setup_scylla_memory_diagnostics_producer() un-static semaphore dump lambda The lambda which dumps the diagnostics for each semaphore, is static. Considering that said lambda captures a local (writeln) by reference, this is wrong on two levels: * The writeln captured on the shard which happens to initialize this static, will be used on all shards. * The writeln captured on the first dump, will be used on later dumps, possibly triggering a segfault. Drop the `static` to make the lambda local and resolve this problem. Fixes: scylladb/scylladb#22756 Closes scylladb/scylladb#22776 (cherry picked from commit `820f196a49`) Closes scylladb/scylladb#22938	2025-04-10 09:54:37 +03:00
Sergey Zolotukhin	bfb242b735	database: Pass schema_ptr as const ref in `wrap_commitlog_add_error` (cherry picked from commit `d448f3de77`)	2025-03-27 21:28:13 +00:00
Sergey Zolotukhin	fe94b5a475	database: Unify exception handling in `do_apply` and `apply_with_commitlog` Move exception wrapping logic from `do_apply` and `apply_with_commitlog` to `wrap_commitlog_add_error` to ensure consistent error handling. (cherry picked from commit `0d9d0fe60e`)	2025-03-27 21:28:13 +00:00
Dawid Mędrek	ecdefe801c	main: Refuse to start node when RF-rack-invalid keyspace exists When a node is started with the option `rf_rack_valid_keyspaces` enabled, the initialization will fail if there is an RF-rack-invalid keyspace. We want to force the user to adjust their existing keyspaces when upgrading to 2025.* so that the invariant that every keyspace is RF-rack-valid is always satisfied. Fixes scylladb/scylladb#23300 (cherry picked from commit `0e04a6f3eb`)	2025-03-21 12:27:04 +00:00
Aleksandra Martyniuk	4c39943b3f	replica: mark registry entry as synch after the table is added When a replica gets a write request it performs get_schema_for_write, which waits until the schema is synced. However, database::add_column_family marks a schema as synced before the table is added. Hence, the write may see the schema as synced, but hit no_such_column_family as the table hasn't been added yet. Mark schema as synced after the table is added to database::_tables_metadata. Fixes: #22347. Closes scylladb/scylladb#22348 (cherry picked from commit `328818a50f`) Closes scylladb/scylladb#22604	2025-02-13 09:39:13 +02:00
Botond Dénes	9116fc635e	Merge '[Backport 2025.1] split: run set_split_mode() on all storage groups during all_storage_groups_split()' from Scylladb[bot] `tablet_storage_group_manager::all_storage_groups_split()` calls `set_split_mode()` for each of its storage groups to create split ready compaction groups. It does this by iterating through storage groups using `std::ranges::all_of()` which is not guaranteed to iterate through the entire range, and will stop iterating on the first occurrence of the predicate (`set_split_mode()`) returning false. `set_split_mode()` creates the split compaction groups and returns false if the storage group's main compaction group or merging groups are not empty. This means that in cases where the tablet storage group manager has non-empty storage groups, we could have a situation where split compaction groups are not created for all storage groups. The missing split compaction groups are later created in `tablet_storage_group_manager::split_all_storage_groups()` which also calls `set_split_mode()`, and that is the reason why split completes successfully. The problem is that `tablet_storage_group_manager::all_storage_groups_split()` runs under a group0 guard, but `tablet_storage_group_manager::split_all_storage_groups()` does not. This can cause problems with operations which should exclude with compaction group creation. i.e. DROP TABLE/DROP KEYSPACE Fixes #22431 This is a bugfix and should be back ported to versions with tablets: 6.1 6.2 and 2025.1 - (cherry picked from commit `24e8d2a55c`) - (cherry picked from commit `8bff7786a8`) Parent PR: #22330 Closes scylladb/scylladb#22560 * github.com:scylladb/scylladb: test: add reproducer and test for fix to split ready CG creation table: run set_split_mode() on all storage groups during all_storage_groups_split()	2025-02-13 09:36:23 +02:00
Botond Dénes	319626e941	reader_concurrency_semaphore: with_permit(): proper clean-up after queue overload with_permit() creates a permit, with a self-reference, to avoid attaching a continuation to the permit's run function. This self-reference is used to keep the permit alive, until the execution loop processes it. This self reference has to be carefully cleared on error-paths, otherwise the permit will become a zombie, effectively leaking memory. Instead of trying to handle all loose ends, get rid of this self-reference altogether: ask caller to provide a place to save the permit, where it will survive until the end of the call. This makes the call-site a little bit less nice, but it gets rid of a whole class of possible bugs. Fixes: #22588 Closes scylladb/scylladb#22624 (cherry picked from commit `f2d5819645`) Closes scylladb/scylladb#22704	2025-02-06 10:08:19 +02:00
Ferenc Szili	fe869fd902	test: add reproducer and test for fix to split ready CG creation This adds a reproducer for #22431 In cases where a tablet storage group manager had more than one storage group, it was possible to create compaction groups outside the group0 guard, which could create problems with operations which should exclude with compaction group creation. (cherry picked from commit `8bff7786a8`)	2025-01-29 10:10:28 +00:00
Tomasz Grabiec	8059090a29	Merge 'Cache base info for view schemas in the schema registry' from Wojciech Mitros Currently, when we load a frozen schema into the registry, we lose the base info if the schema was of a view. Because of that, in various places we need to set the base info again, and in some codepaths we may miss it completely, which may make us unable to process some requests (for example, when executing reverse queries on views). Even after setting the base info, we may still lose it if the schema entry gets deactivated due to all `schema_ptr`s temporarily dying. To fix this, this patch adds the base schema to the registry, alongside the view schema. We store just the frozen base schema, so that we can transfer it across shards. With the base schema, we can now set the base info when returning the schema from the registry. As a result, we can now assume that all view schemas returned by the registry have base_info set. In this series we also make sure that the view schemas in the registry are kept up-to-date in regards to base schema changes. Fixes https://github.com/scylladb/scylladb/issues/21354 This issue is a bug, so adding backport labels 6.1 and 6.2 Closes scylladb/scylladb#21862 * github.com:scylladb/scylladb: test: add test for schema registry maintaining base info for views schema_registry: avoid setting base info when getting the schema from registry schema_registry: update cached base schemas when updating a view schema_registry: cache base schemas for views db: set base info before adding schema to registry	2025-01-21 00:17:54 +01:00
Kefu Chai	569f8e9246	treewide: fix misspellings these misspellings were identified by codespell. let's fix them. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#22154	2025-01-05 16:13:09 +02:00
Piotr Dulikowski	7383013f43	replica/database: add reader concurrency semaphore groups Replace the reader concurrency semaphores for user reads and view updates with the newly introduced reader concurrency semaphore group, which assigns a semaphore for each service level. Each group is statically assigned to some pool of memory on startup and dynamically distributes this memory between the semaphores, relative to the number of shares of the corresponding scheduling group. The intent of having a separate reader concurrency semaphore for each scheduling group is to prevent priority inversion issues due to reads with different priorities waiting on the same semaphore, as well as make memory allocation more fair between service levels due to the adjusted number of shares.	2025-01-02 07:13:34 +01:00
Wojciech Mitros	6f11edbf3f	db: set base info before adding schema to registry In the following patches, we'll ensure that view schemas returned by the schema registry always have base info set. To prepare for that, make sure that the base info is always set before inserting it into schema registry.	2024-12-30 14:56:17 +01:00
Avi Kivity	f3eade2f62	treewide: relicense to ScyllaDB-Source-Available-1.0 Drop the AGPL license in favor of a source-available license. See the blog post [1] for details. [1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/	2024-12-18 17:45:13 +02:00
Gleb Natapov	ca55d1e658	replica/database: drop usage of ip in favor of host id in get_keyspace_local_ranges	2024-12-15 11:31:11 +02:00
Botond Dénes	5d040e0206	Merge 'truncate: commit log replay positions are not saved correctly' from Ferenc Szili TRUNCATE TABLE saves the current commit log replay positions in case there is a crash so that replay knows where to begin replaying the mutations. These are collected and saved per shard into `system.truncated`. In case a shard received no mutations, its replay position will be an empty, default constructed object of type `db::replay_position` with its members set to 0. Truncate will incorrectly interpret these empty replay positions as if they were coming from shard 0, and save them as such, potentially overwriting an actual valid replay position coming from the actual shard 0. In the case of a crash, this will cause the commit log on shard 0 to be replayed from the beginning, and result with data resurrection. Fixes #21719 Closes scylladb/scylladb#21722 * github.com:scylladb/scylladb: test: add test for truncate saving replay positions database: correctly save replay position for truncate	2024-12-10 10:05:30 +02:00
Kefu Chai	48c8d24345	treewide: drop support for fmt < v10 since fedora 38 is EOL. and fedora 39 comes with fmt v10.0.0, also, we've switched to the build image based on fedora 40, which ships fmt-devel v10.2.1, there is no need to support fmt < 10. in this change, we drop the support fmt < 10. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21847	2024-12-09 20:42:38 +02:00
Ferenc Szili	036d3287b3	database: correctly save replay position for truncate This commit fixes a problem with the way truncate saves commit log replay positions. On shards without mutations, truncate would save the replay position into system.truncated with shard number 0 regardless of the actual shard number that the replay position was saved for.	2024-11-28 16:18:32 +01:00
Piotr Smaron	a49ed7074d	Update in-memory ks.metadata.init_tablets after ALTER KS Once e.g. `ALTER KEYSPACE` is performed, all in-memory objects should be updated accordingly, but this is not entirely true for keyspace metadata object. The reason for that is that keyspace metadata are stored in 2 system tables: `system_schema.keyspaces` and `system_schema.scylla_keyspaces`. Up until now the in-memory keyspace metadata object has been updated only with entries from the first table, and missed updates when entries from the 2nd table changed. These entries were e.g. initial tablets or storage options. This change fixes this oversight by considering both tables when checking if keyspace metadata need to be updated. From the implementation point of view, the change is simple: we're considering `system_schema.scylla_keyspaces` also in `merge_keyspaces()` and if old and new schemas have any differences, we include that when altering ks. Fixes #20768 Backport: no need, I don't think the issue is severe, atm it seems like it can only influence the tablets number, which should not bring the cluster down nor result in returning bad data, it can mostly influence the speed of the db. Closes scylladb/scylladb#20852	2024-11-28 13:46:32 +01:00
Kefu Chai	5e391eee25	treewide: use coroutine::parallel_for_each(range) when appropriate `coroutine::parallel_for_each` accepts both a range and a pair of iterators. let's use the former when appropriate. it is simpler this way. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21684	2024-11-27 21:00:47 +02:00
Ernest Zaslavsky	793f2c95d1	snapshots: Stop taking snapshots of MVs Stop taking snapshots of MVs and allow taking snapshot of individual tables, now one can take a snapshot of any base table, any view or index. Also add tests to cover new cases both boost test (using cc code) and pytest (using the API) Also, update documentation to reflect the change fixes: #21339 fixes: #20760 Closes scylladb/scylladb#21433	2024-11-26 15:27:30 +02:00
Kefu Chai	a5ee0c896b	treewide: migrate from boost::adaptors::filtered to std::views::filter Modernize the codebase by replacing Boost range adaptors with C++23 standard library views, reducing external dependencies and leveraging modern C++ language features. Key Changes: - Replace `boost::adaptors::filtered` with `std::views::filter` - Remove `#include <boost/range/adaptor/filtered.hpp>` - Utilize standard library range views Motivation: - Reduce project's external dependency footprint - Leverage standard library's range and view capabilities - Improve long-term code maintainability - Align with modern C++ best practices Implementation Challenges and Considerations: 1. Range Conversion and Move Semantics - `std::ranges::to` adaptor requires rvalue references - Necessitated updates to variable and parameter constness - Example: `cql3/restrictions/statement_restrictions.cc` modified to remove `const` from `common` to enable efficient range conversion 2. Range Iteration and Mutation - Range views may mutate internal state during iteration - Cannot pass ranges by const reference in some scenarios - Solution: Pass ranges by rvalue reference to explicitly indicate state invalidation Limitations: - One instance of `boost::adaptors::filtered` temporarily preserved due to lack of a C++23 alternative for `boost::join()` - A comprehensive replacement will be addressed in a follow-up change This change is part of our ongoing effort to modernize the codebase, reducing external dependencies and adopting modern C++ practices. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21648	2024-11-26 14:26:50 +02:00
Kefu Chai	33a0e5b892	treewide: replace boost::find_if with std::ranges::find_if now that we are allowed to use C++23. we now have the luxury of using `std::ranges::find_if`. in this change, we: - replace `boost::find_if` with `std::ranges::find_if` - remove all `#include <boost/range/algorithm/find_if.hpp>` to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-11-19 10:50:01 +08:00
Avi Kivity	b58dbe57aa	Merge 'repair: introduce and use buffer size hint for mixed-shard multishard reader' from Botond Dénes Add a buffer hint to the multishard reader. This is an internal hint, used by the multishard reader to provide a hint to the shard reader, on how much data exactly is needed by the multishard reader from the respective shard. This hint allows eliminating extraneous cross-shard round-trips and possible shard reader evict-recreate cycles. Building on this, repair sets its own row buffer size as the max buffer size on the multishard reader, ensuring that the row buffer is filled with the minimum amount of cross-shard round trips and minimal reader recreation. To further eliminate unnecessary evictions, this PR also disables the multishard reader's read-ahead which is a mechanism that was designed to reduce latency for user-reads but it can be too aggressive for repair, causing unnecessary extra congestion on the already struggling streaming semaphores. Refs: https://github.com/scylladb/scylladb/issues/18269 Fixes: https://github.com/scylladb/scylladb/issues/21113 The performance impact was measured with an SCT test, which creates a cluster of 3 nodes with 16 shards, then adds a 4th one with 12 shards. Currently, it is the bootstrap time which is the worst in the case of mixed shard clusters, see below for the improvement measured during bootstrap: \| \| master \| buffer-hint \| metric \| \| ------------ \| ------------- \| ------------- \| --------------------------------------------------- \| \| evictions \| 0.9M \| 93.0K \| scylla_database_paused_reads_permit_based_evictions \| \| read (bytes) \| 9.0T \| 3.9T \| scylla_reactor_aio_bytes_read \| \| read (ops) \| 88.0M \| 33.5M \| scylla_reactor_aio_reads \| \| time \| 56min \| 20min \| N/A \| This is a performance improvement, no backport required. Closes scylladb/scylladb#20815 * github.com:scylladb/scylladb: test/boost/mutation_reader_test: add test for multishard reader buffer hint repair/row_level: disable read-ahead db/config: introduce repair_multishard_reader_enable_read_ahead readers/multishard: implement the read_ahead flag replica/database: make_multishard_streaming_reader(): expose the read_ahead parameter readers/multishard: add read_ahead parameter repair/row_level: set max buffer size on multishard reader replica/database: make_multishard_streaming_reader(): expose buffer_hint parameter db/config: introduce enable_repair_multishard_reader_buffer_hint readers/multishard: multishard_reader: pass hint to shard_reader readers/multishard: shard_reader_v2::fill_reader_buffer(): respect the hint readers/multishard: propagate fill_buffer_hint to shard_reader:fill_reader_buffer() readers/multishard: shard_reader: extract buffer-fill into its own method	2024-11-10 12:55:19 +02:00
Botond Dénes	8938e06ebe	replica/database: make_multishard_streaming_reader(): expose the read_ahead parameter Continuing the previous patch, expose the just added read_ahead parameter of make_multishard_combining_reader_v2(). Set to read_ahead::yes by all callers, keeping the current default.	2024-11-07 02:47:54 -05:00
Botond Dénes	e2344e28b6	replica/database: make_multishard_streaming_reader(): expose buffer_hint parameter Expose the buffer hint functionality added by the previous commits, to callers of make_multishard_streaming_reader(). All callers disable it currently, it will be used in the next patch.	2024-11-07 02:47:46 -05:00
Kefu Chai	59eb2ab119	treewide: s/boost::algorithm::any_of/std::ranges::any_of/ now that we are allowed to use C++23. we now have the luxury of using `std::ranges::any_of`. in this change, we replace `boost::algorithm::any_of` with `std::ranges::any_of` to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-11-05 14:06:09 +08:00
Kefu Chai	24d14b601b	treewide: s/boost::adaptors::map_values/std::views::values/ now that we are allowed to use C++23. we now have the luxury of using `std::views::values`. in this change, we: - replace `boost::adaptors::map_values` with `std::views::values` - update affected code to work with `std::views::values` - the places where we use `boost::join()` are not changed, because we cannot use `std::views::concat` yet. this helper is only available in C++26. to reduce the dependency to boost for better maintainability, and leverage standard library features for better long-term support. this change is part of our ongoing effort to modernize our codebase and reduce external dependencies where possible. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#21265	2024-10-27 21:32:45 +02:00
Nadav Har'El	5fd3177057	Merge 'mv: add a dedicated read concurrency semaphore for view update read before writes' from Wojciech Mitros When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload. This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency. The second issue fixed by this patch is the concurrency of the view updates that is currently unlimited. Because of that, view updates may take up so much memory that we may run out of memory. This is fixed by using the read admission on the view update concurrency semaphore. This limits the number of concurrent view update reads to max_count_concurrent_view_update_reads, all other incoming view update reads are queued using just a small chunk of memory. Without this, the reads would also get queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue. 
The new semaphore has half the capacity of the regular user read concurrency semaphore and is currently used only for user writes - it's used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction. This patch also adds a test to confirm that the view update workload doesn't impact the read latency, as well as a test which confirms that we do not run out of memory even under heavy view update workload. The issue of view updates causing increased latencies most often occurs in the following scenario: * we have a medium to high write workload to a table with a materialized view which requires reading from the base table before sending the update to delete the old rows * we have any read workload * one replica is slower or is handling more writes due to an imbalance of data distribution * we write with a cl<ALL, the mentioned replica is replying to write requests slower while new ones keep being sent to it. * each write performs a read first taking resources from the user read concurrency semaphore, so when enough writes accumulate the reads using the semaphore start getting queued * the queue is shared by regular reads and view update reads. When there are enough view update reads in the queue, regular reads start getting increased latencies An sct test (perf-regression-latency-mv-read-concurrency) was prepared to somewhat resemble this scenario: * the tables were prepared satisfying the conditions above * we use a medium write workload and a very low read workload * the imbalance is achieved by writing to just a few (10) partitions - some replicas (and shards) can have twice or more used partitions than others. 
We also keep writing to a limited (though high) number of rows, to cause overwrites which require reading before sending the view update * to minimize the test case, we use a cluster of 3 nodes and rf=2, we write with cl=ONE to have background replica writes and read with cl=ALL to wait for the slower replica to respond. In the test above: * without the fix, the latency of reads increases over 50s * with the fix, the latency of reads stays below 20ms Fixes https://github.com/scylladb/scylladb/issues/8873 Fixes https://github.com/scylladb/scylladb/issues/15805 The patch is not that small and it isn't fixing a regression, so no backports Closes scylladb/scylladb#20887 * github.com:scylladb/scylladb: test: add test for high view update concurrency causing bad_allocs test: add test for high view update concurrency degrading read latency mv: add a dedicated read concurrency semaphore for view update read before writes	2024-10-22 22:17:23 +03:00
Avi Kivity	ec543e3902	Merge 'Remove all_datadirs vector of strings from table::config' from Pavel Emelyanov The all_datadirs keeps paths to directories where local sstables can be. In fact, Scylla doesn't put sstables there, but can try to find them on boot and when checking snapshots. The 0th element of this vector, called datadir, had recently been removed by #20675, now it's time to drop all_datadirs as well. The needed paths can be obtained from table's storage options (see #20542) and db::config::data_file_directories option. Closes scylladb/scylladb#21212 * github.com:scylladb/scylladb: sstables: Open-code format_table_directory_name() moved recently replica,sstables: Move format_table_directory_name() table: Remove all_datadirs sstables: Generate table::all_datadirs from db::config and storage_options replica: Prepare vector of fs::path-s with table dirs table: Check storage options in get_snapshot_details()	2024-10-22 17:21:31 +03:00
Kefu Chai	6ead5a4696	treewide: move log.hh into utils/log.hh the log.hh under the root of the tree was created to keep the backward compatibility when seastar was extracted into a separate library. so log.hh should belong to `utils` directory, as it is based solely on seastar, and can be used by all subsystems. in this change, we move log.hh into utils/log.hh so that it is more modularized. and this also improves the readability, when one sees `#include "utils/log.hh"`, it is obvious that this source file needs the logging system, instead of its own log facility -- please note, we do have two other `log.hh` in the tree. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>	2024-10-22 06:54:46 +03:00
Pavel Emelyanov	eeb0d637bb	replica,sstables: Move format_table_directory_name() Now this helper is not needed in replica code, as all manipulations of tables' sstables now sit in the sstables/storage.cc. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-21 15:17:30 +03:00
Pavel Emelyanov	74728d3889	table: Remove all_datadirs It's write-only now, all the places that wanted to know where table's storage is (well -- "are", there can be several directories) already use storage_options. This finishes the work started by `9fe64b5d70`. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-10-21 15:15:54 +03:00
Wojciech Mitros	242079d70b	mv: add a dedicated read concurrency semaphore for view update read before writes When writing to some tables with materialized views, we need to read from the base table first to perform a delete of the old view row. When doing so, the memory used for the read is tracked by the user read concurrency semaphore. When we have a large number of such reads, we may use up all of the semaphore units, causing the following reads to be queued. When we have some user reads coming at the same time, these reads can have very high latency due to the write workload on the base table. We want to avoid this, so that the write workload doesn't have a high impact on the latency of the read workload. This is fixed in this patch by adding a separate read concurrency semaphore just for view update read-before-writes. With the new semaphore, even if there are many view update read-before-writes, they will be queued on a different semaphore than the user reads, and they won't impact their latency. The second issue fixed by this patch is the concurrency of the view updates that is currently unlimited. Because of that, view updates may take up so much memory that we may run out of memory. This is fixed by using the read admission on the view update concurrency semaphore. This limits the number of concurrent view update reads to max_count_concurrent_view_update_reads, all other incoming view update reads are queued using just a small chunk of memory. Without this, the reads would also get queued after exceeding view_update_reader_concurrency_semaphore_serialize_limit_multiplier, but they would take much more memory while staying in the queue. 
The new semaphore has half the capacity of the regular user read concurrency semaphore and is currently used only for user writes - it's used independently of the scheduling group on which we base the read semaphore selection, but we use a different code path for streaming (not database::do_apply) and we shouldn't have view updates in system writes or during compaction. Fixes https://github.com/scylladb/scylladb/issues/8873 Fixes https://github.com/scylladb/scylladb/issues/15805	2024-10-21 11:02:06 +02:00
Emil Maskovsky	74bd79bbb3	tombstone_gc: refactor the repair map Move the repair_map definition to the tombstone_gc file where it is mostly being used. Refactor and add the accessors and setters for the group0 tombstone GC time.	2024-10-08 20:53:54 +02:00


509 Commits