scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-26 19:35:12 +00:00

Author	SHA1	Message	Date
Lakshmi Narayanan Sreethar	0fc0474ccc	sstables: reclaim_memory_from_components: do not update _recognised_components When reclaiming memory from bloom filters, do not remove them from _recognised_components, as that leads to the on-disk filter component being left back on disk when the SSTable is deleted. Fixes #18398 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#18400 (cherry picked from commit `6af2659b57`) Closes #18437	2024-04-29 10:02:52 +03:00
Kefu Chai	119dbb0d43	thrift: avoid use-after-move in `make_non_overlapping_ranges()` in handler.cc, `make_non_overlapping_ranges()` references a moved instance of `ColumnSlice` when something unexpected happens to format the error message in an exception, the move constructor of `ColumnSlice` is default-generated, so the members' move constructors are used to construct the new instance in the move constructor. this could lead to undefined behavior when dereferencing the move instance. in this change, in order to avoid use-after free, let's keep a copy of the referenced member variables and reference them when formatting error message in the exception. this use-after-move issue was introduced in `822a315dfa`, which implemented `get_multi_slice` verb and this piece in the first place. since both 5.2 and 5.4 include this commit, we should backport this change to them. Refs `822a315dfa` Fixes #18356 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `1ad3744edc`) Closes #18373	2024-04-25 11:36:00 +03:00
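The fix pattern described in the commit above (copy what the error path needs before moving the object) can be sketched as follows. This is only an illustration: `column_slice`, `consume` and `process` are hypothetical stand-ins, not the Thrift handler code.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a generated type with a defaulted move constructor.
struct column_slice {
    std::string start;
    std::string finish;
};

// Hypothetical consumer that takes ownership of the slice.
inline std::string consume(column_slice slice) {
    return slice.start + ".." + slice.finish;
}

// Copy the members needed for the error message *before* moving the object,
// so the error path never reads from the moved-from instance.
inline std::string process(column_slice slice, bool fail) {
    const std::string start = slice.start;
    const std::string finish = slice.finish;
    std::string result = consume(std::move(slice));
    if (fail) {
        // Uses the saved copies, not slice.start / slice.finish.
        throw std::invalid_argument("bad range: " + start + " - " + finish);
    }
    return result;
}
```

The extra copies cost a little on the happy path, which is the trade-off the commit message accepts in exchange for a well-defined error path.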
Anna Mikhlin	dae9bef75f	release: prepare for 5.2.18 scylla-5.2.18	2024-04-19 13:30:48 +03:00
Asias He	065f7178ab	repair: Improve estimated_partitions to reduce memory usage Currently, we use the sum of the estimated_partitions from each participant node as the estimated_partitions for sstable produced by repair. This way, the estimated_partitions is the biggest possible number of partitions repair would write. Since repair will write only the difference between repair participant nodes, using the biggest possible estimation will overestimate the partitions written by repair, most of the time. The problem is that overestimated partitions makes the bloom filter consume more memory. It is observed that it causes OOM in the field. This patch changes the estimation to use a fraction of the average partitions per node instead of sum. It is still not a perfect estimation but it already improves memory usage significantly. Fixes #18140 Closes scylladb/scylladb#18141 (cherry picked from commit `642f9a1966`)	2024-04-18 16:37:05 +03:00
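A rough sketch of the two estimation strategies contrasted in the commit above. The function names and the `fraction` parameter are assumptions made for illustration; the commit message does not state the fraction actually used.

```cpp
#include <cassert>
#include <cstdint>
#include <numeric>
#include <vector>

// Old approach as described: the sum over all participants, i.e. an upper bound.
inline std::uint64_t estimate_by_sum(const std::vector<std::uint64_t>& per_node) {
    return std::accumulate(per_node.begin(), per_node.end(), std::uint64_t{0});
}

// New approach as described: a fraction of the per-node average.
// `fraction` is a hypothetical tuning parameter, not a value taken from the patch.
inline std::uint64_t estimate_by_average_fraction(const std::vector<std::uint64_t>& per_node,
                                                  double fraction) {
    if (per_node.empty()) {
        return 0;
    }
    const std::uint64_t avg = estimate_by_sum(per_node) / per_node.size();
    return static_cast<std::uint64_t>(avg * fraction);
}
```

For three participants reporting 1000, 2000 and 3000 partitions, the sum is 6000, while half of the average is 1000, so a bloom filter sized from the second estimate would be considerably smaller.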
Botond Dénes	f17e480237	Merge '[Backport 5.2] : Track and limit memory used by bloom filters' from Lakshmi Narayanan Sreethar Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required. The feature can be manually verified by doing the following : 1. run a single-node single-shard 1GB cluster 2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter) 3. populate with tiny partitions 4. watch the bloom filter metrics get capped at 100MB The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature. Fixes https://github.com/scylladb/scylladb/issues/17747 Backported from #17771 to 5.2. Closes #18247 * github.com:scylladb/scylladb: test_bloom_filter.py: disable reclaiming memory from components sstable_datafile_test: add tests to verify auto reclamation of components test/lib: allow overriding available memory via test_env_config sstables_manager: support reclaiming memory from components sstables_manager: store available memory size sstables_manager: add variable to track component memory usage db/config: add a new variable to limit memory used by table components sstable_datafile_test: add testcase to verify reclamation from sstables sstables: support reclaiming memory from components	2024-04-17 14:34:19 +03:00
Lakshmi Narayanan Sreethar	dd9ab15bb5	test_bloom_filter.py: disable reclaiming memory from components Disabled reclaiming memory from sstable components in the testcase as it interferes with the false positive calculation. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `d86505e399`)	2024-04-16 15:50:22 +05:30
Lakshmi Narayanan Sreethar	96db5ae5e3	sstable_datafile_test: add tests to verify auto reclamation of components Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `d261f0fbea`)	2024-04-16 15:49:58 +05:30
Lakshmi Narayanan Sreethar	beea229deb	test/lib: allow overriding available memory via test_env_config Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `169629dd40`)	2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar	89367c4310	sstables_manager: support reclaiming memory from components Reclaim memory from the SSTable that has the most reclaimable memory if the total reclaimable memory has crossed the threshold. Only the bloom filter memory is considered reclaimable for now. Fixes #17747 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `a36965c474`)	2024-04-16 15:30:39 +05:30
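The reclaim policy described in the commit above (when the total crosses the threshold, reclaim from the SSTable holding the most reclaimable memory) might look roughly like this. It is a simplified sketch, not the `sstables_manager` code: the function name is hypothetical, each SSTable is reduced to a byte count, and the threshold is treated as a fraction of available memory.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Each entry is the reclaimable (bloom filter) memory of one SSTable, in bytes.
// Returns the number of SSTables whose memory was reclaimed.
inline std::size_t reclaim_until_under_threshold(std::vector<std::size_t>& reclaimable,
                                                 std::size_t available_memory,
                                                 double threshold) {
    const auto limit = static_cast<std::size_t>(available_memory * threshold);
    std::size_t total = std::accumulate(reclaimable.begin(), reclaimable.end(), std::size_t{0});
    std::size_t reclaimed = 0;
    while (total > limit) {
        // Pick the SSTable with the most reclaimable memory and drop all of it.
        auto largest = std::max_element(reclaimable.begin(), reclaimable.end());
        total -= *largest;
        *largest = 0;
        ++reclaimed;
    }
    return reclaimed;
}
```

With 1024 bytes available, a threshold of 0.5 and SSTables holding 300, 200 and 100 bytes, only the 300-byte filter is reclaimed, which brings the total from 600 down to 300.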
Lakshmi Narayanan Sreethar	32de41ecb4	sstables_manager: store available memory size The available memory size is required to calculate the reclaim memory threshold, so store that within the sstables manager. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `2ca4b0a7a2`)	2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar	0841c0084c	sstables_manager: add variable to track component memory usage sstables_manager::_total_reclaimable_memory variable tracks the total memory that is reclaimable from all the SSTables managed by it. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `f05bb4ba36`)	2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar	786c08aa59	db/config: add a new variable to limit memory used by table components A new configuration variable, components_memory_reclaim_threshold, has been added to configure the maximum allowed percentage of available memory for all SSTable components in a shard. If the total memory usage exceeds this threshold, it will be reclaimed from the components to bring it back under the limit. Currently, only the memory used by the bloom filters will be restricted. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `e8026197d2`)	2024-04-16 15:30:39 +05:30
Lakshmi Narayanan Sreethar	31251b37dd	sstable_datafile_test: add testcase to verify reclamation from sstables Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `e0b6186d16`)	2024-04-16 15:30:30 +05:30
Lakshmi Narayanan Sreethar	1b390ceb24	sstables: support reclaiming memory from components Added support to track total memory from components that are reclaimable and to reclaim memory from them if and when required. Right now only the bloom filters are considered as reclaimable components but this can be extended to any component in the future. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `4f0aee62d1`)	2024-04-16 13:03:45 +05:30
Tzach Livyatan	0bfe016beb	Update Driver root page The right term is Amazon DynamoDB not AWS DynamoDB See https://aws.amazon.com/dynamodb/ Closes scylladb/scylladb#18214 (cherry picked from commit `289793d964`)	2024-04-16 09:55:41 +03:00
Botond Dénes	280956f507	Merge '[Backport 5.2] repair: fix memory counting in repair' from Aleksandra Martyniuk Repair memory limit includes only the size of frozen mutation fragments in repair row. The size of other members of repair row may grow uncontrollably and cause out of memory. Modify what's counted to repair memory limit. Fixes: https://github.com/scylladb/scylladb/issues/16710. (cherry picked from commit `a4dc6553ab`) (cherry picked from commit `51c09a84cc`) Refs https://github.com/scylladb/scylladb/pull/17785 Closes #18237 * github.com:scylladb/scylladb: test: add test for repair_row::size() repair: fix memory accounting in repair_row	2024-04-16 07:07:15 +03:00
Aleksandra Martyniuk	97671eb935	test: add test for repair_row::size() Add test which checks whether repair_row::size() considers external memory. (cherry picked from commit `51c09a84cc`)	2024-04-09 13:29:33 +02:00
Aleksandra Martyniuk	8144134545	repair: fix memory accounting in repair_row In repair, only the size of frozen mutation fragments of repair row is counted to the memory limit. So, huge keys of repair rows may lead to OOM. Include other repair_row's members' memory size in repair memory limit. (cherry picked from commit `a4dc6553ab`)	2024-04-06 22:44:51 +00:00
Ferenc Szili	2bb5fe7311	logging: Don't log PK/CK in large partition/row/cell warning Currently, Scylla logs a warning when it writes a cell, row or partition which are larger than certain configured sizes. These warnings contain the partition key and in case of rows and cells also the cluster key which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectively. This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables. The logged data will look like this: Large cells: WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db Large rows: WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db Large partitions: WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db Fixes #18041 Closes scylladb/scylladb#18166 (cherry picked from commit `f1cc6252fd`)	2024-04-05 16:03:08 +03:00
Kefu Chai	4595f51d5c	utils/logalloc: do not allocate memory in reclaim_timer::report() before this change, `reclaim_timer::report()` calls ```c++ fmt::format(", at {}", current_backtrace()) ``` which allocates a `std::string` on heap, so it can fail and throw. in that case, `std::terminate()` is called. but at that moment, the reason why `reclaim_timer::report()` gets called is that we fail to reclaim memory for the caller. so we are more likely to run into this issue. anyway, we should not allocate memory in this path. in this change, a dedicated printer is created so that we don't format to a temporary `std::string`, and instead write directly to the buffer of logger. this avoids the memory allocation. Fixes #18099 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18100 (cherry picked from commit `fcf7ca5675`)	2024-04-02 16:38:17 +03:00
Wojciech Mitros	c0c34d2af0	mv: keep semaphore units alive until the end of a remote view update When a view update has both a local and remote target endpoint, it extends the lifetime of its memory tracking semaphore units only until the end of the local update, while the resources are actually used until the remote update finishes. This patch changes the semaphore transferring so that in case of both local and remote endpoints, both view updates share the units, causing them to be released only after the update that takes longer finishes. Fixes #17890 (cherry picked from commit `9789a3dc7c`) Closes #18104	2024-04-02 10:09:01 +02:00
Pavel Emelyanov	c34a503ef3	Update seastar submodule (iotune error path crash fix) * seastar eb093f8a...b9fd21d8 (1): > iotune: Don't close file that wasn't opened Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-03-28 10:53:51 +03:00
Beni Peled	447a3beb47	release: prepare for 5.2.17 scylla-5.2.17	2024-03-27 14:35:37 +02:00
Botond Dénes	aca50c46b7	tools/toolchain: update python driver Backports scylladb/scylladb#17604 and scylladb/scylladb#17956. Fixes scylladb/scylladb#16709 Fixes scylladb/scylladb#17353 Closes #17661	2024-03-27 08:48:25 +02:00
Wojciech Mitros	44bcaca929	mv: adjust memory tracking of single view updates within a batch Currently, when dividing memory tracked for a batch of updates we do not take into account the overhead that we have for processing every update. This patch adds the overhead for single updates and joins the memory calculation path for batches and their parts so that both use the same overhead. Fixes #17854 (cherry picked from commit `efcb718`) Closes #17999	2024-03-26 09:38:17 +02:00
Botond Dénes	2e2bf79092	Merge '[Backport 5.2] tests: utils: error injection: print time duration instead of count' from ScyllaDB before this change, we always cast the wait duration to millisecond, even if it could be using a higher resolution. actually `std::chrono::steady_clock` is using `nanosecond` for its duration, so if we inject a deadline using `steady_clock`, we could be awaken earlier due to the narrowing of the duration type caused by the duration_cast. in this change, we just use the duration as it is. this should allow the caller to use the resolution provided by Seastar without losing the precision. the tests are updated to print the time duration instead of count to provide information with a higher resolution. Fixes #15902 (cherry picked from commit `8a5689e7a7`) (cherry picked from commit `1d33a68dd7`) Closes #17911 * github.com:scylladb/scylladb: tests: utils: error injection: print time duration instead of count error_injection: do not cast to milliseconds when injecting timeout	2024-03-25 17:41:23 +02:00
Pavel Emelyanov	616199f79c	Update seastar submodule (duplex IO queue activation fix) * seastar ad0f2d5d...eb093f8a (1): > fair_queue: Do not pop unplugged class immediately Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-03-25 12:58:24 +03:00
Wojciech Mitros	5dfb6c9ead	mv: adjust the overhead estimation for view updates In order to avoid running out of memory, we can't underestimate the memory used when processing a view update. Particularly, we need to handle the remote view updates well, because we may create many of them at the same time in contrast to local updates which are processed synchronously. After investigating a coredump generated in a crash caused by running out of memory due to these remote view updates, we found that the current estimation is much lower than what we observed in practice; we identified overhead of up to 2288 bytes for each remote view update. The overhead consists of: - 512 bytes - a write_response_handler - less than 512 bytes - excessive memory allocation for the mutation in bytes_ostream - 448 bytes - the apply_to_remote_endpoints coroutine started in mutate_MV() - 192 bytes - a continuation to the coroutine above - 320 bytes - the coroutine in result_parallel_for_each started in mutate_begin() - 112 bytes - a continuation to the coroutine above - 192 bytes - 5 unspecified allocations of 32, 32, 32, 48 and 48 bytes This patch changes the previous overhead estimate of 256 bytes to 2288 bytes, which should take into account all allocations in the current version of the code. It's worth noting that changes in the related pieces of code may result in a different overhead. The allocations seem to be mostly captures for the background tasks. Coroutines seem to allocate extra, however testing shows that replacing a coroutine with continuations may result in generating a few smaller futures/continuations with a larger total size. Besides that, considering that we're waiting for a response for each remote view update, we need the relatively large write_response_handler, which also includes the mutation in case we needed to reuse it. 
The change should not majorly affect workloads with many local updates because we don't keep many of them at the same time anyway, and an added benefit of correct memory utilization estimation is avoiding evictions of other memory that would be otherwise necessary to handle the excessive memory used by view updates. Fixes #17364 (cherry picked from commit `5ab3586135`) Closes #17858	2024-03-20 13:52:23 +02:00
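The per-allocation figures listed in the commit above can be checked against the 2288-byte total with a few compile-time constants. The constant names are made up for this sketch; only the byte values come from the commit message, and the `bytes_ostream` entry uses its stated upper bound of 512.

```cpp
#include <cassert>
#include <cstddef>

// Byte values as listed in the commit message; names are illustrative only.
constexpr std::size_t write_response_handler = 512;
constexpr std::size_t bytes_ostream_excess = 512;        // "less than 512", taken as the bound
constexpr std::size_t apply_to_remote_coroutine = 448;
constexpr std::size_t apply_to_remote_continuation = 192;
constexpr std::size_t parallel_for_each_coroutine = 320;
constexpr std::size_t parallel_for_each_continuation = 112;
constexpr std::size_t unspecified_allocations = 32 + 32 + 32 + 48 + 48;

constexpr std::size_t remote_view_update_overhead =
    write_response_handler + bytes_ostream_excess +
    apply_to_remote_coroutine + apply_to_remote_continuation +
    parallel_for_each_coroutine + parallel_for_each_continuation +
    unspecified_allocations;

static_assert(unspecified_allocations == 192, "five small allocations add up to 192 bytes");
static_assert(remote_view_update_overhead == 2288, "matches the estimate in the commit message");
```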
Kefu Chai	6209f5d6d4	tests: utils: error injection: print time duration instead of count instead of casting / comparing the count of duration unit, let's just compare the durations, so that boost.test is able to print the duration in a more informative and user friendly way (line wrapped) test/boost/error_injection_test.cc(167): fatal error: in "test_inject_future_disabled": critical check wait_time > sleep_msec has failed [23839ns <= 10ms] Refs #15902 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `1d33a68dd7`)	2024-03-20 09:40:16 +00:00
Kefu Chai	ac288684c6	error_injection: do not cast to milliseconds when injecting timeout before this change, we always cast the wait duration to millisecond, even if it could be using a higher resolution. actually `std::chrono::steady_clock` is using `nanosecond` for its duration, so if we inject a deadline using `steady_clock`, we could be awaken earlier due to the narrowing of the duration type caused by the duration_cast. in this change, we just use the duration as it is. this should allow the caller to use the resolution provided by Seastar without losing the precision. Fixes #15902 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `8a5689e7a7`)	2024-03-20 09:40:16 +00:00
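The narrowing effect described in the commit above can be reproduced with plain `std::chrono`, independent of Seastar. The 1500 microsecond value below is an arbitrary example, not one taken from the patch.

```cpp
#include <cassert>
#include <chrono>

using namespace std::chrono_literals;

// A timeout requested with sub-millisecond resolution.
constexpr auto requested = 1500us;

// duration_cast truncates toward zero, so the narrowed timeout is shorter
// than the one that was asked for and a waiter could be woken early.
constexpr auto narrowed = std::chrono::duration_cast<std::chrono::milliseconds>(requested);

static_assert(narrowed.count() == 1, "1500us narrows to 1ms");
static_assert(narrowed < requested, "the narrowed deadline expires 500us early");
```

Keeping the duration in its original type, as the commit does, avoids the truncation entirely.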
Raphael S. Carvalho	fc1d126f31	replica: Fix major compaction semantics by performing off-strategy first Major compaction semantics is that all data of a table will be compacted together, so user can expect e.g. a recently introduced tombstone to be compacted with the data it shadows. Today, it can happen that all data in maintenance set won't be included for major, until they're promoted into main set by off-strategy. So user might be left wondering why major is not having the expected effect. To fix this, let's perform off-strategy first, so data in maintenance set will be made available by major. A similar approach is done for data in memtable, so flush is performed before major starts. The only exception will be data in staging, which cannot be compacted until view building is done with it, to avoid inconsistency in view replicas. The serialization in compaction manager of reshape jobs guarantees correctness if there's an ongoing off-strategy on behalf of the table. Fixes #11915. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#15792 (cherry picked from commit `ea6c281b9f`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes #17901	2024-03-20 08:48:17 +02:00
Anna Stuchlik	70edebd8d7	doc: fix the image upgrade page This commit updates the Upgrade ScyllaDB Image page. - It removes the incorrect information that updating underlying OS packages is mandatory. - It adds information about the extended procedure for non-official images. (cherry picked from commit `fc90112b97`) Closes #17885	2024-03-19 16:47:06 +02:00
Petr Gusev	dffc0fb720	repair_meta: get_estimated_partitions fix The shard_range parameter was unused. Fixes: #17863 (cherry picked from commit `b9f527bfa8`)	2024-03-18 14:27:45 +02:00
Kamil Braun	0bb338c521	test: remove test_writes_to_recent_previous_cdc_generations The test in its original form relies on the `error_injections_at_startup` feature, which 5.2 doesn't have, so I adapted the test to enable error injections after bootstrapping nodes in the backport (`9c44bbce67`). That is however incorrect, it's important for the injection to be enabled while the nodes are booting, otherwise the test will be flaky, as we observed. Details in scylladb/scylladb#17749. Remove the test from 5.2 branch. Fixes scylladb/scylladb#17749 Closes #17750	2024-03-15 10:22:22 +02:00
Tomasz Grabiec	cefa19eb93	Merge 'migration_manager: take group0 lock during raft snapshot taking' from Kamil Braun This is a backport of `0c376043eb` and follow-up fix `57b14580f0` to 5.2. We haven't identified any specific issues in test or field in 5.2/2023.1 releases, but the bug should be fixed either way, it might bite us in unexpected ways. Closes #17640 * github.com:scylladb/scylladb: migration_manager: only jump to shard 0 in migration_request during group 0 snapshot transfer raft_group0_client: assert that hold_read_apply_mutex is called on shard 0 migration_manager: fix indentation after the previous patch. messaging_service: process migration_request rpc on shard 0 migration_manager: take group0 lock during raft snapshot taking	2024-03-14 23:41:02 +01:00
Nadav Har'El	08077ff3e8	alternator, mv: fix case of two new key columns in GSI A materialized view in CQL allows AT MOST ONE view key column that wasn't a key column in the base table. This is because if there were two or more of those, the "liveness" (timestamp, ttl) of these different columns can change at every update, and it's not possible to pick what liveness to use for the view row we create. We made an exception for this rule for Alternator: DynamoDB's API allows creating a GSI whose partition key and range key are both regular columns in the base table, and we must support this. We claim that the fact that Alternator allows neither TTL (Alternator's "TTL" is a different feature) nor user-defined timestamps, does allow picking the liveness for the view row we create. But we did it wrong! We claimed in a comment - and implemented in the code before this patch - that in Alternator we can assume that both GSI key columns will have the same liveness, and in particular timestamp. But this is only true if one modifies both columns together! In fact, in general it is not true: We can have two non-key attributes 'a' and 'b' which are the GSI's key columns, and we can modify only b, without modifying a, in which case the timestamp of the view modification should be b's newer timestamp, not a's older one. The existing code took a's timestamp, assuming it will be the same as b's, which is incorrect. The result was that if we repeatedly modify only b, all view updates will receive the same timestamp (a's old timestamp), and a deletion will always win over all the modifications. This patch includes a reproducing test written by a user (@Zak-Kent) that demonstrates how after a view row is deleted it doesn't get recreated - because all the modifications use the same timestamp. The fix is, as suggested above, to use the higher of the two timestamps of both base-regular-column GSI key columns as the timestamp for the new view rows or view row deletions. 
The reproducer that failed before this patch passes with it. As usual, the reproducer passes on AWS DynamoDB as well, proving that the test is correct and should really work. Fixes #17119 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17172 (cherry picked from commit `21e7deafeb`)	2024-03-13 15:08:32 +02:00
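The timestamp rule from the commit above can be sketched in a few lines. The alias and function names are hypothetical, and the visibility check is a simplification of the usual rule that a deletion wins over a write unless the write has a strictly newer timestamp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

using timestamp_type = std::int64_t; // hypothetical alias for this sketch

// Fixed behaviour as described: take the newer of the two GSI key column timestamps.
inline timestamp_type view_row_timestamp(timestamp_type a_ts, timestamp_type b_ts) {
    return std::max(a_ts, b_ts);
}

// Simplified visibility rule: a write survives a deletion only if it is strictly newer.
inline bool survives_deletion(timestamp_type write_ts, timestamp_type deletion_ts) {
    return write_ts > deletion_ts;
}
```

With `a` written at timestamp 100, the view row deleted at 150 and `b` rewritten at 250, the old behaviour stamps the view update with 100 and it loses to the deletion, while the fixed rule stamps it with 250 and the view row is recreated.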
Kamil Braun	405387c663	test: unflake test_topology_remove_garbage_group0 The test is booting nodes, and then immediately starts shutting down nodes and removing them from the cluster. The shutting down and removing may happen before driver manages to connect to all nodes in the cluster. In particular, the driver didn't yet connect to the last bootstrapped node. Or it can even happen that the driver has connected, but the control connection is established to the first node, and the driver fetched topology from the first node when the first node didn't yet consider the last node to be normal. So the driver decides to close connection to the last node like this: ``` 22:34:03.159 DEBUG> [control connection] Removing host not found in peers metadata: <Host: 127.42.90.14:9042 datacenter1> ``` Eventually, at the end of the test, only the last node remains, all other nodes have been removed or stopped. But the driver does not have a connection to that last node. Fix this problem by ensuring that: - all nodes see each other as NORMAL, - the driver has connected to all nodes at the beginning of the test, before we start shutting down and removing nodes. Fixes scylladb/scylladb#16373 (cherry picked from commit `a68701ed4f`) Closes #17703	2024-03-12 13:43:21 +01:00
Kamil Braun	b567364af1	migration_manager: only jump to shard 0 in migration_request during group 0 snapshot transfer Jumping to shard 0 during group 0 snapshot transfer is required because we take group 0 lock, only available on shard 0. But outside of Raft mode it only pessimizes performance unnecessarily, so don't do it.	2024-03-12 11:19:31 +01:00
Botond Dénes	3897d44893	repair: resolve start-up deadlock Repairs have to obtain a permit to the reader concurrency semaphore on each shard they have a presence on. This is prone to deadlocks: node1 node2 repair1_master (takes permit) repair1_follower (waits on permit) repair2_master (waits for permit) repair2_follower (takes permit) In lieu of strong central coordination, we solved this by making permits evictable: repair2 can evict repair1's permit so it can obtain one and make progress. This is not efficient as evicting a permit usually means discarding already done work, but it prevents the deadlocks. We recently discovered that there is a window when deadlocks can still happen. The permit is made evictable when the disk reader is created. This reader is an evictable one, which effectively makes the permit evictable. But the permit is obtained when the repair control structure -- repair meta -- is created. Between creating the repair meta and reading the first row from disk, the deadlock is still possible. And we know that what is possible, will happen (and did happen). Fix by making the permit evictable as soon as the repair meta is created. This is very clunky and we should have a better API for this (refs #17644), but for now we go with this simple patch, to make it easy to backport. Refs: #17644 Fixes: #17591 Closes #17646 (cherry picked from commit `c6e108a`) Backport notes: The fix above does not apply to 5.2, because on 5.2 the reader is created immediately when the repair-meta is created. So we don't need the game with a fake inactive read, we can just pause the already created reader in the repair-reader constructor. Closes #17730	2024-03-12 08:24:26 +02:00
Michał Chojnowski	54048e5613	sstables: fix a use-after-free in key_view::explode() key_view::explode() contains a blatant use-after-free: unless the input is already linearized, it returns a view to a local temporary buffer. This is rare, because partition keys are usually not large enough to be fragmented. But for a sufficiently large key, this bug causes a corrupted partition_key down the line. Fixes #17625 (cherry picked from commit `7a7b8972e5`) Closes #17725	2024-03-11 16:17:32 +02:00
Lakshmi Narayanan Sreethar	e8736ae431	reader_permit: store schema_ptr instead of raw schema pointer Store schema_ptr in reader permit instead of storing a const pointer to schema to ensure that the schema doesn't get changed elsewhere when the permit is holding on to it. Also update the constructors and all the relevant callers to pass down schema_ptr instead of a raw pointer. Fixes #16180 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#16658 (cherry picked from commit `76f0d5e35b`) Closes #17694	2024-03-08 10:56:12 +02:00
Gleb Natapov	42d25f1911	raft_group0_client: assert that hold_read_apply_mutex is called on shard 0 group0 operations are valid on shard 0 only. Assert that. (cherry picked from commit `9847e272f9`)	2024-03-05 16:51:13 +01:00
Gleb Natapov	619f75d1de	migration_manager: fix indentation after the previous patch. (cherry picked from commit `77907b97f1`)	2024-03-05 16:50:24 +01:00
Gleb Natapov	6dd31dcade	messaging_service: process migration_request rpc on shard 0 Commit `0c376043eb` added access to group0 semaphore which can be done on shard0 only. Unlike all other group0 rpcs (that are already always forwarded to shard0) migration_request is not, since it is an rpc that was reused from non-raft days. The patch adds the missing jump to shard0 before executing the rpc. (cherry picked from commit `4a3c79625f`)	2024-03-05 16:49:23 +01:00
Gleb Natapov	dd65bf151b	migration_manager: take group0 lock during raft snapshot taking Group0 state machine access atomicity is guaranteed by a mutex in group0 client. Code that reads or writes the state needs to hold the lock. To transfer schema part of the snapshot we used existing "migration request" verb which did not follow the rule. Fix the code to take group0 lock before accessing schema in case the verb is called as part of group0 snapshot transfer. Fixes scylladb/scylladb#16821 (cherry picked from commit `0c376043eb`) Backport note: introduced missing `raft_group0_client::hold_read_apply_mutex`	2024-03-05 16:40:00 +01:00
Yaron Kaikov	f4a7804596	release: prepare for 5.2.16 scylla-5.2.16	2024-03-03 14:33:44 +02:00
Botond Dénes	ba373f83e4	Merge '[Backport 5.2] repair: streaming: handle no_such_column_family from remote node' from Aleksandra Martyniuk RPC calls lose information about the type of returned exception. Thus, if a table is dropped on receiver node, but it still exists on a sender node and sender node streams the table's data, then the whole operation fails. To prevent that, add a method which synchronizes schema and then checks, if the exception was caused by table drop. If so, the exception is swallowed. Use the method in streaming and repair to continue them when the table is dropped in the meantime. Fixes: https://github.com/scylladb/scylladb/issues/17028. Fixes: https://github.com/scylladb/scylladb/issues/15370. Fixes: https://github.com/scylladb/scylladb/issues/15598. Closes #17528 * github.com:scylladb/scylladb: repair: handle no_such_column_family from remote node gracefully test: test drop table on receiver side during streaming streaming: fix indentation streaming: handle no_such_column_family from remote node gracefully repair: add methods to skip dropped table	2024-02-28 16:33:01 +02:00
Kamil Braun	d82c757323	Merge 'misc_services: fix data race from bad usage of get_next_version' from Piotr Dulikowski The function `gms::version_generator::get_next_version()` can only be called from shard 0 as it uses a global, unsynchronized counter to issue versions. Notably, the function is used as a default argument for the constructor of `gms::versioned_value` which is used from shorthand constructors such as `versioned_value::cache_hitrates`, `versioned_value::schema` etc. The `cache_hitrate_calculator` service runs a periodic job which updates the `CACHE_HITRATES` application state in the local gossiper state. Each time the job is scheduled, it runs on the next shard (it goes through shards in a round-robin fashion). The job uses the `versioned_value::cache_hitrates` shorthand to create a `versioned_value`, therefore risking a data race if it is not currently executing on shard 0. The PR fixes the race by moving the call to `versioned_value::cache_hitrates` to shard 0. Additionally, in order to help detect similar issues in the future, a check is introduced to `get_next_version` which aborts the process if the function was called on other shard than 0. There is a possibility that it is a fix for #17493. Because `get_next_version` uses a simple incrementation to advance the global counter, a data race can occur if two shards call it concurrently and it may result in shard 0 returning the same or smaller value when called two times in a row. The following sequence of events is suspected to occur on node A: 1. Shard 1 calls `get_next_version()`, loads version `v - 1` from the global counter and stores in a register; the thread then is preempted, 2. Shard 0 executes `add_local_application_state()` which internally calls `get_next_version()`, loads `v - 1` then stores `v` and uses version `v` to update the application state, 3. Shard 0 executes `add_local_application_state()` again, increments version to `v + 1` and uses it to update the application state, 4. 
Gossip message handler runs, exchanging application states with node B. It sends its application state to B. Note that the max version of any of the local application states is `v + 1`, 5. Shard 1 resumes and stores version `v` in the global counter, 6. Shard 0 executes `add_local_application_state()` and updates the application state - again - with version `v + 1`. 7. After that, node B will never learn about the application state introduced in point 6. as gossip exchange only sends endpoint states with version larger than the previous observed max version, which was `v + 1` in point 4. Note that the above scenario was _not_ reproduced. However, I managed to observe a race condition by: 1. modifying Scylla to run update of `CACHE_HITRATES` much more frequently than usual, 2. putting an assertion in `add_local_application_state` which fails if the version returned by `get_next_version` was not larger than the previous returned value, 3. running a test which performs schema changes in a loop. The assertion from the second point was triggered. While it's hard to tell how likely it is to occur without making updates of cache hitrates more frequent - not to mention the full theorized scenario - for now this is the best lead that we have, and the data race being fixed here is a real bug anyway. Refs: #17493 Closes scylladb/scylladb#17499 * github.com:scylladb/scylladb: version_generator: check that get_next_version is called on shard 0 misc_services: fix data race from bad usage of get_next_version (cherry picked from commit `fd32e2ee10`)	2024-02-28 14:28:03 +01:00
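The suspected interleaving from steps 1 to 6 above can be replayed deterministically on a plain counter. This is a single-threaded model written for illustration, not the `gms::version_generator` code; `replay_lost_update` is a hypothetical name.

```cpp
#include <cassert>
#include <vector>

// Replays the interleaving from the commit message on an unsynchronized counter.
// Returns the versions handed out to shard 0, in order.
inline std::vector<int> replay_lost_update(int v) {
    int counter = v - 1;                   // global counter before the sequence starts
    std::vector<int> shard0_versions;

    const int shard1_loaded = counter;     // step 1: shard 1 loads v - 1, then is preempted
    shard0_versions.push_back(++counter);  // step 2: shard 0 obtains v
    shard0_versions.push_back(++counter);  // step 3: shard 0 obtains v + 1
    counter = shard1_loaded + 1;           // step 5: shard 1 resumes and stores v
    shard0_versions.push_back(++counter);  // step 6: shard 0 obtains v + 1 again
    return shard0_versions;
}
```

For `v = 10` the sequence handed to shard 0 is 10, 11, 11: the last application state reuses a version a peer has already observed as the maximum, which is why that peer would never request it.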
Aleksandra Martyniuk	78aeb990a6	repair: handle no_such_column_family from remote node gracefully If no_such_column_family is thrown on remote node, then repair operation fails as the type of exception cannot be determined. Use repair::with_table_drop_silenced in repair to continue operation if a table was dropped. (cherry picked from commit `cf36015591`)	2024-02-28 11:46:02 +01:00
Aleksandra Martyniuk	23493bb342	test: test drop table on receiver side during streaming (cherry picked from commit `2ea5d9b623`)	2024-02-28 11:46:02 +01:00

35086 Commits