scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-06 23:13:15 +00:00

Author	SHA1	Message	Date
Botond Dénes	ea176bf4ce	test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_skip_mode_test The test becomes a lot shorter and it now uses random schema and random data. The test is also split in two: one test for abort mode and one for skip mode. Indentation is left broken, to be fixed in a future patch. (cherry picked from commit `5237e8133b`)	2024-04-24 08:45:53 -04:00
Botond Dénes	3835fd681d	test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_validate_mode_test The test becomes a lot shorter and it now uses random schema and random data. Indentation is left broken, to be fixed in a future patch. (cherry picked from commit `76785baf43`)	2024-04-24 07:57:54 -04:00
Botond Dénes	14da273c4c	test/boost/sstable_compaction_test: introduce scrub_test_framework Scrub tests require a lot of boilerplate code to work. This has a lot of disadvantages: * Tests are long * The "meat" of the test is lost between all the boiler-plate, it is hard to glean what a test actually does * Tests are hard to write, so we have only a few of them and they test multiple things. * The boiler-plate differs slightly from test-to-test. To solve this, this patch introduces a new class, `scrub_test_framework`, which is a central place for all the boiler-plate code needed to write scrub-related tests. In the next patches, we will migrate scrub related tests to this class. (cherry picked from commit `b6f0c4efa0`)	2024-04-24 07:57:54 -04:00
Botond Dénes	33d5f27244	test/lib/random_schema: add uncompatible_timestamp_generator() Guarantees that produced mutations will not be compactible. (cherry picked from commit `e412673c44`)	2024-04-24 07:57:54 -04:00
Asias He	10f137e367	repair: Improve estimated_partitions to reduce memory usage Currently, we use the sum of the estimated_partitions from each participant node as the estimated_partitions for sstable produced by repair. This way, the estimated_partitions is the biggest possible number of partitions repair would write. Since repair will write only the difference between repair participant nodes, using the biggest possible estimation will overestimate the partitions written by repair, most of the time. The problem is that overestimated partitions makes the bloom filter consume more memory. It is observed that it causes OOM in the field. This patch changes the estimation to use a fraction of the average partitions per node instead of sum. It is still not a perfect estimation but it already improves memory usage significantly. Fixes #18140 Closes scylladb/scylladb#18141 (cherry picked from commit `642f9a1966`) scylla-5.4.6 scylla-5.4.6-candidate	2024-04-18 11:40:06 +03:00
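To see why an overestimated partition count costs memory, here is a minimal sketch using the textbook bloom filter sizing formula, bits = -n ln(p) / (ln 2)^2. ScyllaDB's actual sizing code may differ; the function name, the partition counts, and the 1% false-positive chance used in the note below are illustrative assumptions, not values from the patch.

```cpp
#include <cmath>
#include <cstdint>

// Textbook estimate of bloom filter size in bytes for n elements and a
// target false-positive probability p (0 < p < 1). Illustrative only;
// this is not ScyllaDB's implementation.
inline double bloom_filter_bytes(std::uint64_t n, double p) {
    const double ln2 = std::log(2.0);
    const double bits = -static_cast<double>(n) * std::log(p) / (ln2 * ln2);
    return bits / 8.0;
}
```

Because the formula is linear in n, sizing the filter for the sum over three participants of 100 million partitions each, `bloom_filter_bytes(300000000, 0.01)`, needs three times the memory of sizing it for the per-node average, `bloom_filter_bytes(100000000, 0.01)`.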
Kamil Braun	53e1ed0ebb	Merge '[Backport 5.4] gossiper: lock local endpoint when updating heart_beat' from ScyllaDB In testing, we've observed multiple cases where nodes would fail to observe updated application states of other nodes in gossiper. For example: - in scylladb/scylladb#16902, a node would finish bootstrapping and enter NORMAL state, propagating this information through gossiper. However, other nodes would never observe that the node entered NORMAL state, still thinking that it is in joining state. This would lead to further bad consequences down the line. - in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for schema versions to converge. Convergence would never be achieved and the test eventually timed out. The node was observing outdated schema state of some existing node in gossip. I created a test that would bootstrap 3 nodes, then wait until they all observe each other as NORMAL, with timeout. Unfortunately, thousands of runs of this test on different machines failed to reproduce the problem. After banging my head against the wall failing to reproduce, I decided to sprinkle randomized sleeps across multiple places in gossiper code and finally: the test started catching the problem in about 1 in 1000 runs. With additional logging and additional head-banging, I determined the root cause. The following scenario can happen, 2 nodes are sufficient, let's call them A and B: - Node B calls `add_local_application_state` to update its gossiper state, for example, to propagate its new NORMAL status. - `add_local_application_state` takes a copy of the endpoint_state, and updates the copy: ``` auto local_state = ep_state_before; for (auto& p : states) { auto& state = p.first; auto& value = p.second; value = versioned_value::clone_with_higher_version(value); local_state.add_application_state(state, value); } ``` `clone_with_higher_version` bumps `version` inside gms/version_generator.cc. 
- `add_local_application_state` calls `gossiper.replicate(...)` - `replicate` works in 2 phases to achieve exception safety: in the first phase it copies the updated `local_state` to all shards into a separate map. In the second phase the values from the separate map are used to overwrite the endpoint_state map used for gossiping. Due to the cross-shard calls of the first phase, there is a yield before the second phase. During this yield the following happens: - `gossiper::run()` loop on B executes and bumps node B's `heart_beat`. This uses the monotonic version_generator, so it uses a higher version than the ones we used for states added above. Let's call this new version X. Note that X is larger than the versions used by application_states added above. - now node B handles a SYN or ACK message from node A, creating an ACK or ACK2 message in response. This message contains: - old application states (NOT including the update described above, because `replicate` is still sleeping before phase 2), - but bumped heart_beat == X from `gossiper::run()` loop, and sends the message. - node A receives the message and remembers that the max version across all states (including heart_beat) of node B is X. This means that it will no longer request or apply states from node B with versions smaller than X. - `gossiper.replicate(...)` on B wakes up, and overwrites endpoint_state with the ones it saved in phase 1. In particular it reverts heart_beat back to a smaller value, but the larger problem is that it saves updated application_states that use versions smaller than X. - now when node B sends the updated application_states in ACK or ACK2 message to node A, node A will ignore them, because their versions are smaller than X. Or node B will never send them, because whenever node A requests states from node B, it only requests states with versions > X. Either way, node A will fail to observe new states of node B.
If I understand correctly, this is a regression introduced in `38c2347a3c`, which introduced a yield in `replicate`. Before that, the updated state would be saved atomically on shard 0, there could be no `heart_beat` bump in-between making a copy of the local state, updating it, and then saving it. With the description above, it's easy to make a consistent reproducer for the problem -- introduce a longer sleep in `add_local_application_state` before second phase of replicate, to increase the chance that gossiper loop will execute and bump heart_beat version during the yield. Further commit adds a test based on that. The fix is to bump the heart_beat under local endpoint lock, which is also taken by `replicate`. The PR also adds a regression test. Fixes: scylladb/scylladb#15393 Fixes: scylladb/scylladb#15602 Fixes: scylladb/scylladb#16668 Fixes: scylladb/scylladb#16902 Fixes: scylladb/scylladb#17493 Fixes: scylladb/scylladb#18118 Ref: scylladb/scylla-enterprise#3720 (cherry picked from commit `a0b331b310`) (cherry picked from commit `72955093eb`) Refs scylladb/scylladb#18184 Closes scylladb/scylladb#18245 * github.com:scylladb/scylladb: test: reproducer for missing gossiper updates gossiper: lock local endpoint when updating heart_beat	2024-04-17 17:50:30 +02:00
Botond Dénes	1aedc7372d	Merge '[Backport 5.4] : Track and limit memory used by bloom filters' from Lakshmi Narayanan Sreethar Added support to track and limit the memory usage by sstable components. A reclaimable component of an SSTable is one from which memory can be reclaimed. SSTables and their managers now track such reclaimable memory and limit the component memory usage accordingly. A new configuration variable defines the memory reclaim threshold. If the total memory of the reclaimable components exceeds this limit, memory will be reclaimed to keep the usage under the limit. This PR considers only the bloom filters as reclaimable and adds support to track and limit them as required. The feature can be manually verified by doing the following : 1. run a single-node single-shard 1GB cluster 2. create a table with bloom-filter-false-positive-chance of 0.001 (to intentionally cause large bloom filter) 3. populate with tiny partitions 4. watch the bloom filter metrics get capped at 100MB The default value of the `components_memory_reclaim_threshold` config variable which controls the reclamation process is `.1`. This can also be reduced further during manual tests to easily hit the threshold and verify the feature. Fixes https://github.com/scylladb/scylladb/issues/17747 Backported from #17771 to 5.4. Closes scylladb/scylladb#18248 * github.com:scylladb/scylladb: test_bloom_filter.py: disable reclaiming memory from components sstable_datafile_test: add tests to verify auto reclamation of components test/lib: allow overriding available memory via test_env_config sstables_manager: support reclaiming memory from components sstables_manager: store available memory size sstables_manager: add variable to track component memory usage db/config: add a new variable to limit memory used by table components sstable_datafile_test: add testcase to verify reclamation from sstables sstables: support reclaiming memory from components	2024-04-17 14:33:19 +03:00
Kamil Braun	28781ca37e	test: reproducer for missing gossiper updates Regression test for scylladb/scylladb#17493. (cherry picked from commit `72955093eb`) Backport note: removed `timeout` parameter passed to `server_add`, missing on this branch. (If server adding hangs, it will timeout after `TOPOLOGY_TIMEOUT` from scylla_cluster.py) Removed `force_gossip_join_boot` error injection from test, not present in this branch. Starting nodes with `experimental_features` disabled. Added missing `handle_state_normal.*finished` message.	2024-04-17 13:09:39 +02:00
Beni Peled	9218bbb9b9	test.py: add the pytest junit_suite_name parameter By default the suitename in the junit files generated by pytest is `pytest` for all suites instead of the actual suite name, e.g. `topology_experimental_raft`. With this change, the junit files will use the real suitename. This change doesn't affect the Test Report in Jenkins, but it came up as part of another task, publishing the test results to elasticsearch https://github.com/scylladb/scylla-pkg/pull/3950, where we parse the XMLs and need the correct suitename Closes scylladb/scylladb#18172 (cherry picked from commit `223275b4d1`)	2024-04-17 07:02:33 +03:00
Kamil Braun	8faef135cb	gossiper: lock local endpoint when updating heart_beat In testing, we've observed multiple cases where nodes would fail to observe updated application states of other nodes in gossiper. For example: - in scylladb/scylladb#16902, a node would finish bootstrapping and enter NORMAL state, propagating this information through gossiper. However, other nodes would never observe that the node entered NORMAL state, still thinking that it is in joining state. This would lead to further bad consequences down the line. - in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for schema versions to converge. Convergence would never be achieved and the test eventually timed out. The node was observing outdated schema state of some existing node in gossip. I created a test that would bootstrap 3 nodes, then wait until they all observe each other as NORMAL, with timeout. Unfortunately, thousands of runs of this test on different machines failed to reproduce the problem. After banging my head against the wall failing to reproduce, I decided to sprinkle randomized sleeps across multiple places in gossiper code and finally: the test started catching the problem in about 1 in 1000 runs. With additional logging and additional head-banging, I determined the root cause. The following scenario can happen, 2 nodes are sufficient, let's call them A and B: - Node B calls `add_local_application_state` to update its gossiper state, for example, to propagate its new NORMAL status. - `add_local_application_state` takes a copy of the endpoint_state, and updates the copy: ``` auto local_state = ep_state_before; for (auto& p : states) { auto& state = p.first; auto& value = p.second; value = versioned_value::clone_with_higher_version(value); local_state.add_application_state(state, value); } ``` `clone_with_higher_version` bumps `version` inside gms/version_generator.cc. 
- `add_local_application_state` calls `gossiper.replicate(...)` - `replicate` works in 2 phases to achieve exception safety: in the first phase it copies the updated `local_state` to all shards into a separate map. In the second phase the values from the separate map are used to overwrite the endpoint_state map used for gossiping. Due to the cross-shard calls of the first phase, there is a yield before the second phase. During this yield the following happens: - `gossiper::run()` loop on B executes and bumps node B's `heart_beat`. This uses the monotonic version_generator, so it uses a higher version than the ones we used for states added above. Let's call this new version X. Note that X is larger than the versions used by application_states added above. - now node B handles a SYN or ACK message from node A, creating an ACK or ACK2 message in response. This message contains: - old application states (NOT including the update described above, because `replicate` is still sleeping before phase 2), - but bumped heart_beat == X from `gossiper::run()` loop, and sends the message. - node A receives the message and remembers that the max version across all states (including heart_beat) of node B is X. This means that it will no longer request or apply states from node B with versions smaller than X. - `gossiper.replicate(...)` on B wakes up, and overwrites endpoint_state with the ones it saved in phase 1. In particular it reverts heart_beat back to a smaller value, but the larger problem is that it saves updated application_states that use versions smaller than X. - now when node B sends the updated application_states in ACK or ACK2 message to node A, node A will ignore them, because their versions are smaller than X. Or node B will never send them, because whenever node A requests states from node B, it only requests states with versions > X. Either way, node A will fail to observe new states of node B.
If I understand correctly, this is a regression introduced in `38c2347a3c`, which introduced a yield in `replicate`. Before that, the updated state would be saved atomically on shard 0, there could be no `heart_beat` bump in-between making a copy of the local state, updating it, and then saving it. With the description above, it's easy to make a consistent reproducer for the problem -- introduce a longer sleep in `add_local_application_state` before second phase of replicate, to increase the chance that gossiper loop will execute and bump heart_beat version during the yield. Further commit adds a test based on that. The fix is to bump the heart_beat under local endpoint lock, which is also taken by `replicate`. Fixes: scylladb/scylladb#15393 Fixes: scylladb/scylladb#15602 Fixes: scylladb/scylladb#16668 Fixes: scylladb/scylladb#16902 Fixes: scylladb/scylladb#17493 Fixes: scylladb/scylladb#18118 Ref: scylladb/scylla-enterprise#3720 (cherry picked from commit `a0b331b310`)	2024-04-16 14:39:05 +02:00
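The interleaving described in this commit message can be replayed in a toy model. The sketch below is not ScyllaDB code: `version_generator`, `endpoint_state`, and `peer_observes_new_status` are simplified, hypothetical stand-ins that only track version numbers, and the "fixed" branch assumes the heart_beat bump simply runs after `replicate` finishes because both take the same lock.

```cpp
#include <algorithm>
#include <cstdint>
#include <map>
#include <string>

// Toy monotonic version source (stand-in for gms/version_generator).
struct version_generator {
    std::int64_t last = 0;
    std::int64_t next() { return ++last; }
};

// Toy endpoint state: only the versions matter for this scenario.
struct endpoint_state {
    std::int64_t heart_beat = 0;
    std::map<std::string, std::int64_t> app_state_versions;

    std::int64_t max_version() const {
        std::int64_t m = heart_beat;
        for (const auto& kv : app_state_versions) {
            m = std::max(m, kv.second);
        }
        return m;
    }
};

// Returns true if node A ends up observing node B's new STATUS.
inline bool peer_observes_new_status(bool heart_beat_bumped_during_yield) {
    version_generator gen;
    endpoint_state published;        // what node B currently gossips
    std::int64_t max_seen_by_a = 0;  // highest version of B recorded by A

    // add_local_application_state: copy the state, update the copy.
    endpoint_state local_copy = published;
    local_copy.app_state_versions["STATUS"] = gen.next();

    if (heart_beat_bumped_during_yield) {
        // replicate() yields; the gossip loop bumps heart_beat to X and
        // B answers A with the old states plus the new heart_beat.
        published.heart_beat = gen.next();
        max_seen_by_a = published.max_version();
    }

    // Phase 2 of replicate(): overwrite the published state with the copy.
    published = local_copy;

    if (!heart_beat_bumped_during_yield) {
        // Bump serialized behind the same lock: it runs only now.
        published.heart_beat = gen.next();
    }

    // A only requests and applies states newer than what it has seen.
    return published.app_state_versions.at("STATUS") > max_seen_by_a;
}
```

In this model `peer_observes_new_status(true)` returns false (STATUS has version 1, but A has already recorded X = 2), while `peer_observes_new_status(false)` returns true.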
Lakshmi Narayanan Sreethar	75962d3e94	test_bloom_filter.py: disable reclaiming memory from components Disabled reclaiming memory from sstable components in the testcase as it interferes with the false positive calculation. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `d86505e399`)	2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar	034304127c	sstable_datafile_test: add tests to verify auto reclamation of components Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `d261f0fbea`)	2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar	95068d3c00	test/lib: allow overriding available memory via test_env_config Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `169629dd40`)	2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar	3dca49c524	sstables_manager: support reclaiming memory from components Reclaim memory from the SSTable that has the most reclaimable memory if the total reclaimable memory has crossed the threshold. Only the bloom filter memory is considered reclaimable for now. Fixes #17747 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `a36965c474`)	2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar	505a0714a6	sstables_manager: store available memory size The available memory size is required to calculate the reclaim memory threshold, so store that within the sstables manager. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `2ca4b0a7a2`)	2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar	b8bd650e32	sstables_manager: add variable to track component memory usage sstables_manager::_total_reclaimable_memory variable tracks the total memory that is reclaimable from all the SSTables managed by it. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `f05bb4ba36`)	2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar	fd0b083414	db/config: add a new variable to limit memory used by table components A new configuration variable, components_memory_reclaim_threshold, has been added to configure the maximum allowed percentage of available memory for all SSTable components in a shard. If the total memory usage exceeds this threshold, it will be reclaimed from the components to bring it back under the limit. Currently, only the memory used by the bloom filters will be restricted. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `e8026197d2`)	2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar	1609b77b45	sstable_datafile_test: add testcase to verify reclamation from sstables Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `e0b6186d16`)	2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar	aec4d157da	sstables: support reclaiming memory from components Added support to track total memory from components that are reclaimable and to reclaim memory from them if and when required. Right now only the bloom filters are considered as reclaimable components but this can be extended to any component in the future. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `4f0aee62d1`)	2024-04-16 13:05:36 +05:30
Tzach Livyatan	80d7bf7366	Update Driver root page The right term is Amazon DynamoDB not AWS DynamoDB See https://aws.amazon.com/dynamodb/ Closes scylladb/scylladb#18214 (cherry picked from commit `289793d964`)	2024-04-16 09:53:53 +03:00
Pavel Emelyanov	0dc50ac449	view_update_generator: Unplug from database later Patch `967ebacaa4` (view_update_generator: Move abort kicking to do_abort()) moved unplugging v.u.g from database from .stop() to .do_abort(). The latter call happens very early on stop -- once scylla receives SIGINT. However, database may still need v.u.g. plugged to flush views. This patch moves unplug to later, namely to .stop() method of v.u.g. which happens after database is drained and should no longer continue view updates. fixes: #16001 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18132 (cherry picked from commit `3471f30b58`)	2024-04-15 14:33:56 +03:00
Botond Dénes	e9d38b3c9e	Merge '[Backport 5.4] repair: fix memory counting in repair' from ScyllaDB Repair memory limit includes only the size of frozen mutation fragments in repair row. The size of other members of repair row may grow uncontrollably and cause out of memory. Modify what's counted to repair memory limit. Fixes: #16710. (cherry picked from commit `a4dc6553ab`) (cherry picked from commit `51c09a84cc`) Refs #17785 Closes scylladb/scylladb#18205 * github.com:scylladb/scylladb: test: add test for repair_row::size() repair: fix memory accounting in repair_row	2024-04-15 14:05:43 +03:00
Aleksandra Martyniuk	bfc4104eb9	test: add test for repair_row::size() Add test which checks whether repair_row::size() considers external memory. (cherry picked from commit `51c09a84cc`)	2024-04-06 22:44:51 +00:00
Aleksandra Martyniuk	c1c1fde90f	repair: fix memory accounting in repair_row In repair, only the size of frozen mutation fragments of repair row is counted to the memory limit. So, huge keys of repair rows may lead to OOM. Include other repair_row's members' memory size in repair memory limit. (cherry picked from commit `a4dc6553ab`)	2024-04-06 22:44:51 +00:00
Ferenc Szili	94a551e671	logging: Don't log PK/CK in large partition/row/cell warning Currently, Scylla logs a warning when it writes a cell, row or partition that is larger than certain configured sizes. These warnings contain the partition key and, in the case of rows and cells, also the cluster key, which allow the large row or partition to be identified. However, these keys can contain user-private, sensitive information. The information which identifies the partition/row/cell is also inserted into tables system.large_partitions, system.large_rows and system.large_cells respectively. This change removes the partition and cluster keys from the log messages, but still inserts them into the system tables. The logged data will look like this: Large cells: WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large cell ks_name/tbl_name: cell_name (SIZE bytes) to sstable.db Large rows: WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large row ks_name/tbl_name: (SIZE bytes) to sstable.db Large partitions: WARN 2024-04-02 16:49:48,602 [shard 3: mt] large_data - Writing large partition ks_name/tbl_name: (SIZE bytes) to sstable.db Fixes #18041 Closes scylladb/scylladb#18166 (cherry picked from commit `f1cc6252fd`)	2024-04-05 16:02:22 +03:00
Kefu Chai	d6c7a26419	utils/logalloc: do not allocate memory in reclaim_timer::report() before this change, `reclaim_timer::report()` calls ```c++ fmt::format(", at {}", current_backtrace()) ``` which allocates a `std::string` on heap, so it can fail and throw. in that case, `std::terminate()` is called. but at that moment, the reason why `reclaim_timer::report()` gets called is that we fail to reclaim memory for the caller. so we are more likely to run into this issue. anyway, we should not allocate memory in this path. in this change, a dedicated printer is created so that we don't format to a temporary `std::string`, and instead write directly to the buffer of logger. this avoids the memory allocation. Fixes #18099 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18100 (cherry picked from commit `fcf7ca5675`)	2024-04-02 15:13:49 +03:00
Kamil Braun	7e05a54b9c	schema_tables: pass `reload` flag when calling `merge_schema` cross-shard In `0c86abab4d` `merge_schema` obtained a new flag, `reload`. Unfortunately, the flag was assigned a default value, which I think is almost always a bad idea, and indeed it was in this case. When `merge_schema` is called on shard different than 0, it recursively calls itself on shard 0. That recursive call forgot to pass the `reload` flag. Fix this. (cherry picked from commit `5223d32fab`)	2024-04-02 12:53:58 +02:00
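The pitfall described here can be reproduced in a few lines. The sketch below is a simplified stand-in, not the real `merge_schema` signature: the function names, the `shard` parameter, and the return value are hypothetical and exist only to show how a defaulted flag gets dropped on a forwarding call.

```cpp
// Hypothetical stand-ins for the real function. Each returns the flag
// value that finally reaches shard 0, so the difference is observable.
inline bool merge_schema_buggy(unsigned shard, bool reload = false) {
    if (shard != 0) {
        // The forwarding call omits `reload`, so the default (false)
        // silently replaces whatever the caller passed.
        return merge_schema_buggy(0);
    }
    return reload;
}

inline bool merge_schema_fixed(unsigned shard, bool reload) {
    if (shard != 0) {
        // The flag is forwarded explicitly.
        return merge_schema_fixed(0, reload);
    }
    return reload;
}
```

Here `merge_schema_buggy(1, true)` yields false while `merge_schema_fixed(1, true)` yields true. Dropping the default argument, as the second function does, is one way to turn the omission into a compile error; the commit message itself only states that the flag is now passed.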
Tzach Livyatan	2fa581d8fb	Docs: Fix link from scylla-sstable.rst to /architecture/sstable/ Fix https://github.com/scylladb/scylladb/issues/18096 Closes scylladb/scylladb#18097 (cherry picked from commit `4930095d39`)	2024-04-02 13:46:43 +03:00
Wojciech Mitros	892c97295b	mv: adjust the overhead estimation for view updates In order to avoid running out of memory, we can't underestimate the memory used when processing a view update. Particularly, we need to handle the remote view updates well, because we may create many of them at the same time in contrast to local updates which are processed synchronously. After investigating a coredump generated in a crash caused by running out of memory due to these remote view updates, we found that the current estimation is much lower than what we observed in practice; we identified overhead of up to 2288 bytes for each remote view update. The overhead consists of: - 512 bytes - a write_response_handler - less than 512 bytes - excessive memory allocation for the mutation in bytes_ostream - 448 bytes - the apply_to_remote_endpoints coroutine started in mutate_MV() - 192 bytes - a continuation to the coroutine above - 320 bytes - the coroutine in result_parallel_for_each started in mutate_begin() - 112 bytes - a continuation to the coroutine above - 192 bytes - 5 unspecified allocations of 32, 32, 32, 48 and 48 bytes This patch changes the previous overhead estimate of 256 bytes to 2288 bytes, which should take into account all allocations in the current version of the code. It's worth noting that changes in the related pieces of code may result in a different overhead. The allocations seem to be mostly captures for the background tasks. Coroutines seem to allocate extra, however testing shows that replacing a coroutine with continuations may result in generating a few smaller futures/continuations with a larger total size. Besides that, considering that we're waiting for a response for each remote view update, we need the relatively large write_response_handler, which also includes the mutation in case we needed to reuse it. 
The change should not majorly affect workloads with many local updates because we don't keep many of them at the same time anyway, and an added benefit of correct memory utilization estimation is avoiding evictions of other memory that would be otherwise necessary to handle the excessive memory used by view updates. Fixes #17364 (cherry picked from commit `4c767c379c`) Closes scylladb/scylladb#18105	2024-04-02 10:26:28 +02:00
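The itemized sizes in this commit message do add up to the new estimate. Below is a small compile-time check, with the "less than 512 bytes" item taken at its 512-byte upper bound (an assumption made here so that the total matches). The identifiers are illustrative and do not come from the ScyllaDB source.

```cpp
#include <cstddef>

// Per-allocation sizes in bytes, copied from the commit message.
// The bytes_ostream slack ("less than 512 bytes") is counted as 512.
constexpr std::size_t overhead_items[] = {512, 512, 448, 192, 320, 112, 192};

constexpr std::size_t total_overhead() {
    std::size_t sum = 0;
    for (std::size_t bytes : overhead_items) {
        sum += bytes;
    }
    return sum;
}

static_assert(total_overhead() == 2288, "itemized overhead should total 2288 bytes");

// Estimated memory for a number of in-flight remote view updates,
// given a per-update overhead in bytes.
constexpr std::size_t estimated_in_flight(std::size_t updates, std::size_t per_update) {
    return updates * per_update;
}
```

For 1000 concurrent remote view updates, the old 256-byte figure accounts for 256,000 bytes of overhead, whereas the new figure accounts for 2,288,000 bytes.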
Jenkins Promoter	5f6b1dc5b9	Update ScyllaDB version to: 5.4.6	2024-04-01 23:37:05 +03:00
Kefu Chai	f1d547e74e	bytes.hh: stop at '}' in fmt::formatter<fmt_hex> according to {fmt}'s document at https://fmt.dev/latest/api.html#formatting-user-defined-types, ``` // the range will contain "f} continued". The formatter should parse // specifiers until '}' or the end of the range. In this example the // formatter should parse the 'f' specifier and return an iterator // pointing to '}'. ``` so we should check for _both_ '}' and end of the range. when building scylla with {fmt} 10.2.1, it fails to build code like ```c++ fmt::format_to(out, "{}", fmt_hex(frag)) ``` as {fmt}'s compile-time checker fails to parse this format string along with given argument, as at compile time, ```c++ throw format_error("invalid group_size") ``` is executed. so, in this change, we check both '}' and the end of range. the change which introduced this formatter was `2f9dfba800` Refs `2f9dfba800` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `ef2afa75d1`) Closes scylladb/scylladb#18086	2024-03-29 09:45:00 +02:00
Michał Chojnowski	3df5de60a9	cache_flat_mutation_reader: only call get_iterator_in_latest() when pointing at a row Calling `_next_row.get_iterator_in_latest()` is illegal when `_next_row` is not pointing at a row. In particular, the iterator returned by such call might be dangling. We have observed this to cause a use-after-free in the field, when a reverse read called `maybe_add_to_cache` after `_latest_it` was left dangling after a dead row removal in `copy_from_cache_to_buffer`. To fix this, we should ensure that we only call `_next_row.get_iterator_in_latest()` when `_next_row` is pointing at a row. Only the occurrences of this problem in `maybe_add_to_cache` are truly dangerous. As far as I can see, other occurrences can't break anything as of now. But we apply fixes to them anyway. (cherry picked from commit `04db6d4bb1`) Closes scylladb/scylladb#18075	2024-03-28 11:04:28 +01:00
Botond Dénes	fd7d57b9fa	tools/toolchain: update python driver Backports scylladb/scylladb#17604 and scylladb/scylladb#17956. Fixes scylladb/scylladb#16709 Fixes scylladb/scylladb#17353 Closes scylladb/scylladb#17653 scylla-5.4.5-candidate scylla-5.4.5	2024-03-26 13:27:34 +02:00
Botond Dénes	8a6c543033	Merge '[Backport 5.4] tests: utils: error injection: print time duration instead of count' from ScyllaDB before this change, we always cast the wait duration to millisecond, even if it could be using a higher resolution. actually `std::chrono::steady_clock` is using `nanosecond` for its duration, so if we inject a deadline using `steady_clock`, we could be awakened earlier due to the narrowing of the duration type caused by the duration_cast. in this change, we just use the duration as it is. this should allow the caller to use the resolution provided by Seastar without losing the precision. the tests are updated to print the time duration instead of count to provide information with a higher resolution. Fixes #15902 (cherry picked from commit `8a5689e7a7`) (cherry picked from commit `1d33a68dd7`) Closes scylladb/scylladb#17909 * github.com:scylladb/scylladb: tests: utils: error injection: print time duration instead of count error_injection: do not cast to milliseconds when injecting timeout	2024-03-25 17:40:31 +02:00
Pavel Emelyanov	27beb0fe60	Update seastar submodule (duplex IO queue activation fix) * seastar 9d44e5eb...ae05c138 (1): > fair_queue: Do not pop unplugged class immediately Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-03-25 13:01:07 +03:00
Botond Dénes	e868ade258	repair: resolve start-up deadlock Repairs have to obtain a permit to the reader concurrency semaphore on each shard they have a presence on. This is prone to deadlocks: node1 node2 repair1_master (takes permit) repair1_follower (waits on permit) repair2_master (waits for permit) repair2_follower (takes permit) In lieu of strong central coordination, we solved this by making permits evictable: repair2 can evict repair1's permit so it can obtain one and make progress. This is not efficient as evicting a permit usually means discarding already done work, but it prevents the deadlocks. We recently discovered that there is a window when deadlocks can still happen. The permit is made evictable when the disk reader is created. This reader is an evictable one, which effectively makes the permit evictable. But the permit is obtained when the repair control structure -- repair meta -- is created. Between creating the repair meta and reading the first row from disk, the deadlock is still possible. And we know that what is possible, will happen (and did happen). Fix by making the permit evictable as soon as the repair meta is created. This is very clunky and we should have a better API for this (refs #17644), but for now we go with this simple patch, to make it easy to backport. Refs: #17644 Fixes: #17591 Closes scylladb/scylladb#17646 (cherry picked from commit `65b9e10543`) Closes scylladb/scylladb#17950	2024-03-21 13:12:40 +02:00
Kefu Chai	57fa61e2ca	tests: utils: error injection: print time duration instead of count instead of casting / comparing the count of duration unit, let's just compare the durations, so that boost.test is able to print the duration in a more informative and user friendly way (line wrapped) test/boost/error_injection_test.cc(167): fatal error: in "test_inject_future_disabled": critical check wait_time > sleep_msec has failed [23839ns <= 10ms] Refs #15902 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `1d33a68dd7`)	2024-03-20 09:32:00 +00:00
Kefu Chai	6fc3c62223	error_injection: do not cast to milliseconds when injecting timeout before this change, we always cast the wait duration to milliseconds, even if it could be using a higher resolution. actually `std::chrono::steady_clock` is using `nanosecond` for its duration, so if we inject a deadline using `steady_clock`, we could be awakened earlier due to the narrowing of the duration type caused by the duration_cast. in this change, we just use the duration as it is. this should allow the caller to use the resolution provided by Seastar without losing the precision. Fixes #15902 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `8a5689e7a7`)	2024-03-20 09:32:00 +00:00
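The narrowing described in the entry above can be reproduced with the standard library alone. The following is a minimal sketch, not the ScyllaDB code: `narrowed_wait` and `exact_wait` are hypothetical helpers introduced only for illustration. It shows that `std::chrono::duration_cast` to milliseconds truncates toward zero, so a wait computed from a nanosecond-resolution deadline comes out shorter than requested.

```cpp
#include <cassert>
#include <chrono>

using namespace std::chrono_literals;

// Hypothetical helper (not ScyllaDB code): mimics narrowing a wait
// duration to whole milliseconds, as the commit message describes.
constexpr std::chrono::milliseconds narrowed_wait(std::chrono::nanoseconds wait) {
    // duration_cast truncates toward zero, dropping the sub-millisecond part.
    return std::chrono::duration_cast<std::chrono::milliseconds>(wait);
}

// Hypothetical helper: passes the duration through at its own resolution.
constexpr std::chrono::nanoseconds exact_wait(std::chrono::nanoseconds wait) {
    return wait;
}

// A wait of 10.999999 ms is narrowed to 10 ms, i.e. the sleeper may wake early.
static_assert(narrowed_wait(10'999'999ns) == 10ms, "truncated to whole milliseconds");
static_assert(narrowed_wait(10'999'999ns) < 10'999'999ns, "narrowed wait is shorter");
static_assert(exact_wait(10'999'999ns) == 10'999'999ns, "no precision lost");

// Runtime variant of the same checks, for use from a test or main().
inline void check_narrowing() {
    assert(narrowed_wait(999'999ns) == 0ms);  // sub-millisecond waits vanish entirely
    assert(exact_wait(999'999ns) == 999'999ns);
}
```

Comparing durations directly, as the last assertions do, also matches the companion test change: the mixed-unit comparison is done by `std::chrono` itself instead of on raw tick counts.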
Anna Stuchlik	8414aa292a	doc: fix the image upgrade page This commit updates the Upgrade ScyllaDB Image page. - It removes the incorrect information that updating underlying OS packages is mandatory. - It adds information about the extended procedure for non-official images. (cherry picked from commit `fc90112b97`) Closes scylladb/scylladb#17886	2024-03-19 14:38:22 +02:00
Aleksandra Martyniuk	881ac7a9af	test: fix regular compaction tasks check Since `6b87778` regular compaction tasks are removed from task manager immediately after they are finished. test_regular_compaction_task lists compaction tasks and then requests their statuses. Only one regular compaction task is guaranteed to still be running at that time, the rest of them may finish before their status is requested and so it will no longer be in task manager, causing the test to fail. Fix statuses check to consider the possibility of a regular compaction task being removed from task manager. Fixes: #17776. (cherry picked from commit `80c5eb4ecb`) Closes scylladb/scylladb#17810	2024-03-15 08:54:00 +02:00
Tomasz Grabiec	24db04dbe4	Merge 'migration_manager: take group0 lock during raft snapshot taking' from Kamil Braun This is a backport of `0c376043eb` and follow-up fix `57b14580f0` to 5.4. We haven't identified any specific issues in test or field in 5.4/2024.1 releases, but the bug should be fixed either way, it might bite us in unexpected ways. For 5.4 it's even more important than 5.2 because in 5.4 in Raft mode schema pulls are disabled. Closes scylladb/scylladb#17641 * github.com:scylladb/scylladb: raft_group0_client: assert that hold_read_apply_mutex is called on shard 0 migration_manager: fix indentation after the previous patch. messaging_service: process migration_request rpc on shard 0 migration_manager: take group0 lock during raft snapshot taking	2024-03-14 23:42:49 +01:00
Jenkins Promoter	9fcb7baa9e	Update ScyllaDB version to: 5.4.5	2024-03-13 16:18:59 +02:00
Nadav Har'El	8a1f01ad88	alternator, mv: fix case of two new key columns in GSI A materialized view in CQL allows AT MOST ONE view key column that wasn't a key column in the base table. This is because if there were two or more of those, the "liveness" (timestamp, ttl) of these different columns can change at every update, and it's not possible to pick what liveness to use for the view row we create. We made an exception for this rule for Alternator: DynamoDB's API allows creating a GSI whose partition key and range key are both regular columns in the base table, and we must support this. We claim that the fact that Alternator allows neither TTL (Alternator's "TTL" is a different feature) nor user-defined timestamps, does allow picking the liveness for the view row we create. But we did it wrong! We claimed in a comment - and implemented in the code before this patch - that in Alternator we can assume that both GSI key columns will have the same liveness, and in particular timestamp. But this is only true if one modifies both columns together! In fact, in general it is not true: We can have two non-key attributes 'a' and 'b' which are the GSI's key columns, and we can modify only b, without modifying a, in which case the timestamp of the view modification should be b's newer timestamp, not a's older one. The existing code took a's timestamp, assuming it will be the same as b's, which is incorrect. The result was that if we repeatedly modify only b, all view updates will receive the same timestamp (a's old timestamp), and a deletion will always win over all the modifications. This patch includes a reproducing test written by a user (@Zak-Kent) that demonstrates how after a view row is deleted it doesn't get recreated - because all the modifications use the same timestamp. The fix is, as suggested above, to use the higher of the two timestamps of both base-regular-column GSI key columns as the timestamp for the new view rows or view row deletions. 
The reproducer that failed before this patch passes with it. As usual, the reproducer passes on AWS DynamoDB as well, proving that the test is correct and should really work. Fixes #17119 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17172 (cherry picked from commit `21e7deafeb`)	2024-03-13 14:46:03 +02:00
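The timestamp rule in the entry above can be illustrated with a small model. This is a sketch under stated assumptions, not the ScyllaDB implementation: `timestamp_type`, `view_timestamp_old`, `view_timestamp_fixed` and `row_is_live` are hypothetical names, and liveness is reduced to a single comparison in which a deletion wins a timestamp tie.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a write timestamp.
using timestamp_type = std::int64_t;

// Behaviour the commit message describes as wrong: always use the
// timestamp of the first GSI key column ('a'), ignoring 'b'.
constexpr timestamp_type view_timestamp_old(timestamp_type a_ts, timestamp_type /*b_ts*/) {
    return a_ts;
}

// Behaviour described as the fix: use the higher of the two timestamps.
constexpr timestamp_type view_timestamp_fixed(timestamp_type a_ts, timestamp_type b_ts) {
    return std::max(a_ts, b_ts);
}

// Simplified liveness model (assumption): a write survives a deletion
// only if it is strictly newer, so a tie goes to the deletion.
constexpr bool row_is_live(timestamp_type write_ts, timestamp_type deletion_ts) {
    return write_ts > deletion_ts;
}

// Scenario: 'a' written at 100; the view row is deleted while 'b' has
// timestamp 300; 'b' is then rewritten at 400.
// Old rule: the deletion and the rewrite both get timestamp 100, so the
// row stays deleted.
static_assert(!row_is_live(view_timestamp_old(100, 400), view_timestamp_old(100, 300)),
              "old rule: row is never recreated");
// Fixed rule: the rewrite gets 400 and the deletion 300, so the row is live.
static_assert(row_is_live(view_timestamp_fixed(100, 400), view_timestamp_fixed(100, 300)),
              "fixed rule: row is recreated");
```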
Raphael S. Carvalho	db1c8e8754	Fix potential data resurrection when another compaction type does cleanup work Since commit `f1bbf70`, many compaction types can do cleanup work, but turns out we forgot to invalidate cache on their completion. So if a node regains ownership of token that had partition deleted in its previous owner (and tombstone is already gone), data can be resurrected. Tablet is not affected, as it explicitly invalidates cache during migration cleanup stage. Scylla 5.4 is affected. Fixes #17501. Fixes #17452. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17502 (cherry picked from commit `f07c233ad5`)	2024-03-13 14:10:05 +02:00
Kamil Braun	6e01e821d7	test: unflake test_topology_remove_garbage_group0 The test is booting nodes, and then immediately starts shutting down nodes and removing them from the cluster. The shutting down and removing may happen before driver manages to connect to all nodes in the cluster. In particular, the driver didn't yet connect to the last bootstrapped node. Or it can even happen that the driver has connected, but the control connection is established to the first node, and the driver fetched topology from the first node when the first node didn't yet consider the last node to be normal. So the driver decides to close connection to the last node like this: ``` 22:34:03.159 DEBUG> [control connection] Removing host not found in peers metadata: <Host: 127.42.90.14:9042 datacenter1> ``` Eventually, at the end of the test, only the last node remains, all other nodes have been removed or stopped. But the driver does not have a connection to that last node. Fix this problem by ensuring that: - all nodes see each other as NORMAL, - the driver has connected to all nodes at the beginning of the test, before we start shutting down and removing nodes. Fixes scylladb/scylladb#16373 (cherry picked from commit `a68701ed4f`) Closes scylladb/scylladb#17702	2024-03-12 10:50:27 +01:00
Israel Fruchter	02182caff4	Update tools/cqlsh submodule * tools/cqlsh 426fa0ea...99b2b777 (1): > `COPY TO STDOUT` shouldn't put None where a function is expected Closes scylladb/scylladb#17728	2024-03-11 21:52:02 +02:00
Michał Chojnowski	42d9e36454	sstables: fix a use-after-free in key_view::explode() key_view::explode() contains a blatant use-after-free: unless the input is already linearized, it returns a view to a local temporary buffer. This is rare, because partition keys are usually not large enough to be fragmented. But for a sufficiently large key, this bug causes a corrupted partition_key down the line. Fixes #17625 (cherry picked from commit `7a7b8972e5`) Closes scylladb/scylladb#17717	2024-03-11 12:05:55 +02:00
Lakshmi Narayanan Sreethar	05d2078911	reader_permit: store schema_ptr instead of raw schema pointer Store schema_ptr in reader permit instead of storing a const pointer to schema to ensure that the schema doesn't get changed elsewhere when the permit is holding on to it. Also update the constructors and all the relevant callers to pass down schema_ptr instead of a raw pointer. Fixes #16180 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#16658 (cherry picked from commit `76f0d5e35b`) Closes scylladb/scylladb#17677	2024-03-08 08:29:22 +02:00
Gleb Natapov	7908f69b7c	raft_group0_client: assert that hold_read_apply_mutex is called on shard 0 group0 operations are valid on shard 0 only. Assert that. (cherry picked from commit `9847e272f9`)	2024-03-05 20:36:44 +01:00
Gleb Natapov	a762ab6283	migration_manager: fix indentation after the previous patch. (cherry picked from commit `77907b97f1`)	2024-03-05 20:36:34 +01:00

39537 Commits