scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-01 13:45:53 +00:00

Author	SHA1	Message	Date
Lakshmi Narayanan Sreethar	2e836fa077	db/config.cc: increment components_memory_reclaim_threshold config default Incremented the components_memory_reclaim_threshold config's default value to 0.2 as the previous value was too strict and caused unnecessary eviction in otherwise healthy clusters. Fixes #18607 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `3d7d1fa72a`) Closes scylladb/scylladb#19013	2024-06-04 07:11:43 +03:00
Botond Dénes	98139a8716	Merge '[Backport 5.4] : Reload reclaimed bloom filters when memory is available' from Lakshmi Narayanan Sreethar PR #17771 introduced a threshold for the total memory used by all bloom filters across SSTables. When the total usage surpasses the threshold, the largest bloom filter will be removed from memory, bringing the total usage back under the threshold. This PR adds support for reloading such reclaimed bloom filters back into memory when memory becomes available (i.e., within the 10% of available memory earmarked for the reclaimable components). The SSTables manager now maintains a list of all SSTables whose bloom filter was removed from memory and attempts to reload them when an SSTable, whose bloom filter is still in memory, gets deleted. The manager reloads from the smallest to the largest bloom filter to maximize the number of filters being reloaded into memory. Backported from https://github.com/scylladb/scylladb/pull/18186 to 5.4. Closes scylladb/scylladb#18660 * github.com:scylladb/scylladb: sstable_datafile_test: add testcase to test reclaim during reload sstable_datafile_test: add test to verify auto reload of reclaimed components sstables_manager: reload previously reclaimed components when memory is available sstables_manager: start a fiber to reload components sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables sstable_datafile_test: add test to verify reclaimed components reload sstables: support reloading reclaimed components sstables_manager: add new intrusive set to track the reclaimed sstables sstable: add link and comparator class to support new intrusive set sstable: renamed intrusive list link type sstable: track memory reclaimed from components per sstable sstable: rename local variable in sstable::total_reclaimable_memory_size	2024-05-30 11:09:51 +03:00
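Note on 98139a8716: the smallest-to-largest reload order can be illustrated with a minimal sketch. This is not the sstables_manager code; the function name, the dictionary of filter sizes and the byte budget are all hypothetical and only show why taking the smallest filters first maximizes the number of filters that fit.

```python
def pick_filters_to_reload(reclaimed_sizes, free_bytes):
    """Return the names of reclaimed filters to reload, smallest first.

    reclaimed_sizes: dict mapping a (hypothetical) sstable name to the size
    in bytes of its reclaimed bloom filter.
    free_bytes: memory currently available for reclaimable components.
    """
    reloaded = []
    for name, size in sorted(reclaimed_sizes.items(), key=lambda kv: kv[1]):
        if size > free_bytes:
            break  # every remaining filter is at least this large
        reloaded.append(name)
        free_bytes -= size
    return reloaded


reclaimed = {"sst-a": 50, "sst-b": 10, "sst-c": 30}
print(pick_filters_to_reload(reclaimed, 45))
```

With a budget of 45 bytes the two smallest filters fit, whereas starting from the largest would reload none.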
Nadav Har'El	4099833587	cql3, secondary index: consistently choose index to use in a query When a table has secondary indexes on multiple columns, and several such columns are used for filtering in a query, Scylla chooses one of these indexes as the main driver of the query, and the second column's restriction is implemented as filtering. Before this patch, the index to use was chosen fairly randomly, based on the order of the indexes in the schema. This order may be different in different coordinators, and may even change across restarts on the same coordinator. This is not only inconsistent, it can cause outright wrong results when using paging and switching (or restarting) coordinators in the middle of a paged scan... One coordinator saves one index's key in the paging state, and then the other coordinator gets this paging state and wrongly believes it is supposed to be a key of a different index. The fix in this patch is to pick the index suitable for the first indexed column mentioned in the query. This has two benefits over the situation before the patch: 1. The decision of which index to use no longer changes between coordinators or across restarts - it just depends on the schema and the specific query. 2. Different indexes can have different "specificity" so using one or the other can change the query's performance. After this patch, the user is in control over which index is used by changing the order of terms in the query. A curious user can use tracing to check which index was used to implement a particular query. An xfailing test we had for this issue no longer fails, so the "xfail" marker is removed. Fixes #7969 Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `77c61f907e`) Closes scylladb/scylladb#18963	2024-05-29 18:04:16 +03:00
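Note on 4099833587: the selection rule (use the index of the first indexed column mentioned in the query) can be sketched as below. The function and the dictionary that stands in for the schema's index list are hypothetical, not cql3 code; the point is that the result depends only on the order of terms in the query, not on the order in which the schema happens to list the indexes.

```python
def choose_index(restricted_columns, indexes_by_column):
    """Pick the index for the first indexed column named in the query.

    restricted_columns: columns in the order they appear in the WHERE clause.
    indexes_by_column: dict mapping column name to index name; its iteration
    order stands in for the (unstable) schema order and is never consulted.
    """
    for column in restricted_columns:
        if column in indexes_by_column:
            return indexes_by_column[column]
    return None  # no usable index, the query would need filtering only


schema_order_1 = {"a": "idx_a", "b": "idx_b"}
schema_order_2 = {"b": "idx_b", "a": "idx_a"}
print(choose_index(["b", "a"], schema_order_1))
print(choose_index(["b", "a"], schema_order_2))
```

Both calls choose the same index, so two coordinators with differently ordered schemas agree on what the paging state refers to.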
Nadav Har'El	0d4e22ef55	cql: fix hang during certain SELECT statements The function intersection(r1,r2) in statement_restrictions.cc is used when several WHERE restrictions were applied to the same column. For example, for "WHERE b<1 AND b<2" the intersection of the two ranges is calculated to be b<1. As noted in issue #18690, Scylla is inconsistent in where it allows or doesn't allow these intersecting restrictions. But where they are allowed they must be implemented correctly. And it turns out the function intersection() had a bug that caused it to sometimes enter an infinite loop - when the intent was only to call itself once with swapped parameters. This patch includes a test reproducing this bug, and a fix for the bug. The test hangs before the fix, and passes after the fix. While at it, I carefully reviewed the entire code used to implement the intersection() function to try to make sure that the bug we found was the only one. I also added a few more comments where I thought they were needed to understand complicated logic of the code. The bug, the fix and the test were originally discovered by Michał Chojnowski. Fixes #18688 Refs #18690 Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `27ab560abd`) Closes scylladb/scylladb#18717	2024-05-21 16:31:21 +03:00
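Note on 0d4e22ef55: the "call itself once with swapped parameters" pattern is easy to get wrong. The sketch below is not the code from statement_restrictions.cc; it uses a simplified, hypothetical model in which a restriction is either a frozenset of allowed values or a number `x` meaning `b < x`. The swap is guarded by a condition that is guaranteed to be false after swapping, which is what prevents unbounded recursion.

```python
def intersection(r1, r2):
    """Intersect two restrictions on the same column.

    A restriction is a frozenset of allowed values, or a number x
    standing for the upper bound "b < x" (hypothetical model).
    """
    r1_is_set = isinstance(r1, frozenset)
    r2_is_set = isinstance(r2, frozenset)
    if r1_is_set and r2_is_set:
        return r1 & r2
    if r1_is_set:
        # set AND upper bound: keep the values below the bound
        return frozenset(v for v in r1 if v < r2)
    if r2_is_set:
        # Swap exactly once: after the swap r1 is a set, so this branch
        # cannot be reached again.
        return intersection(r2, r1)
    # two upper bounds: "b < 1 AND b < 2" is "b < 1"
    return min(r1, r2)


print(intersection(1, 2))
print(sorted(intersection(2, frozenset({0, 1, 2, 3}))))
```

A swap guarded by a condition that can still hold after swapping (for example "r1 is not a set" when both arguments are bounds) would recurse forever, which is the general shape of the hang described above.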
Benny Halevy	36c66d5a8f	chunked_vector_test: add more exception safety tests For insertion, with and without reservation, and for fill and copy constructors. Reproduces https://github.com/scylladb/scylladb/issues/18635 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-21 11:31:23 +03:00
Benny Halevy	9413afce41	chunked_vector_test: exception_safe_class: count also moved objects We have to account for moved objects as well as copied objects so they will be balanced with the respective `del_live_object` calls called by the destructor. However, since chunked_vector requires the value_type to be nothrow_move_constructible, just count the additional live object, but do not modify _countdown or, respectively, throw an exception, as this should be considered only for the default and copy constructors. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-21 11:05:38 +03:00
Botond Dénes	7552c4b187	test/boost/mutation_fragment_test.cc: add test for validator validation levels To make sure that the validator doesn't validate what the validation level doesn't include. (cherry picked from commit `78afb3644c`)	2024-05-17 07:55:05 +00:00
Kefu Chai	daf4ffb9b4	test/cql-pytest/test_tools.py: test shard-of with a single partition test_scylla_sstable_shard_of takes lots of time preparing the keys for a certain shard. with the debug build, it takes 3 minutes to complete the test. so in order to test the "shard-of" subcommand in a more efficient way, in this change, we improve the test in two ways: 1. cache the output of `scylla types shardof`. so we can avoid the overhead of running a seastar application repeatedly for the same keys. 2. reduce the number of partitions from 42 to 1. as the number of partitions in an sstable does not matter when testing the output of "shard-of" command of a certain sstable. because, the sstable is always generated by a certain shard. before this change, with pytest-profiling:
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      4/3    0.000    0.000  181.950   60.650 runner.py:219(call_and_report)
      4/3    0.000    0.000  181.948   60.649 runner.py:247(call_runtest_hook)
      4/3    0.000    0.000  181.948   60.649 runner.py:318(from_call)
      4/3    0.000    0.000  181.948   60.649 runner.py:262(<lambda>)
    44/11    0.000    0.000  181.935   16.540 _hooks.py:427(__call__)
    43/11    0.000    0.000  181.935   16.540 _manager.py:103(_hookexec)
    43/11    0.000    0.000  181.935   16.540 _callers.py:30(_multicall)
      361    0.001    0.000  181.531    0.503 contextlib.py:141(__exit__)
   782/81    0.001    0.000  177.578    2.192 {built-in method builtins.next}
     1044    0.006    0.000   92.452    0.089 base_events.py:1894(_run_once)
       11    0.000    0.000   91.129    8.284 fixtures.py:686(<lambda>)
    17/11    0.000    0.000   91.129    8.284 fixtures.py:1025(finish)
        4    0.000    0.000   91.128   22.782 fixtures.py:913(_teardown_yield_fixture)
      2/1    0.000    0.000   91.055   91.055 runner.py:111(pytest_runtest_protocol)
      2/1    0.000    0.000   91.055   91.055 runner.py:119(runtestprotocol)
        2    0.000    0.000   91.052   45.526 conftest.py:50(cql)
        2    0.000    0.000   91.040   45.520 util.py:161(cql_session)
        1    0.000    0.000   91.040   91.040 runner.py:180(pytest_runtest_teardown)
        1    0.000    0.000   91.040   91.040 runner.py:509(teardown_exact)
     1945    0.002    0.000   90.722    0.047 events.py:82(_run)
```
after this change:
```
   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
      4/3    0.000    0.000    8.271    2.757 runner.py:219(call_and_report)
    44/11    0.000    0.000    8.270    0.752 _hooks.py:427(__call__)
    44/11    0.000    0.000    8.270    0.752 _manager.py:103(_hookexec)
    44/11    0.000    0.000    8.270    0.752 _callers.py:30(_multicall)
      4/3    0.000    0.000    8.269    2.756 runner.py:247(call_runtest_hook)
      4/3    0.000    0.000    8.269    2.756 runner.py:318(from_call)
      4/3    0.000    0.000    8.269    2.756 runner.py:262(<lambda>)
       48    0.000    0.000    8.269    0.172 {method 'send' of 'generator' objects}
       27    0.000    0.000    5.671    0.210 contextlib.py:141(__exit__)
       11    0.000    0.000    4.297    0.391 fixtures.py:686(<lambda>)
      2/1    0.000    0.000    4.228    4.228 runner.py:111(pytest_runtest_protocol)
      2/1    0.000    0.000    4.228    4.228 runner.py:119(runtestprotocol)
        2    0.000    0.000    4.213    2.106 capture.py:877(pytest_runtest_teardown)
        1    0.000    0.000    4.213    4.213 runner.py:180(pytest_runtest_teardown)
        1    0.000    0.000    4.213    4.213 runner.py:509(teardown_exact)
        2    0.000    0.000    3.628    1.814 capture.py:872(pytest_runtest_call)
        1    0.000    0.000    3.627    3.627 runner.py:160(pytest_runtest_call)
        1    0.000    0.000    3.627    3.627 python.py:1797(runtest)
   114/81    0.001    0.000    3.505    0.043 {built-in method builtins.next}
       15    0.784    0.052    3.183    0.212 subprocess.py:417(check_output)
```
Fixes #16516 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16523 (cherry picked from commit `642652efab`)	2024-05-15 14:32:43 +08:00
Kefu Chai	03a54a4c07	tools/scylla-sstable: add `scylla sstable shard-of` command when migrating to the uuid-based identifiers, the mapping from the integer-based generation to the shard-id is preserved. we used to have "gen % smp_count" for calculating the shard which is responsible to host a given sstable. despite that this is not a documented behavior, this is handy when we try to correlate an sstable to a shard, typically when looking at a performance issue. in this change, a new subcommand is added to expose the connection between the sstable and its "owner" shards. Fixes #16343 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#16345 (cherry picked from commit `273ee36bee`)	2024-05-15 14:32:42 +08:00
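Note on 03a54a4c07: the legacy mapping mentioned in the message, "gen % smp_count", amounts to the one-liner below. This sketch covers only that undocumented integer-generation rule; it says nothing about how the new `shard-of` subcommand computes owner shards, and the function name is hypothetical.

```python
def legacy_shard_of_generation(generation: int, smp_count: int) -> int:
    """Shard that hosted an sstable under integer-based generations."""
    if smp_count <= 0:
        raise ValueError("smp_count must be positive")
    return generation % smp_count


# With 4 shards, generations 0..7 cycle through shards 0, 1, 2, 3.
print([legacy_shard_of_generation(g, 4) for g in range(8)])
```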
Lakshmi Narayanan Sreethar	e30a2af700	sstable_datafile_test: add testcase to test reclaim during reload Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `4d22c4b68b`)	2024-05-14 01:04:42 +05:30
Lakshmi Narayanan Sreethar	e0b4483bb8	sstable_datafile_test: add test to verify auto reload of reclaimed components Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `a080daaa94`)	2024-05-14 00:10:28 +05:30
Lakshmi Narayanan Sreethar	3933fc25de	sstable_directory_test: fix generation in sstable_directory_test_table_scan_incomplete_sstables The testcase uses an sstable whose mutation key and the generation are owned by different shards. Due to this, when process_sstable_dir is called, the sstable gets loaded into a different shard than the one that was intended. This also means that the sstable and the sstable manager end up in different shards. The following patch will introduce a condition variable in sstables manager which will be signalled from the sstables. If the sstable and the sstable manager are in different shards, the signalling will cause the testcase to fail in debug mode with this error : "Promise task was set on shard x but made ready on shard y". So, fix it by supplying appropriate generation number owned by the same shard which owns the mutation key as well. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `24064064e9`)	2024-05-14 00:10:01 +05:30
Lakshmi Narayanan Sreethar	a741202ef0	sstable_datafile_test: add test to verify reclaimed components reload Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `69b2a127b0`)	2024-05-14 00:09:54 +05:30
Kamil Braun	ed89deab40	direct_failure_detector: increase ping timeout and make it tunable The direct failure detector design is simplistic. It sends pings sequentially and times out listeners that reached the threshold (i.e. didn't hear from a given endpoint for too long) in-between pings. Given the sequential nature, the previous ping must finish so the next ping can start. We timeout pings that take too long. The timeout was hardcoded and set to 300ms. This is too low for wide-area setups -- latencies across the Earth can indeed go up to 300ms. 3 subsequent timed out pings to a given node were sufficient for the Raft listener to "mark server as down" (the listener used a threshold of 1s). Increase the ping timeout to 600ms which should be enough even for pinging the opposite side of Earth, and make it tunable. Increase the Raft listener threshold from 1s to 2s. Without the increased threshold, one timed out ping would be enough to mark the server as down. Increasing it to 2s requires 3 timed out pings which makes it more robust in presence of transient network hiccups. In the future we'll most likely want to decrease the Raft listener threshold again, if we use Raft for data path -- so leader elections start quickly after leader failures. (Faster than 2s). To do that we'll have to improve the design of the direct failure detector. Ref: scylladb/scylladb#16410 Fixes: scylladb/scylladb#16607 --- I tested the change manually using `tc qdisc ... netem delay`, setting network delay on local setup to ~300ms with jitter. Without the change, the result is as observed in scylladb/scylladb#16410: interleaving ``` raft_group_registry - marking Raft server ... as dead for Raft groups raft_group_registry - marking Raft server ... as alive for Raft groups ``` happening once every few seconds. The "marking as dead" happens whenever we get 3 subsequent failed pings, which happens with a certain (high) probability depending on the latency jitter. 
Then as soon as we get a successful ping, we mark server back as alive. With the change, the phenomenon no longer appears. (cherry picked from commit `8df6d10e88`) Closes scylladb/scylladb#18559	2024-05-08 14:57:09 +02:00
Nadav Har'El	862e2affe0	cql3: Fix invalid JSON parsing for JSON object with different key types More than three years ago, in issue #7949, we noticed that trying to set a `map<ascii, int>` from JSON input (i.e., using INSERT JSON or the fromJson() function) fails - the ascii key is incorrectly parsed. We fixed that issue in commit `75109e9519` but unfortunately, did not do our due diligence: We did not write enough tests inspired by this bug, and failed to discover that actually we have the same bug for many other key types, not just for "ascii". Specifically, the following key types have exactly the same bug: * blob * date * inet * time * timestamp * timeuuid * uuid Other types, like numbers or boolean worked "by accident" - instead of parsing them as a normal string, we asked the JSON parser to parse them again after removing the quotes, and because unquoted numbers and unquoted true/false happen to work in JSON, this didn't fail. The fix here is very simple - for all native types (i.e., not collections or tuples), the encoding of the key in JSON is simply a quoted string - and removing the quotes is all we need to do and there's no need to run the JSON parser a second time. Only for more elaborate types - collections and tuples - we need to run the JSON parser a second time on the key string to build the more elaborate object. This patch also includes tests for fromJson() reading a map with all native key types, confirming that all the aforementioned key types were broken before this patch, and all key types (including the numbers and booleans which worked even before this patch) work with this patch. Fixes #18477. Signed-off-by: Nadav Har'El <nyh@scylladb.com> (cherry picked from commit `21557cfaa6`) Closes scylladb/scylladb#18522	2024-05-05 23:53:19 +03:00
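Note on 862e2affe0: the "worked by accident" behaviour can be reproduced with any JSON parser. The sketch below uses Python's standard `json` module rather than Scylla's parser, and the helper names are hypothetical. Re-parsing an unquoted key succeeds for numbers and booleans but fails for text such as a date, while treating native keys as plain strings and re-parsing only collections or tuples handles every case.

```python
import json


def reparse_succeeds(key_text: str) -> bool:
    """Old approach: feed the unquoted key text to the JSON parser again."""
    try:
        json.loads(key_text)
        return True
    except json.JSONDecodeError:
        return False


def decode_key(key_text: str, is_native: bool):
    """Fixed approach: native keys are just the string itself; only
    collections and tuples need a second JSON parse."""
    return key_text if is_native else json.loads(key_text)


for key in ["42", "true", "hello", "2024-05-05"]:
    print(key, reparse_succeeds(key))

print(decode_key("2024-05-05", is_native=True))
print(decode_key("[1, 2]", is_native=False))
```

In the real fix the string is still converted to the column's type afterwards; that step is omitted here.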
Lakshmi Narayanan Sreethar	201d990072	sstables: reclaim_memory_from_components: do not update _recognised_components When reclaiming memory from bloom filters, do not remove them from _recognised_components, as that leads to the on-disk filter component being left behind on disk when the SSTable is deleted. Fixes #18398 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#18400 (cherry picked from commit `6af2659b57`)	2024-04-26 10:59:13 +03:00
Botond Dénes	d09e2a2311	test/boost/sstable_compaction_test: add validation test with valid sstable Add a positive test, as it turns out we had some false-positive validation bugs in the validator and we need a regression test for this. (cherry picked from commit `2335f42b2b`)	2024-04-24 09:38:57 -04:00
Botond Dénes	1d80427888	test/boost/sstable_compaction_test: drop write_corrupt_sstable() helper It is not used anymore. (cherry picked from commit `8be97884ec`)	2024-04-24 09:26:46 -04:00
Botond Dénes	ff17ec81e4	test/boost/sstable_compaction_test: fix indentation (cherry picked from commit `da0f4d3a9f`)	2024-04-24 09:26:30 -04:00
Botond Dénes	121f2a530e	test/boost/sstable_compaction_test: use test_scrub_framework in test_scrub_quarantine_mode_test The test becomes a lot shorter and it now uses random schema and random data. Indentation is left broken, to be fixed in a future patch. (cherry picked from commit `c35092aff6`)	2024-04-24 09:03:44 -04:00
Botond Dénes	b77581c84c	test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_segregate_mode_test The test becomes a lot shorter and it now uses random schema and random data. Indentation is left broken, to be fixed in a future patch. (cherry picked from commit `3f76aad609`)	2024-04-24 08:58:18 -04:00
Botond Dénes	ea176bf4ce	test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_skip_mode_test The test becomes a lot shorter and it now uses random schema and random data. The test is also split in two: one test for abort mode and one for skip mode. Indentation is left broken, to be fixed in a future patch. (cherry picked from commit `5237e8133b`)	2024-04-24 08:45:53 -04:00
Botond Dénes	3835fd681d	test/boost/sstable_compaction_test: use scrub_test_framework in sstable_scrub_validate_mode_test The test becomes a lot shorter and it now uses random schema and random data. Indentation is left broken, to be fixed in a future patch. (cherry picked from commit `76785baf43`)	2024-04-24 07:57:54 -04:00
Botond Dénes	14da273c4c	test/boost/sstable_compaction_test: introduce scrub_test_framework Scrub tests require a lot of boilerplate code to work. This has a lot of disadvantages: * Tests are long * The "meat" of the test is lost between all the boiler-plate, it is hard to glean what a test actually does * Tests are hard to write, so we have only a few of them and they test multiple things. * The boiler-plate differs slightly from test-to-test. To solve this, this patch introduces a new class, `scrub_test_framework`, which is a central place for all the boiler-plate code needed to write scrub-related tests. In the next patches, we will migrate scrub related tests to this class. (cherry picked from commit `b6f0c4efa0`)	2024-04-24 07:57:54 -04:00
Botond Dénes	33d5f27244	test/lib/random_schema: add uncompactible_timestamp_generator() Guarantees that produced mutations will not be compactible. (cherry picked from commit `e412673c44`)	2024-04-24 07:57:54 -04:00
Kamil Braun	53e1ed0ebb	Merge '[Backport 5.4] gossiper: lock local endpoint when updating heart_beat' from ScyllaDB

In testing, we've observed multiple cases where nodes would fail to observe updated application states of other nodes in gossiper. For example:
- in scylladb/scylladb#16902, a node would finish bootstrapping and enter NORMAL state, propagating this information through gossiper. However, other nodes would never observe that the node entered NORMAL state, still thinking that it is in joining state. This would lead to further bad consequences down the line.
- in scylladb/scylladb#15393, a node got stuck in bootstrap, waiting for schema versions to converge. Convergence would never be achieved and the test eventually timed out. The node was observing outdated schema state of some existing node in gossip.

I created a test that would bootstrap 3 nodes, then wait until they all observe each other as NORMAL, with timeout. Unfortunately, thousands of runs of this test on different machines failed to reproduce the problem. After banging my head against the wall failing to reproduce, I decided to sprinkle randomized sleeps across multiple places in gossiper code and finally: the test started catching the problem in about 1 in 1000 runs. With additional logging and additional head-banging, I determined the root cause.

The following scenario can happen, 2 nodes are sufficient, let's call them A and B:
- Node B calls `add_local_application_state` to update its gossiper state, for example, to propagate its new NORMAL status.
- `add_local_application_state` takes a copy of the endpoint_state, and updates the copy:
```
auto local_state = ep_state_before;
for (auto& p : states) {
    auto& state = p.first;
    auto& value = p.second;
    value = versioned_value::clone_with_higher_version(value);
    local_state.add_application_state(state, value);
}
```
  `clone_with_higher_version` bumps `version` inside gms/version_generator.cc.
- `add_local_application_state` calls `gossiper.replicate(...)`
- `replicate` works in 2 phases to achieve exception safety: in first phase it copies the updated `local_state` to all shards into a separate map. In second phase the values from separate map are used to overwrite the endpoint_state map used for gossiping. Due to the cross-shard calls of the first phase, there is a yield before the second phase. During this yield the following happens:
- `gossiper::run()` loop on B executes and bumps node B's `heart_beat`. This uses the monotonic version_generator, so it uses a higher version than the ones we used for states added above. Let's call this new version X. Note that X is larger than the versions used by application_states added above.
- now node B handles a SYN or ACK message from node A, creating an ACK or ACK2 message in response. This message contains:
    - old application states (NOT including the update described above, because `replicate` is still sleeping before phase 2),
    - but bumped heart_beat == X from `gossiper::run()` loop,
  and sends the message.
- node A receives the message and remembers that the max version across all states (including heart_beat) of node B is X. This means that it will no longer request or apply states from node B with versions smaller than X.
- `gossiper.replicate(...)` on B wakes up, and overwrites endpoint_state with the ones it saved in phase 1. In particular it reverts heart_beat back to smaller value, but the larger problem is that it saves updated application_states that use versions smaller than X.
- now when node B sends the updated application_states in ACK or ACK2 message to node A, node A will ignore them, because their versions are smaller than X. Or node B will never send them, because whenever node A requests states from node B, it only requests states with versions > X. Either way, node A will fail to observe new states of node B.

If I understand correctly, this is a regression introduced in `38c2347a3c`, which introduced a yield in `replicate`. Before that, the updated state would be saved atomically on shard 0, there could be no `heart_beat` bump in-between making a copy of the local state, updating it, and then saving it.

With the description above, it's easy to make a consistent reproducer for the problem -- introduce a longer sleep in `add_local_application_state` before second phase of replicate, to increase the chance that gossiper loop will execute and bump heart_beat version during the yield. Further commit adds a test based on that.

The fix is to bump the heart_beat under local endpoint lock, which is also taken by `replicate`. The PR also adds a regression test.

Fixes: scylladb/scylladb#15393
Fixes: scylladb/scylladb#15602
Fixes: scylladb/scylladb#16668
Fixes: scylladb/scylladb#16902
Fixes: scylladb/scylladb#17493
Fixes: scylladb/scylladb#18118
Ref: scylladb/scylla-enterprise#3720
(cherry picked from commit `a0b331b310`)
(cherry picked from commit `72955093eb`)
Refs scylladb/scylladb#18184
Closes scylladb/scylladb#18245

* github.com:scylladb/scylladb:
  test: reproducer for missing gossiper updates
  gossiper: lock local endpoint when updating heart_beat	2024-04-17 17:50:30 +02:00
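Note on 53e1ed0ebb: the version race described in this message can be replayed with a small deterministic model. This is a heavily simplified sketch, not gossiper code: the dictionaries, the `simulate` function and the single `STATUS` state are hypothetical, and the model ignores shards, message types and the heart_beat revert. It only shows how a heart_beat bump between the two phases of `replicate` makes a peer skip a state whose version is lower than the heart_beat it has already seen.

```python
def newer_than(states, max_seen):
    """States a peer would accept: only versions above what it has seen."""
    return {name: ver for name, ver in states.items() if ver > max_seen}


def simulate(heartbeat_bumped_during_replicate: bool) -> bool:
    """Return True if node A ends up observing node B's new STATUS."""
    published = {"heart_beat": 0}  # what node B currently gossips
    seen_by_a = {}
    max_seen_by_a = 0
    next_version = 1  # stands in for the monotonic version generator

    def sync():
        # Node A pulls everything newer than the max version it remembers.
        nonlocal max_seen_by_a
        seen_by_a.update(newer_than(published, max_seen_by_a))
        max_seen_by_a = max([max_seen_by_a, *published.values()])

    # Phase 1: B prepares the new STATUS in a private copy.
    pending_status = next_version
    next_version += 1

    if heartbeat_bumped_during_replicate:
        # Race: the gossip loop bumps heart_beat to X during the yield.
        published["heart_beat"] = next_version
        next_version += 1
        sync()  # A learns X but not the new STATUS
        published["STATUS"] = pending_status  # phase 2, version < X
    else:
        # Fix: the bump waits for the lock, so phase 2 completes first.
        published["STATUS"] = pending_status
        published["heart_beat"] = next_version
        next_version += 1

    sync()
    return "STATUS" in seen_by_a


print(simulate(heartbeat_bumped_during_replicate=True))
print(simulate(heartbeat_bumped_during_replicate=False))
```

In the racy ordering node A never picks up `STATUS`; once the bump is serialized with `replicate`, it does.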
Kamil Braun	28781ca37e	test: reproducer for missing gossiper updates Regression test for scylladb/scylladb#17493. (cherry picked from commit `72955093eb`) Backport note: removed `timeout` parameter passed to `server_add`, missing on this branch. (If server adding hangs, it will timeout after `TOPOLOGY_TIMEOUT` from scylla_cluster.py) Removed `force_gossip_join_boot` error injection from test, not present in this branch. Starting nodes with `experimental_features` disabled. Added missing `handle_state_normal.*finished` message.	2024-04-17 13:09:39 +02:00
Lakshmi Narayanan Sreethar	75962d3e94	test_bloom_filter.py: disable reclaiming memory from components Disabled reclaiming memory from sstable components in the testcase as it interferes with the false positive calculation. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `d86505e399`)	2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar	034304127c	sstable_datafile_test: add tests to verify auto reclamation of components Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `d261f0fbea`)	2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar	95068d3c00	test/lib: allow overriding available memory via test_env_config Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `169629dd40`)	2024-04-16 13:05:40 +05:30
Lakshmi Narayanan Sreethar	1609b77b45	sstable_datafile_test: add testcase to verify reclamation from sstables Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `e0b6186d16`)	2024-04-16 13:05:40 +05:30
Aleksandra Martyniuk	bfc4104eb9	test: add test for repair_row::size() Add test which checks whether repair_row::size() considers external memory. (cherry picked from commit `51c09a84cc`)	2024-04-06 22:44:51 +00:00
Michał Chojnowski	3df5de60a9	cache_flat_mutation_reader: only call get_iterator_in_latest() when pointing at a row Calling `_next_row.get_iterator_in_latest()` is illegal when `_next_row` is not pointing at a row. In particular, the iterator returned by such call might be dangling. We have observed this to cause a use-after-free in the field, when a reverse read called `maybe_add_to_cache` after `_latest_it` was left dangling after a dead row removal in `copy_from_cache_to_buffer`. To fix this, we should ensure that we only call `_next_row.get_iterator_in_latest` when `_next_row` is pointing at a row. Only the occurrences of this problem in `maybe_add_to_cache` are truly dangerous. As far as I can see, other occurrences can't break anything as of now. But we apply fixes to them anyway. (cherry picked from commit `04db6d4bb1`) Closes scylladb/scylladb#18075	2024-03-28 11:04:28 +01:00
Kefu Chai	57fa61e2ca	tests: utils: error injection: print time duration instead of count instead of casting / comparing the count of duration unit, let's just compare the durations, so that boost.test is able to print the duration in a more informative and user friendly way (line wrapped) test/boost/error_injection_test.cc(167): fatal error: in "test_inject_future_disabled": critical check wait_time > sleep_msec has failed [23839ns <= 10ms] Refs #15902 Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> (cherry picked from commit `1d33a68dd7`)	2024-03-20 09:32:00 +00:00
Aleksandra Martyniuk	881ac7a9af	test: fix regular compaction tasks check Since `6b87778` regular compaction tasks are removed from task manager immediately after they are finished. test_regular_compaction_task lists compaction tasks and then requests their statuses. Only one regular compaction task is guaranteed to still be running at that time, the rest of them may finish before their status is requested and so it will no longer be in task manager, causing the test to fail. Fix statuses check to consider the possibility of a regular compaction task being removed from task manager. Fixes: #17776. (cherry picked from commit `80c5eb4ecb`) Closes scylladb/scylladb#17810	2024-03-15 08:54:00 +02:00
Nadav Har'El	8a1f01ad88	alternator, mv: fix case of two new key columns in GSI A materialized view in CQL allows AT MOST ONE view key column that wasn't a key column in the base table. This is because if there were two or more of those, the "liveness" (timestamp, ttl) of these different columns can change at every update, and it's not possible to pick what liveness to use for the view row we create. We made an exception for this rule for Alternator: DynamoDB's API allows creating a GSI whose partition key and range key are both regular columns in the base table, and we must support this. We claim that the fact that Alternator allows neither TTL (Alternator's "TTL" is a different feature) nor user-defined timestamps, does allow picking the liveness for the view row we create. But we did it wrong! We claimed in a comment - and implemented in the code before this patch - that in Alternator we can assume that both GSI key columns will have the same liveness, and in particular timestamp. But this is only true if one modifies both columns together! In fact, in general it is not true: We can have two non-key attributes 'a' and 'b' which are the GSI's key columns, and we can modify only b, without modifying a, in which case the timestamp of the view modification should be b's newer timestamp, not a's older one. The existing code took a's timestamp, assuming it will be the same as b's, which is incorrect. The result was that if we repeatedly modify only b, all view updates will receive the same timestamp (a's old timestamp), and a deletion will always win over all the modifications. This patch includes a reproducing test written by a user (@Zak-Kent) that demonstrates how after a view row is deleted it doesn't get recreated - because all the modifications use the same timestamp. The fix is, as suggested above, to use the higher of the two timestamps of both base-regular-column GSI key columns as the timestamp for the new view rows or view row deletions. 
The reproducer that failed before this patch passes with it. As usual, the reproducer passes on AWS DynamoDB as well, proving that the test is correct and should really work. Fixes #17119 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17172 (cherry picked from commit `21e7deafeb`)	2024-03-13 14:46:03 +02:00
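Note on 8a1f01ad88: the effect of taking the higher of the two key-column timestamps can be checked with a few lines of arithmetic. The sketch below is not the view update code; the function names and the simple "a write is visible only if it is newer than the deletion" rule are assumptions made for illustration.

```python
def view_row_timestamp_before_fix(ts_a: int, ts_b: int) -> int:
    """Old behaviour: assume both GSI key columns share a's timestamp."""
    return ts_a


def view_row_timestamp_after_fix(ts_a: int, ts_b: int) -> int:
    """Fixed behaviour: use the newer of the two column timestamps."""
    return max(ts_a, ts_b)


def is_live(row_ts: int, deletion_ts: int) -> bool:
    """A view row survives only if it is newer than the deletion."""
    return row_ts > deletion_ts


# 'a' was written at t=1, the view row was deleted at t=5,
# and then only 'b' is rewritten at t=6.
ts_a, ts_b, deletion_ts = 1, 6, 5
print(is_live(view_row_timestamp_before_fix(ts_a, ts_b), deletion_ts))
print(is_live(view_row_timestamp_after_fix(ts_a, ts_b), deletion_ts))
```

With the old rule the recreated view row carries timestamp 1 and loses to the deletion at 5; with the maximum it carries 6 and reappears.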
Raphael S. Carvalho	db1c8e8754	Fix potential data resurrection when another compaction type does cleanup work Since commit `f1bbf70`, many compaction types can do cleanup work, but turns out we forgot to invalidate cache on their completion. So if a node regains ownership of token that had partition deleted in its previous owner (and tombstone is already gone), data can be resurrected. Tablet is not affected, as it explicitly invalidates cache during migration cleanup stage. Scylla 5.4 is affected. Fixes #17501. Fixes #17452. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#17502 (cherry picked from commit `f07c233ad5`)	2024-03-13 14:10:05 +02:00
Kamil Braun	6e01e821d7	test: unflake test_topology_remove_garbage_group0 The test is booting nodes, and then immediately starts shutting down nodes and removing them from the cluster. The shutting down and removing may happen before driver manages to connect to all nodes in the cluster. In particular, the driver didn't yet connect to the last bootstrapped node. Or it can even happen that the driver has connected, but the control connection is established to the first node, and the driver fetched topology from the first node when the first node didn't yet consider the last node to be normal. So the driver decides to close connection to the last node like this: ``` 22:34:03.159 DEBUG> [control connection] Removing host not found in peers metadata: <Host: 127.42.90.14:9042 datacenter1> ``` Eventually, at the end of the test, only the last node remains, all other nodes have been removed or stopped. But the driver does not have a connection to that last node. Fix this problem by ensuring that: - all nodes see each other as NORMAL, - the driver has connected to all nodes at the beginning of the test, before we start shutting down and removing nodes. Fixes scylladb/scylladb#16373 (cherry picked from commit `a68701ed4f`) Closes scylladb/scylladb#17702	2024-03-12 10:50:27 +01:00
Michał Chojnowski	42d9e36454	sstables: fix a use-after-free in key_view::explode() key_view::explode() contains a blatant use-after-free: unless the input is already linearized, it returns a view to a local temporary buffer. This is rare, because partition keys are usually not large enough to be fragmented. But for a sufficiently large key, this bug causes a corrupted partition_key down the line. Fixes #17625 (cherry picked from commit `7a7b8972e5`) Closes scylladb/scylladb#17717	2024-03-11 12:05:55 +02:00
Lakshmi Narayanan Sreethar	05d2078911	reader_permit: store schema_ptr instead of raw schema pointer Store schema_ptr in reader permit instead of storing a const pointer to schema to ensure that the schema doesn't get changed elsewhere when the permit is holding on to it. Also update the constructors and all the relevant callers to pass down schema_ptr instead of a raw pointer. Fixes #16180 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#16658 (cherry picked from commit `76f0d5e35b`) Closes scylladb/scylladb#17677	2024-03-08 08:29:22 +02:00
Avi Kivity	58a1be93b2	Merge ' test/topology_custom: test_read_repair.py: reduce run-time ' from Botond Dénes This test needed a lot of data to ensure multiple pages when doing the read repair. This changes two key configuration items, allowing for a drastic reduction of the data size and consequently a large reduction in run-time. * Changes query-tombstone-page-limit 1000 -> 10. Before `f068d1a6fa`, reducing this to too small a value would start killing internal queries. Now, after said commit, this is no longer a concern, as this limit no longer affects unpaged queries. * Sets (the new) query-page-size-in-bytes 1MB (default) -> 1KB. The latter configuration is a new one, added by the first patches of this series. It allows configuring the page size in bytes, after which pages are cut. Previously this was a hard-coded constant: 1MB. This forced any tests which wanted to check paging, with pages cut on size, to work with large datasets. This was especially pronounced in the test fixed in this PR, because this test works with tombstones, which are tiny, and a lot of them were needed to trigger paging based on the size. With these two changes, we can reduce the data size: * total_rows: 20000 -> 100 * max_live_rows: 32 -> 8 The runtime of the test consequently drops from 62 seconds to 13.5 seconds (dev mode, on my build machine). Fixes: https://github.com/scylladb/scylladb/issues/15425 Fixes: https://github.com/scylladb/scylladb/issues/16899 Closes scylladb/scylladb#17529 * github.com:scylladb/scylladb: test/topology_custom: test_read_repair.py: reduce run-time replica/database: get_query_max_result_size(): use query_page_size_in_bytes replica/database: use include page-size in max-result-size query-request: max_result_size: add without_page_limit() db/config: introduce query_page_size_in_bytes (cherry picked from commit `616eec2214`)	2024-02-28 11:23:22 +02:00
Botond Dénes	6a6450a82d	Merge '[Backport 5.4] repair: streaming: handle no_such_column_family from remote node' from Aleksandra Martyniuk RPC calls lose information about the type of returned exception. Thus, if a table is dropped on receiver node, but it still exists on a sender node and sender node streams the table's data, then the whole operation fails. To prevent that, add a method which synchronizes schema and then checks, if the exception was caused by table drop. If so, the exception is swallowed. Use the method in streaming and repair to continue them when the table is dropped in the meantime. Fixes: https://github.com/scylladb/scylladb/issues/17028. Fixes: https://github.com/scylladb/scylladb/issues/15370. Fixes: https://github.com/scylladb/scylladb/issues/15598. Closes scylladb/scylladb#17525 * github.com:scylladb/scylladb: repair: handle no_such_column_family from remote node gracefully test: test drop table on receiver side during streaming streaming: fix indentation streaming: handle no_such_column_family from remote node gracefully repair: add methods to skip dropped table	2024-02-27 10:57:48 +02:00
Aleksandra Martyniuk	f843e7181b	test: add test to check if reader is closed Add test to check if reader is closed in sstable::has_partition_key. (cherry picked from commit `4530be9e5b`)	2024-02-26 15:40:49 +01:00
Aleksandra Martyniuk	5e665cd7fb	test: test drop table on receiver side during streaming (cherry picked from commit `2ea5d9b623`)	2024-02-26 13:00:58 +01:00
Nadav Har'El	72e804306c	mv: fix missing view deletions in some cases of range tombstones For efficiency, if a base-table update generates many view updates that go to the same partition, they are collected as one mutation. If this mutation grows too big it can lead to memory exhaustion, so since commit `7d214800d0` we split the output mutation into mutations no longer than 100 rows (max_rows_for_view_updates) each. This patch fixes a bug where this split was done incorrectly when the update involved range tombstones, a bug which was discovered by a user in a real use case (#17117). Range tombstones are read in two parts, a beginning and an end, and the code could split the processing between these two parts, with the result that some of the range tombstones in the update could be missed - and the view could miss some deletions that happened in the base table. This patch fixes the code in two places to avoid breaking up the processing between range tombstones: 1. The counter "_op_count" that decides where to break the output mutation should only be incremented when adding rows to this output mutation. The existing code strangely incremented it on every read (!?) which resulted in the counter being incremented on every input fragment, and in particular could reach the limit 100 between two range tombstone pieces. 2. Moreover, the length of output was checked in the wrong place... The existing code could get to 100 rows, not check at that point, read the next input - half a range tombstone - and only then check that we reached 100 rows and stop. The fix is to calculate the number of rows in the right place - exactly when it's needed, not before the step. The first change needs more justification: The old code, that incremented _op_count on every input fragment and not just output fragments, did not fit the stated goal of its introduction - to avoid large allocations. In one test it resulted in breaking up the output mutation to chunks of 25 rows instead of the intended 100 rows. 
But, maybe there was another goal, to stop the iteration after 100 input rows and avoid the possibility of stalls if there are no output rows? It turns out the answer is no - we don't need this _op_count increment to avoid stalls: The function build_some() uses `co_await on_results()` to run one step of processing one input fragment - and `co_await` always checks for preemption. I verified that indeed no stalls happen by using the existing test test_long_skipped_view_update_delete_with_timestamp. It generates a very long base update where all the view updates go to the same partition, but all but the last few updates don't generate any view updates. I confirmed that the fixed code loops over all these input rows without increasing _op_count and without generating any view update yet, but it does NOT stall. This patch also includes two tests reproducing this bug and confirming it's fixed, and also two additional tests for breaking up long deletions that I wanted to make sure don't fail after this patch (they don't). By the way, this fix would have also fixed issue #12297 - which we fixed a year ago in a different way. That issue happened when the code went through 100 input rows without generating any output rows, and incorrectly concluded that there's no view update to send. With this fix, the code no longer stops generating the view update just because it saw 100 input rows - it would have waited until it generated 100 output rows in the view update (or the input is really done). Fixes #17117 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#17164 (cherry picked from commit `14315fcbc3`)	2024-02-22 15:04:28 +02:00
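The counting rule from the commit above (count only rows added to the output, and check the limit at the moment a row is added) can be sketched as follows. This is a simplified model under stated assumptions, not the Scylla view-update builder: the function name `batch_view_rows`, the use of `None` for "input fragment that produced no view row", and the list-based batches are all hypothetical.

```python
from typing import Iterable, Iterator, List, Optional

# Value taken from the commit message (max_rows_for_view_updates).
MAX_ROWS_FOR_VIEW_UPDATES = 100


def batch_view_rows(
    fragments: Iterable[Optional[str]],
    limit: int = MAX_ROWS_FOR_VIEW_UPDATES,
) -> Iterator[List[str]]:
    """Group generated view rows into batches of at most `limit` rows.

    Each element of `fragments` stands for one input fragment: a string
    if it produced a view row, None if it produced nothing.
    """
    batch: List[str] = []
    for out_row in fragments:
        if out_row is None:
            continue  # no output row: the counter must not advance
        batch.append(out_row)
        if len(batch) >= limit:  # checked right after adding a row
            yield batch
            batch = []
    if batch:
        yield batch


# 1000 input fragments, of which only every 4th produces a view row.
fragments = [f"row{i}" if i % 4 == 0 else None for i in range(1000)]
sizes = [len(b) for b in batch_view_rows(fragments)]
print(sizes)
```

Counting every input fragment instead would close a batch after 100 inputs regardless of how many rows it held, which is the kind of undersized chunk the commit message describes.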
Avi Kivity	384a0628b0	Merge 'cdc: metadata: allow sending writes to the previous generations' from Patryk Jędrzejczak Before this PR, writes to the previous CDC generations would always be rejected. After this PR, they will be accepted if the write's timestamp is greater than `now - generation_leeway`. This change was proposed around 3 years ago. The motivation was to improve user experience. If a client generates timestamps by itself and its clock is desynchronized with the clock of the node the client is connected to, there could be a period during generation switching when writes fail. We didn't consider this problem critical because the client could simply retry a failed write with a higher timestamp. Eventually, it would succeed. This approach is safe because these failed writes cannot have any side effects. However, it can be inconvenient. Writing to previous generations was proposed to improve it. The idea was rejected 3 years ago. Recently, it turned out that there is a case when the client cannot retry a write with the increased timestamp. It happens when a table uses CDC and LWT, which makes timestamps permanent. Once Paxos commits an entry with a given timestamp, Scylla will keep trying to apply that entry until it succeeds, with the same timestamp. Applying the entry involves writing to the CDC log table. If it fails, we get stuck. It's a major bug with an unknown perfect solution. Allowing writes to previous generations for `generation_leeway` is a probabilistic fix that should solve the problem in practice. Apart from this change, this PR adds tests for it and updates the documentation. This PR is sufficient to enable writes to the previous generations only in the gossiper-based topology. The Raft-based topology needs some adjustments in loading and cleaning CDC generations. These changes won't interfere with the changes introduced in this PR, so they are left for a follow-up. 
Fixes scylladb/scylladb#7251 Fixes scylladb/scylladb#15260 Closes scylladb/scylladb#17134 * github.com:scylladb/scylladb: docs: using-scylla: cdc: remove info about failing writes to old generations docs: dev: cdc: document writing to previous CDC generations test: add test_writes_to_previous_cdc_generations cdc: generation: allow increasing generation_leeway through error injection cdc: metadata: allow sending writes to the previous generations (cherry picked from commit `9bb4482ad0`) Backport note: in tests, replaced `servers_add` with loop of `server_add`	2024-02-22 12:44:24 +01:00
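The acceptance rule described above (a write to a previous generation is accepted if its timestamp is greater than `now - generation_leeway`) can be sketched like this. It is a minimal model, assuming generations are identified by sorted start timestamps and that the write timestamp does not exceed `now`; the function name and parameters are hypothetical and do not mirror the Scylla CDC metadata API.

```python
from bisect import bisect_right
from typing import Sequence


def is_write_accepted(
    write_ts: int,
    now: int,
    generation_starts: Sequence[int],
    generation_leeway: int,
) -> bool:
    """Decide whether a CDC write is accepted (simplified model).

    `generation_starts` is a sorted list of generation start timestamps.
    Assumes write_ts <= now; future timestamps are not modelled here.
    """
    current = bisect_right(generation_starts, now) - 1
    target = bisect_right(generation_starts, write_ts) - 1
    if target < 0:
        return False  # older than every known generation
    if target == current:
        return True  # write goes to the current generation
    # Write goes to a previous generation: allowed only within the leeway.
    return write_ts > now - generation_leeway


starts = [0, 100]  # two generations; the second became active at t=100
print(is_write_accepted(110, now=120, generation_starts=starts, generation_leeway=30))
print(is_write_accepted(95, now=120, generation_starts=starts, generation_leeway=30))
print(is_write_accepted(80, now=120, generation_starts=starts, generation_leeway=30))
```

In this model the write at t=95 targets the previous generation but falls inside the leeway window, while the one at t=80 falls outside it.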
Wojciech Mitros	435000ee70	rust: update dependencies The currently used versions of "time" and "rustix" dependencies had minor security vulnerabilities. In this patch: - the "rustix" crate is updated - the "chrono" crate that we depend on was not compatible with the version of the "time" crate that had fixes, so we updated the "chrono" crate, which actually removed the dependency on "time" completely. Both updates were performed using "cargo update" on the relevant package and the corresponding version. Refs #15772 Closes scylladb/scylladb#17407	2024-02-19 22:12:13 +02:00
Botond Dénes	e4526449a1	query: do not kill unpaged queries when they reach the tombstone-limit The reason we introduced the tombstone-limit (query_tombstone_page_limit), was to allow paged queries to return incomplete/empty pages in the face of large tombstone spans. This works by cutting the page after the tombstone-limit amount of tombstones were processed. If the read is unpaged, it is killed instead. This was a mistake. First, it doesn't really make sense, the reason we introduced the tombstone limit, was to allow paged queries to process large tombstone-spans without timing out. It does not help unpaged queries. Furthermore, the tombstone-limit can kill internal queries done on behalf of user queries, because all our internal queries are unpaged. This can cause denial of service. So in this patch we disable the tombstone-limit for unpaged queries altogether, they are allowed to continue even after having processed the configured limit of tombstones. Fixes: #17241 Closes scylladb/scylladb#17242 (cherry picked from commit `f068d1a6fa`)	2024-02-15 12:50:09 +02:00
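The behaviour after the commit above can be sketched as a toy page reader: a paged read is cut once the tombstone limit is reached, while an unpaged read ignores the limit and runs to completion. The fragment representation and the function name `read_page` are assumptions made for illustration and do not correspond to Scylla's query code.

```python
from typing import Any, Iterable, List, Tuple

Fragment = Tuple[str, Any]  # ("row", value) or ("tombstone", None)


def read_page(
    fragments: Iterable[Fragment],
    tombstone_limit: int,
    paged: bool,
) -> Tuple[List[Any], bool]:
    """Return (rows, finished) for one read over `fragments`.

    Paged reads stop early, possibly with an empty page, once
    `tombstone_limit` tombstones have been processed. Unpaged reads
    are never cut or killed because of the limit.
    """
    rows: List[Any] = []
    tombstones = 0
    for kind, value in fragments:
        if kind == "tombstone":
            tombstones += 1
            if paged and tombstones >= tombstone_limit:
                return rows, False  # page cut; the client asks for more
        else:
            rows.append(value)
    return rows, True


data: List[Fragment] = [("tombstone", None)] * 5 + [("row", 1)]
print(read_page(data, tombstone_limit=3, paged=True))
print(read_page(data, tombstone_limit=3, paged=False))
```

The paged call returns an empty, unfinished page, and the unpaged call processes all five tombstones and returns the live row.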
Botond Dénes	62d8c7274a	Merge 'Fix mintimeuuid() call that could crash Scylla' from Nadav Har'El This PR fixes a bug where certain calls to the `mintimeuuid()` CQL function with large negative timestamps could crash Scylla. It turns out we already had protections in place against very positive timestamps, but very negative timestamps could still cause bugs. The actual fix in this series is just a few lines, but the bigger effort was improving the test coverage in this area. I added tests for the "date" type (the original reproducer for this bug used totimestamp() which takes a date parameter), and also reproducers for this bug directly, without the totimestamp() function, and one with that function. Finally this PR also replaces the assert() which made this molehill-of-a-bug into a mountain, by a throw. Fixes #17035 Closes scylladb/scylladb#17073 * github.com:scylladb/scylladb: utils: replace assert() by on_internal_error() utils: add on_internal_error with common logger utils: add a timeuuid minimum, like we had maximum test/cql-pytest: tests for "date" type (cherry picked from commit `2a4b991772`)	2024-02-07 13:47:55 +02:00
Kamil Braun	311e31b36f	test_raft_snapshot_request: fix flakiness (again) At the end of the test, we wait until a restarted node receives a snapshot from the leader, and then verify that the log has been truncated. To check the snapshot, the test used the `system.raft_snapshots` table, while the log is stored in `system.raft`. Unfortunately, the two tables are not updated atomically when Raft persists a snapshot (scylladb/scylladb#9603). We first update `system.raft_snapshots`, then `system.raft` (see `raft_sys_table_storage::store_snapshot_descriptor`). So after the wait finishes, there's no guarantee the log has been truncated yet -- there's a race between the test's last check and Scylla doing that last delete. But we can check the snapshot using `system.raft` instead of `system.raft_snapshots`, as `system.raft` has the latest ID. And since `1640f83fdc`, storing that ID and truncating the log in `system.raft` happens atomically. Closes scylladb/scylladb#17106 (cherry picked from commit `c911bf1a33`)	2024-02-02 13:02:30 +01:00

5746 Commits