scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-09 00:13:31 +00:00

Author	SHA1	Message	Date
Łukasz Paszkowski	cf36de2c9a	compaction_manager: cancel submission timer on drain The `drain` method cancels all running compactions and moves the compaction manager into the disabled state. To move it back to the enabled state, the `enable` method must be called. This, however, triggers an assertion failure because the submission timer is not cancelled and re-enabling the manager tries to arm the already-armed timer. Thus, cancel the timer when calling the `drain` method to disable the compaction manager. Fixes https://github.com/scylladb/scylladb/issues/24504 All versions are affected, so it's a good candidate for a backport. Closes scylladb/scylladb#24505 (cherry picked from commit `a9a53d9178`) Closes scylladb/scylladb#24590	2025-07-03 10:19:16 +03:00
Botond Dénes	9816cdb901	Merge '[Backport 2025.2] mutation: check key of inserted rows' from Scylladb[bot] Make sure the keys are full prefixes, as is expected for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations. Fixes: https://github.com/scylladb/scylladb/issues/24506 Not a typical backport candidate (not a bugfix or regression fix), but we should still backport so we have the additional checks deployed to existing production clusters. - (cherry picked from commit `8b756ea837`) - (cherry picked from commit `ab96c703ff`) Parent PR: #24497 Closes scylladb/scylladb#24742 * github.com:scylladb/scylladb: mutation: check key of inserted rows compound: optimize is_full() for single-component types	2025-07-02 10:27:13 +03:00
Botond Dénes	7786167998	Merge '[Backport 2025.2] encryption_at_rest_test: Fix some spurious errors' from Scylladb[bot] Fixes #24574 * Ensure we close the embedded load_cache objects on encryption shutdown, otherwise we can, in unit testing, get destruction of these while a timer is still active -> assert * Add extra exception handling to `network_error_test_helper`, so even if test framework might exception-escape, we properly stop the network proxy to avoid use after free. - (cherry picked from commit `ee98f5d361`) - (cherry picked from commit `8d37e5e24b`) Parent PR: #24633 Closes scylladb/scylladb#24770 * github.com:scylladb/scylladb: encryption_at_rest_test: Add exception handler to ensure proxy stop encryption: Ensure stopping timers in provider cache objects	2025-07-02 10:03:52 +03:00
Calle Wilund	9a60a2adce	encryption_at_rest_test: Add exception handler to ensure proxy stop If boost test is run such that we somehow except even in a test macro such as BOOST_REQUIRE_THROW, we could end up not stopping the net proxy used, causing a use after free. (cherry picked from commit `8d37e5e24b`)	2025-07-01 15:12:50 +00:00
Avi Kivity	dd509b9513	Merge '[Backport 2025.2] memtable: ensure _flushed_memory doesn't grow above total_memory' from Scylladb[bot] `dirty_memory_manager` tracks two quantities about memtable memory usage: "real" and "unspooled" memory usage. "real" is the total memory usage (sum of `occupancy().total_space()`) by all memtable LSA regions, plus an upper-bound estimate of the size of memtable data which has already moved to the cache region but isn't evictable (merged into the cache) yet. "unspooled" is the difference between total memory usage by all memtable LSA regions, and the total flushed memory (sum of `_flushed_memory`) of memtables. `dirty_memory_manager` controls the shares of compaction and/or blocks writes when these quantities cross various thresholds. "Total flushed memory" isn't a well defined notion, since the actual consumption of memory by the same data can vary over time due to LSA compactions, and even the data present in memtable can change over the course of the flush due to removals of outdated MVCC versions. So `_flushed_memory` is merely an approximation computed by `flush_reader` based on the data passing through it. This approximation is supposed to be a conservative lower bound. In particular, `_flushed_memory` should not be greater than `occupancy().total_space()`. Otherwise, for example, "unspooled" memory could become negative (and/or wrap around) and weird things could happen. There is an assertion in `~flush_memory_accounter` which checks that `_flushed_memory < occupancy().total_space()` at the end of flush. But it can fail. Without additional treatment, the memtable reader sometimes emits data which is already deleted. (In particular, it emits rows covered by a partition tombstone in a newer MVCC version.) This data is seen by `flush_reader` and accounted in `_flushed_memory`. But this data can be garbage-collected by the `mutation_cleaner` later during the flush and decrease `total_memory` below `_flushed_memory`. 
There is a piece of code in `mutation_cleaner` intended to prevent that. If `total_memory` decreases during a `mutation_cleaner` run, `_flushed_memory` is lowered by the same amount, just to preserve the asserted property. (This could also make `_flushed_memory` quite inaccurate, but that's considered acceptable). But that only works if `total_memory` is decreased during that run. It doesn't work if the `total_memory` decrease (enabled by the new allocator holes made by `mutation_cleaner`'s garbage collection work) happens asynchronously (due to memory reclaim for whatever reason) after the run. This patch fixes that by tracking the decreases of `total_memory` closer to the source. Instead of relying on `mutation_cleaner` to notify the memtable if it lowers `total_memory`, the memtable itself listens for notifications about LSA segment deallocations. It keeps `_flushed_memory` equal to the reader's estimate of flushed memory decreased by the change in `total_memory` since the beginning of flush (if it was positive), and it keeps the amount of "spooled" memory reported to the `dirty_memory_manager` at `max(0, _flushed_memory)`. Fixes scylladb/scylladb#21413 Backport candidate because it fixes a crash that can happen in existing stable branches. - (cherry picked from commit `7d551f99be`) - (cherry picked from commit `975e7e405a`) Parent PR: #21638 Closes scylladb/scylladb#24604 * github.com:scylladb/scylladb: memtable: ensure _flushed_memory doesn't grow above total memory usage replica/memtable: move region_listener handlers from dirty_memory_manager to memtable	2025-07-01 12:31:25 +03:00
Lakshmi Narayanan Sreethar	adab525151	utils/big_decimal: fix scale overflow when parsing values with large exponents The exponent of a big decimal string is parsed as an int32, adjusted for the removed fractional part, and stored as an int32. When parsing values like `1.23E-2147483647`, the unscaled value becomes `123`, and the scale is adjusted to `2147483647 + 2 = 2147483649`. This exceeds the int32 limit, and since the scale is stored as an int32, it overflows and wraps around, losing the value. This patch fixes that by parsing the exponent as an int64 value and then adjusting it for the fractional part. The adjusted scale is then checked to see if it is still within int32 limits before storing. An exception is thrown if it is not within the int32 limits. Note that strings with exponents that exceed the int32 range, like `0.01E2147483650`, were previously not parseable as a big decimal. They are now accepted if the final adjusted scale fits within int32 limits. For the above value, unscaled_value = 1 and scale = -2147483648, so it is now accepted. This is in line with how Java's `BigDecimal` parses strings. Fixes: #24581 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#24640 (cherry picked from commit `279253ffd0`) Closes scylladb/scylladb#24692	2025-07-01 12:28:55 +03:00
Botond Dénes	5f45cf1683	test/boost/memtable_test: only inject error for test table Currently the test indiscriminately injects failures into the flushes of any table, via the IO extension mechanism. The test wants to check that the node correctly handles the IO error by self-isolating; however, the indiscriminate IO errors can have unintended consequences when they hit raft, leading to disorderly shutdown and failure of the tests. Testing raft's resiliency to IO errors is of course worth doing, but it is not the goal of this particular test, so to avoid the fallout, the IO errors are limited to the test tables only. Fixes: https://github.com/scylladb/scylladb/issues/24637 Closes scylladb/scylladb#24638 (cherry picked from commit `ee6d7c6ad9`) Closes scylladb/scylladb#24743	2025-07-01 12:28:05 +03:00
Botond Dénes	236cab0f66	test/boost/sstable_datafile_test: add test for corrupt data * create a table with random schema * generate data: random mutations + one row with bad key * write data to sstable * check that only good data is written to sstable * check that the bad data was saved to system.corrupt_data (cherry picked from commit `edc2906892`)	2025-06-30 12:44:29 +00:00
Botond Dénes	f4f0ffd713	mutation: check key of inserted rows Make sure the keys are full prefixes, as is expected for rows. On several occasions we have seen empty row keys make their way into the sstables, despite the fact that they are not allowed by the CQL frontend. This means that such empty keys are possibly results of memory corruption or use-after-{free,copy} errors. The source of the corruption is impossible to pinpoint when the empty key is discovered in the sstable. So this patch adds checks for such keys to places where mutations are built: when building or unserializing mutations. The test row_cache_test/test_reading_of_nonfull_keys needs adjustment to work with the changes: it has to make the schema use compact storage, otherwise the non-full changes used by this test are rejected by the new checks. Fixes: https://github.com/scylladb/scylladb/issues/24506 (cherry picked from commit `ab96c703ff`)	2025-06-30 12:43:36 +00:00
Michał Chojnowski	9b98bacaa1	replica/memtable: move region_listener handlers from dirty_memory_manager to memtable The memtable wants to listen for changes in its `total_memory` in order to decrease its `_flushed_memory` in case some of the freed memory has already been accounted as flushed. (This can happen because the flush reader sees and accounts even outdated MVCC versions, which can be deleted and freed during the flush). Today, the memtable doesn't listen to those changes directly. Instead, some calls which can affect `total_memory` (in particular, the mutation cleaner) manually check the value of `total_memory` before and after they run, and they pass the difference to the memtable. But that's not good enough, because `total_memory` can also change outside of those manually-checked calls -- for example, during LSA compaction, which can occur anytime. This makes memtable's accounting inaccurate and can lead to unexpected states. But we already have an interface for listening to `total_memory` changes actively, and `dirty_memory_manager`, which also needs to know it, does just that. So what happens e.g. when `mutation_cleaner` runs is that `mutation_cleaner` checks the value of `total_memory` before it runs, then it runs, causing several changes to `total_memory` which are picked up by `dirty_memory_manager`, then `mutation_cleaner` checks the end value of `total_memory` and passes the difference to `memtable`, which corrects whatever was observed by `dirty_memory_manager`. To allow memtable to modify its `_flushed_memory` correctly, we need to make `memtable` itself a `region_listener`. Also, instead of the situation where `dirty_memory_manager` receives `total_memory` change notifications from `logalloc` directly, and `memtable` fixes the manager's state later, we want only the memtable to listen for the notifications and pass them, already modified accordingly, to the manager, so that there are no intermediate wrong states. 
This patch moves the `region_listener` callbacks from the `dirty_memory_manager` to the `memtable`. It's not intended to be a functional change, just a source code refactoring. The next patch will be a functional change enabled by this. (cherry picked from commit `7d551f99be`)	2025-06-24 13:06:06 +00:00
Benny Halevy	afa2b40ac9	disk_space_monitor: add space_source_registration Register the current space_source_fn in an RAII object that resets monitor._space_source to the previous function when the RAII object is destroyed. Use space_source_registration in database_test:: mutation_dump_generated_schema_deterministic_id_version to prevent use-after-stack-return in the test. Fixes #24314 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24342 (cherry picked from commit `8b387109fc`) Closes scylladb/scylladb#24392	2025-06-24 10:02:23 +03:00
Raphael S. Carvalho	fa420f8644	replica: Fix truncate assert failure Truncate doesn't really go well with concurrent writes. The fix (#23560) exposed a preexisting fragility which I missed. 1) truncate gets RP mark X, truncated_at = second T 2) new sstable written during snapshot or later, also at second T (difference of MS) 3) discard_sstables() gets RP Y > saved RP X, since creation time of sstable with RP Y is equal to truncated_at = second T. So the problem is that truncate is using a clock of second granularity for filtering out sstables written later, and after we got low mark and truncate time, it can happen that an sstable is flushed later within the same second, but at a different millisecond. By switching to a millisecond clock (db_clock), we allow sstables written later within the same second to be filtered out. It's not perfect, but it is extremely unlikely that a new write lands and gets flushed in the same millisecond in which we recorded the truncated_at timepoint. In practice, truncate will not be used concurrently with writes, so this should be enough for our tests performing such concurrent actions. We're moving away from gc_clock, which is our cheap lowres_clock, but time is only retrieved when creating sstable objects, whose creation frequency is low enough not to have significant consequences, and also db_clock should be cheap enough since it's usually syscall-less. Fixes #23771. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24426 (cherry picked from commit `2d716f3ffe`) Closes scylladb/scylladb#24435	2025-06-24 10:02:06 +03:00
Michał Chojnowski	3eba371e09	test/boost/mutation_reader_test: fix a use-after-free in `test_fast_forwarding_combined_reader_is_consistent_with_slicing` The contract in mutation_reader.hh says: ``` // pr needs to be valid until the reader is destroyed or fast_forward_to() // is called again. future<> fast_forward_to(const dht::partition_range& pr) { ``` `test_fast_forwarding_combined_reader_is_consistent_with_slicing` violates this by passing a temporary to `fast_forward_to`. Fix that. Fixes scylladb/scylladb#24542 Closes scylladb/scylladb#24543 (cherry picked from commit `27f66fb110`) Closes scylladb/scylladb#24548	2025-06-24 10:01:19 +03:00
Karol Nowacki	76bd23cddd	cql, schema: Extend name length limit from 48 to 192 bytes This commit increases the maximum length of names for keyspaces, tables, materialized views, and indexes from 48 to 192 bytes. The previous 48-bytes limit was inherited from Cassandra 3 for compatibility. However, this validation was removed in Cassandra 4 and 5 (see CASSANDRA-20389) and some usage scenarios (such as some feature store workflows generating long table names) now depend on this relaxed constraint. This change brings ScyllaDB's behavior in line with modern Cassandra versions and better supports these use cases. The new limit of 192 bytes is derived from underlying filesystem limitations to prevent runtime errors when creating directories for table data. When a new table is created, ScyllaDB generates a directory for its SSTables. The directory name is constructed from the table name, a dash, and a 32-character UUID. For a CDC-enabled table, an associated log table is also created, which has the suffix `_scylla_cdc_log` appended to its name. The directory name for this log table becomes the longest possible representation. Additionally we reserve 15 bytes for future use, allowing for potential future extensions without breaking existing schemas. To guarantee that directory creation never fails due to exceeding filesystem name limits, the maximum name length is calculated as follows: 255 bytes (common filesystem limit for a path component) - 32 bytes (for the 32-character UUID string) - 1 byte (for the '-' separator) - 15 bytes (for the '_scylla_cdc_log' suffix) - 15 bytes (reserved for future use) ---------- = 192 bytes (Maximum allowed name length) This calculation is similar in principle to the one proposed for Cassandra to fix related directory creation failures (see apache/cassandra/pull/4038). This patch also updates/adds all associated tests to validate the new 192-byte limit. The documentation has been updated accordingly. 
(cherry picked from commit `4577c66a04`)	2025-06-22 17:38:30 +00:00
Botond Dénes	a63b22eec6	Merge '[Backport 2025.2] tablets: fix missing data after tablet merge ' from Scylladb[bot] Consider the following scenario: 1) let's assume tablet 0 has range [1, 5] (pre merge) 2) tablet merge happens, tablet 0 has now range [1, 10] 3) tablet_sstable_set isn't refreshed, so holds a stale state, thinks tablet 0 still has range [1, 5] 4) during a full scan, forward service will intersect the full range with tablet ranges and consume one tablet at a time 5) replica service is asked to consume range [1, 10] of tablet 0 (post merge) We have two possible outcomes: With cache bypass: 1) cache reader is bypassed 2) sstable reader is created on range [1, 10] 3) unrefreshed tablet_sstable_set holds stale state, but select correctly all sstables intersecting with range [1, 10] With cache: 1) cache reader is created 2) finds partition with token 5 is cached 3) sstable reader is created on range [1, 4] (later would fast forward to range [6, 10]; also belongs to tablet 0) 4) incremental selector consumes the pre-merge sstable spanning range [1, 5] 4.1) since the partitioned_sstable_set pre-merge contains only that sstable, EOS is reached 4.2) since EOS is reached, the fast forward to range [6, 10] is not allowed. So with the set refreshed, sstable set is aligned with tablet ranges, and no premature EOS is signalled, otherwise preventing fast forward to from happening and all data from being properly captured in the read. This change fixes the bug and triggers a mutation source refresh whenever the number of tablets for the table has changed, not only when we have incoming tablets. Additionally, includes a fix for range reads that span more than one tablet, which can happen during split execution. Fixes: https://github.com/scylladb/scylladb/issues/23313 This change needs to be backported to all supported versions which implement tablet merge. 
- (cherry picked from commit `d0329ca370`) - (cherry picked from commit `1f9f724441`) - (cherry picked from commit `53df911145`) Parent PR: #24287 Closes scylladb/scylladb#24339 * github.com:scylladb/scylladb: replica: Fix range reads spanning sibling tablets test: add reproducer and test for mutation source refresh after merge tablets: trigger mutation source refresh on tablet count change	2025-06-17 08:35:14 +03:00
Raphael S. Carvalho	79958472bc	replica: Fix range reads spanning sibling tablets We don't guarantee that coordinators will only emit range reads that span only one tablet. Consider this scenario: 1) split is about to be finalized, barrier is executed, completes. 2) coordinator starts a read, uses pre-split erm (split not committed to group0 yet) 3) split is committed to group0, all replicas switch storage. 4) replica-side read is executed, uses a range which spans tablets. We could fix it with two-phase split execution. Rather than pushing the complexity to higher levels, let's fix the incremental selector, which should be able to serve all the tokens owned by a given shard. During split execution, neither of the sibling tablets is going anywhere since it runs with the state machine locked, so a single read spanning both sibling tablets works as long as the selector works across tablet boundaries. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `53df911145`)	2025-06-15 09:14:38 -03:00
Ernest Zaslavsky	4fed3a5a5a	encryption_test: Catch exact exception Apparently `test_kms_network_error` will succeed under any circumstances since most of our exceptions derive from `std::exception`, so whatever happens to the test, for whatever reason it throws, the test will be marked as passed. Start catching the exact exception that we expect to be thrown. Maybe somewhat related to https://github.com/scylladb/scylladb/issues/22628 Fixes: https://github.com/scylladb/scylladb/issues/24145 reapplies reverted: https://github.com/scylladb/scylladb/pull/24065 Should be backported to 2025.2. Closes scylladb/scylladb#24242 (cherry picked from commit `a39b773d36`) Closes scylladb/scylladb#24402	2025-06-06 08:48:02 +03:00
Piotr Dulikowski	6edf92a9e3	Merge '[Backport 2025.2] test/boost: Adjust tests to RF-rack-valid keyspaces' from Scylladb[bot] This PR adjusts existing Boost tests so they respect the invariant introduced by enabling `rf_rack_valid_keyspaces` configuration option. We disable it explicitly in more problematic tests. After that, we enable the option by default in the whole test suite. Fixes scylladb/scylladb#23958 Backport: backporting to 2025.1 to be able to test the implementation there too. - (cherry picked from commit `6e2fb79152`) - (cherry picked from commit `e4e3b9c3a1`) - (cherry picked from commit `1199c68bac`) - (cherry picked from commit `cd615c3ef7`) - (cherry picked from commit `fa62f68a57`) - (cherry picked from commit `22d6c7e702`) - (cherry picked from commit `237638f4d3`) - (cherry picked from commit `c60035cbf6`) Parent PR: scylladb/scylladb#23802 Closes scylladb/scylladb#24368 * github.com:scylladb/scylladb: test/lib/cql_test_env.cc: Enable rf_rack_valid_keyspaces by default test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity	2025-06-04 10:24:35 +02:00
Dawid Mędrek	9938183ace	test/boost/tablets_test.cc: Explicitly disable rf_rack_valid_keyspaces in problematic tests Some of the tests in the file verify more subtle parts of the behavior of tablets and rely on topology layouts or using keyspaces that violate the invariant the `rf_rack_valid_keyspaces` configuration option is trying to enforce. Because of that, we explicitly disable the option to be able to enable it by default in the rest of the test suite in the following commit. (cherry picked from commit `237638f4d3`)	2025-06-03 11:10:16 +00:00
Dawid Mędrek	1271b42848	test/boost/tablets_test.cc: Fix indentation in test_load_balancing_with_random_load (cherry picked from commit `22d6c7e702`)	2025-06-03 11:10:16 +00:00
Dawid Mędrek	012e248792	test/boost/tablets_test.cc: Adjust test_load_balancing_with_random_load to RF-rack-validity We make sure that the keyspaces created in the test are always RF-rack-valid. To achieve that, we change how the test is performed. Before this commit, we first created a cluster and then ran the actual test logic multiple times. Each of those test cases created a keyspace with a random replication factor. That cannot work with `rf_rack_valid_keyspaces` set to true. We cannot modify the property file of a node (see commit: `eb5b52f598`), so once we set up the cluster, we cannot adjust its layout to work with another replication factor. To solve that issue, we also recreate the cluster in each test case. Now we choose the replication factor at random, create a cluster distributing nodes across as many racks as RF, and perform the rest of the logic. We perform it multiple times in a loop so that the test behaves as before these changes. (cherry picked from commit `fa62f68a57`)	2025-06-03 11:10:16 +00:00
Dawid Mędrek	1364eec694	test/boost/tablets_test.cc: Adjust test_load_balancing_works_with_in_progress_transitions to RF-rack-validity We distribute the nodes used in the test across two racks so we can run the test with `rf_rack_valid_keyspaces` set to true. We want to avoid cross-rack migrations and keep the test as realistic as possible. Since host3 is supposed to function as a new node in the cluster, we change the layout of it: now, host1 has 2 shards and resides in a separate rack. Most of the remaining test logic is preserved and behaves as before this commit. There is a slight difference in the tablet migrations. Before the commit, we were migrating a tablet between nodes of different shard counts. Now it's impossible because it would force us to migrate tablets between racks. However, since the test wants to simply verify that an ongoing migration doesn't interfere with load balancing and still leads to a perfect balance, that still happens: we explicitly migrate ONLY 1 tablet from host2 to host3, so to achieve the goal, one more tablet needs to be migrated, and we test that. (cherry picked from commit `cd615c3ef7`)	2025-06-03 11:10:16 +00:00
Dawid Mędrek	85fe37a8e4	test/boost/tablets_test.cc: Adjust test_load_balancing_resize_requests to RF-rack-validity We assign the nodes created by the test to separate racks. It has no impact on the test since the keyspace used in the test uses RF=2, so the tablet replicas will still be the same. (cherry picked from commit `1199c68bac`)	2025-06-03 11:10:16 +00:00
Dawid Mędrek	e21bdbb9ef	test/boost/tablets_test.cc: Adjust test_load_balancing_with_two_empty_nodes to RF-rack-validity We distribute the nodes used in the test between two racks. Although that may affect how tablets behave in general, this change will not have any real impact on the test. The test verifies that load balancing eventually balances tablets in the cluster, which will still happen. Because of that, the changes in this commit are safe to apply. (cherry picked from commit `e4e3b9c3a1`)	2025-06-03 11:10:16 +00:00
Dawid Mędrek	ca8762885b	test/boost/tablets_test.cc: Adjust test_load_balancer_shuffle_mode to RF-rack-validity We distribute the nodes used in the test between two racks. Although that may have an impact on how tablets behave, it's orthogonal to what the test verifies -- whether the topology coordinator is continuously in the tablet migration track. Because of that, it's safe to make this change without influencing the test. (cherry picked from commit `6e2fb79152`)	2025-06-03 11:10:15 +00:00
Michał Chojnowski	3a7a1dc4a9	test/boost/sstable_compressor_factory_test: define a test suite name It seems that tests in test/boost/combined_tests have to define a test suite name, otherwise they aren't picked up by test.py. Fixes #24199 Closes scylladb/scylladb#24200 (cherry picked from commit `ff8a119f26`) Closes scylladb/scylladb#24255	2025-06-03 12:01:35 +03:00
Botond Dénes	9a7ea917eb	mutation/mutation_compactor: cache regular/shadowable max-purgable in separate members Max purgeable has two possible values for each partition: one for regular tombstones and one for shadowable ones. Yet currently a single member is used to cache the max-purgeable value for the partition, so whichever kind of tombstone is checked first, its max-purgeable will become sticky and apply to the other kind of tombstones too. E.g. if the first can_gc() check is for a regular tombstone, its max-purgeable will apply to shadowable tombstones in the partition too, meaning they might not be purged, even though they are purgeable, as the shadowable max-purgeable is expected to be more lenient. The other way around is worse, as it will result in regular tombstone being incorrectly purged, permitted by the more lenient shadowable tombstone max-purgeable. Fix this by caching the two possible values in two separate members. A reproducer unit test is also added. Fixes: scylladb/scylladb#23272 Closes scylladb/scylladb#24171 (cherry picked from commit `7db956965e`) Closes scylladb/scylladb#24329	2025-06-03 09:51:52 +03:00
Pavel Emelyanov	eb78d3aefb	test/result_utils: Do not assume map_reduce reducing order When map_reduce is called on a collection, one shouldn't expect that it processes the elements of the collection in any specific order. The current test of map-reduce over boost outcome assumes that if the reduce function is string concatenation, then it will concatenate the given vector of strings in the order they are listed. That requirement should be relaxed, and the result may be the reversed concatenation. Fixes scylladb/scylladb#24321 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#24325 (cherry picked from commit `a65ffdd0df`) Closes scylladb/scylladb#24337	2025-06-02 14:00:07 +03:00
Pavel Emelyanov	e215350c61	Revert "encryption_test: Catch exact exception" This reverts commit `59bf300e83`. KMS tests became flaky after it: #24218 Need to revisit.	2025-05-20 13:51:07 +03:00
Ernest Zaslavsky	59bf300e83	encryption_test: Catch exact exception Apparently `test_kms_network_error` will succeed under any circumstances since most of our exceptions derive from `std::exception`, so whatever happens to the test, for whatever reason it throws, the test will be marked as passed. Start catching the exact exception that we expect to be thrown. Closes scylladb/scylladb#24065 (cherry picked from commit `2d5c0f0cfd`) Closes scylladb/scylladb#24147	2025-05-20 08:27:56 +03:00
Ernest Zaslavsky	24c134992b	database_test: Wait for the index to be created Just call `wait_until_built` for the index in question fix: https://github.com/scylladb/scylladb/issues/24059 Closes scylladb/scylladb#24117 (cherry picked from commit `4a7c847cba`) Closes scylladb/scylladb#24132	2025-05-19 12:08:41 +03:00
Pavel Emelyanov	9058d5658b	Merge '[Backport 2025.2] logalloc_test: don't test performance in test background_reclaim' from Scylladb[bot] The test is failing in CI sometimes due to performance reasons. There are at least two problems: 1. The initial 500ms (wall time) sleep might be too short. If the reclaimer doesn't manage to evict enough memory during this time, the test will fail. 2. During the 100ms (thread CPU time) window given by the test to background reclaim, the `background_reclaim` scheduling group isn't actually guaranteed to get any CPU, regardless of shares. If the process is switched out inside the `background_reclaim` group, it might accumulate so much vruntime that it won't get any more CPU again for a long time. We have seen both. This kind of timing test can't be run reliably on overcommitted machines without modifying the Seastar scheduler to support that (by e.g. using thread clock instead of wall time clock in the scheduler), and that would require an amount of effort disproportionate to the value of the test. So for now, to unflake the test, this patch removes the performance test part. (And the tradeoff is a weakening of the test). After the patch, we only check that the background reclaim happens eventually. Fixes https://github.com/scylladb/scylladb/issues/15677 Backporting this is optional. The test is flaky even in stable branches, but the failure is rare. - (cherry picked from commit `c47f438db3`) - (cherry picked from commit `1c1741cfbc`) Parent PR: #24030 Closes scylladb/scylladb#24094 * github.com:scylladb/scylladb: logalloc_test: don't test performance in test `background_reclaim` logalloc: make background_reclaimer::free_memory_threshold publicly visible	2025-05-16 11:50:17 +03:00
Aleksandra Martyniuk	fcde30d2b0	streaming: use host_id in file streaming Use host ids instead of ips in file-streaming. Fixes: #22421. Closes scylladb/scylladb#24055 (cherry picked from commit `2dcea5a27d`) Closes scylladb/scylladb#24119	2025-05-14 22:13:48 +02:00
Michał Chojnowski	732321e3b8	test: add test/boost/sstable_compressor_factory_test Add a basic test for NUMA awareness of `default_sstable_compressor_factory`. (cherry picked from commit `f075674ebe`)	2025-05-12 09:12:05 +00:00
Michał Chojnowski	68d2086fa5	test: switch uses of make_sstable_compressor_factory() to a seastar::thread-dependent version In next patches, make_sstable_compressor_factory() will have to disappear. In preparation for that, we switch to a seastar::thread-dependent replacement. (cherry picked from commit `8649adafa8`)	2025-05-12 09:12:05 +00:00
Michał Chojnowski	403d43093f	test: remove sstables::test_env::do_with() `sstable_manager` depends on `sstable_compressor_factory&`. Currently, `test_env` obtains an implementation of this interface with the synchronous `make_sstable_compressor_factory()`. But after this patch, the only implementation of that interface `sstable_compressor_factory&` will use `sharded<...>`, so its construction will become asynchronous, and the synchronous `make_sstable_compressor_factory()` must disappear. There are several possible ways to deal with this, but I think the easiest one is to write an asynchronous replacement for `make_sstable_compressor_factory()` that will keep the same signature but will be only usable in a `seastar::thread`. All other uses of `make_sstable_compressor_factory()` outside of `test_env::do_with()` already are in seastar threads, so if we just get rid of `test_env::do_with()`, then we will be able to use that thread-dependent replacement. This is the purpose of this commit. We shouldn't be losing much. (cherry picked from commit `0e4d0ded8d`)	2025-05-12 09:12:04 +00:00
Michał Chojnowski	a5b513dde7	logalloc_test: don't test performance in test `background_reclaim` The test is failing in CI sometimes due to performance reasons. There are at least two problems: 1. The initial 500ms (wall time) sleep might be too short. If the reclaimer doesn't manage to evict enough memory during this time, the test will fail. 2. During the 100ms (thread CPU time) window given by the test to background reclaim, the `background_reclaim` scheduling group isn't actually guaranteed to get any CPU, regardless of shares. If the process is switched out inside the `background_reclaim` group, it might accumulate so much vruntime that it won't get any more CPU again for a long time. We have seen both. This kind of timing test can't be run reliably on overcommitted machines without modifying the Seastar scheduler to support that (by e.g. using thread clock instead of wall time clock in the scheduler), and that would require an amount of effort disproportionate to the value of the test. So for now, to unflake the test, this patch removes the performance test part. (And the tradeoff is a weakening of the test). (cherry picked from commit `1c1741cfbc`)	2025-05-09 16:12:22 +00:00
Michał Chojnowski	f29b87970a	test/boost/mvcc_test: fix an overly-strong assertion in test_snapshot_cursor_is_consistent_with_merging The test checks that merging the partition versions on-the-fly using the cursor gives the same results as merging them destructively with apply_monotonically. In particular, it tests that the continuity of both results is equal. However, there's a subtlety which makes this not true. The cursor puts empty dummy rows (i.e. dummies shadowed by the partition tombstone) in the output. But the destructive merge is allowed (as an exception to the general rule, for optimization reasons), to remove those dummies and thus reduce the continuity. So after this patch we instead check that the output of the cursor has continuity equal to the merged continuities of versions. (Rather than to the continuity of merged versions, which can be smaller as described above). Refs https://github.com/scylladb/scylladb/pull/21459, a patch which did the same in a different test. Fixes https://github.com/scylladb/scylladb/issues/13642 Closes scylladb/scylladb#24044 (cherry picked from commit `746ec1d4e4`) Closes scylladb/scylladb#24083	2025-05-09 13:00:34 +02:00
Botond Dénes	0a9ca52cfd	replica/database: memtable_list: save ref to memtable_table_shared_data This is passed by reference to the constructor, but a copy is saved into the _table_shared_data member. A reference to this member is passed down to all memtable readers. Because of the copy, the memtable readers save a reference to the memtable_list's member, which goes away together with the memtable_list when the storage_group is destroyed. This causes use-after-free when a storage group is destroyed while a memtable read is still ongoing. The memtable reader keeps the memtable alive, but its reference to the memtable_table_shared_data becomes stale. Fix by saving a reference in the memtable_list too, so memtable readers receive a reference pointing to the original replica::table member, which is stable across tablet migrations and merges. The copy was introduced by `2a76065e3d`. There was a copy even before this commit, but in the previous vnode-only world this was fine -- there was one memtable_list per table and it was around until the table itself was. In the tablet world, this is no longer given, but the above commit didn't account for this. A test is included, which reproduces the use-after-free on memtable migration. The test is somewhat artificial in that the use-after-free would be prevented by holding on to an ERM, but this is done intentionally to keep the test simple. Migration -- unlike merge where this use-after-free was originally observed -- is easy to trigger from unit tests. Fixes: #23762 Closes scylladb/scylladb#23984	2025-05-06 22:13:17 +03:00
Pavel Emelyanov	1b5bbc2433	Merge 'test.py: split boost pytest integration' from Andrei Chekun This PR contains changes that do not add new functionality, and have small refactoring of the existing code. The most significant change is the refactoring of resource gathering, so it will not create another cgroup to put itself in. So there will be no nested redundant 'initial' groups, e.g. `/sys/fs/cgroup/user.slice/user-1000.slice/user@1000.service/initial/initial/initial.../initial` This is part two of splitting the original PR. This PR is an extraction of several commits from https://github.com/scylladb/scylladb/pull/22894 as reviewer https://github.com/scylladb/scylladb/pull/22894?notification_referrer_id=NT_kwDOACiLR7MxNDg0ODk2MDU1MjoyNjU3MDk1&notifications_query=reason%3Aparticipating#pullrequestreview-2778582278. Closes scylladb/scylladb#23882 * github.com:scylladb/scylladb: test.py: add awareness of extra_scylla_cmdline_options test.py: increase timeout for C++ tests in pytest test.py: switch method of finding the root repo directory test.py: move get_combined_tests to the correct facade test.py: add common directory for reports test.py: add the possibility to provide additional env vars test.py: move setup cgroups to the generic method test.py: refactor resource_gather.py	2025-05-06 16:22:49 +03:00
Patryk Jędrzejczak	7f843e0a5c	Merge 'raft: make sure to retain the existing voters including the current leader (topology coordinator)' from Emil Maskovsky Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters. Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters. Improved the prioritization logic to account for the number of existing assigned voters in each data center and rack. Additionally, the limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the topology coordinator, triggering unnecessary Raft leader re-election. To address this, the topology coordinator's votership status is now preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the existing topology coordinator is prioritized for removal. This change ensures a more stable voter distribution and reduces unnecessary voter reassignments. The limited voters calculator is refactored to use a priority queue for sorting nodes by their priorities. This change simplifies the voter selection logic and makes it more extensible for future enhancements, such as supporting more complex priority calculations. Fixes: scylladb/scylladb#23950 Fixes: scylladb/scylladb#23588 Fixes: scylladb/scylladb#23786 No backport: The limited voters feature is currently only present in master. 
Closes scylladb/scylladb#23888 * https://github.com/scylladb/scylladb: raft: ensure topology coordinator retains votership raft: retain existing voters across data centers and racks raft: refactor limited voters calculator to prioritize nodes raft: replace pointer with reference for non-null output parameter raft: reduce code duplication in group0 voter handler raft: unify and optimize datacenter and rack info creation	2025-05-06 13:49:55 +02:00
Avi Kivity	fc2204cea0	Merge ' test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits' from Botond Dénes This test has multiple problems: * has 3 embedded loops to run different scenarios, ignores variable from 2 of these, running with hardcoded settings instead * initializes misses and lookups to 0 at the start of each scenario, this throws off per-page increment checks, when the previous scenario moved these metrics and they don't start from 0; this causes the test to sometimes fail * duplicate check of drops == 0 (just cosmetic) Fix all three problems, the second is especially important because it made the test flaky. Additionally, ensure the test will keep using vnodes in the future, by explicitly creating a vnodes keyspace for them. Fixes: #16794 Test fix, not a backport candidate normally, we can backport to 2025.1 if the test becomes too unstable there Closes scylladb/scylladb#23783 * github.com:scylladb/scylladb: test/boost/multishard_mutation_query_test: ensure test runs with vnodes test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits	2025-05-05 20:49:03 +03:00
Emil Maskovsky	24dfd2034b	raft: ensure topology coordinator retains votership The limited voters feature did not account for the existing topology coordinator (Raft leader) when selecting voters to be removed. As a result, the limited voters calculator could inadvertently remove the votership of the current topology coordinator, triggering an unnecessary Raft leader re-election. This change ensures that the existing topology coordinator's votership status is preserved unless absolutely necessary. When choosing between otherwise equivalent voters, the node other than the topology coordinator is prioritized for removal. This helps maintain stability in the cluster by avoiding unnecessary leader re-elections. Additionally, only the alive leader node is considered relevant for this logic. A dead existing leader (topology coordinator) is excluded from consideration, as it is already in the process of losing leadership. Fixes: scylladb/scylladb#23588 Fixes: scylladb/scylladb#23786	2025-05-05 16:58:34 +02:00
Emil Maskovsky	2ae59e8a87	raft: retain existing voters across data centers and racks Fix an issue in the voter calculator where existing voters were not retained across data centers and racks in certain scenarios. This occurred when voters were distributed across more data centers and racks than the maximum allowed number of voters. Previously, the prioritization logic for data centers and racks did not consider the number of existing assigned voters. It only prioritized nodes within a single data center or rack, which could result in unnecessary reassignment of voters. Improved the prioritization logic to account for the number of existing voters in each data center and rack. This change ensures a more stable voter distribution and reduces unnecessary voter reassignments. Fixes: scylladb/scylladb#23950	2025-05-05 16:51:48 +02:00
Botond Dénes	855411caad	test/boost/multishard_mutation_query_test: ensure test runs with vnodes All tests in this suite use the default "ks" keyspace from cql_test_env. This keyspace has tablet support and at any time we might decide to make it use tablets by default. This would make all these tests use the tablet path in multishard_mutation_query.cc. These tests were created to test the vastly more complex vnodes code path in said file. The tablet path is much simpler and it is only used by SELECT * FROM MUTATION_FRAGMENTS(), which has its own correctness tests. So explicitly create a vnodes keyspace and use it in all the tests to restore the test functionality.	2025-05-05 09:22:54 -04:00
Botond Dénes	1175e1ed49	test/boost/multishard_mutation_query_test: fix test_read_with_partition_row_limits This test has multiple problems: * has 3 embedded loops to run different scenarios, ignores variable from 2 of these, running with hardcoded settings instead * initializes misses and lookups to 0 at the start of each scenario, this throws off per-page increment checks, when the previous scenario moved these metrics and they don't start from 0; this causes the test to sometimes fail * duplicate check of drops == 0 (just cosmetic) Fix all three problems, the second is especially important because it made the test flaky.	2025-05-05 09:22:53 -04:00
Pavel Emelyanov	b56d6fbb84	Merge 'sstables: Fix quadratic space complexity in partitioned_sstable_set' from Raphael Raph Carvalho Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range. Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried. And for range scans, there can be an increase in memory usage, but not significant because the sstables span a wide range and would have been selected in the combined reader if the range of scan overlaps with them. Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly. Fixes #23634. We can backport this into 2024.2, 2025.1, but we should let this cook in master for 1 month or so. Closes scylladb/scylladb#23806 * github.com:scylladb/scylladb: test: Verify partitioned set store split and unsplit correctly sstables: Fix quadratic space complexity in partitioned_sstable_set compaction: Wire table_state into make_sstable_set() compaction: Introduce token_range() to table_state dht: Add overlap_ratio() for token range	2025-05-05 11:28:38 +03:00
Piotr Dulikowski	05c797795f	Merge 'Simplify test/sstable_assertions class API' from Pavel Emelyanov It had recently been patched to re-use the sstables::test class functionality (scylladb/scylladb#23697), now it can be put on some more strict diet. Closes scylladb/scylladb#23815 * github.com:scylladb/scylladb: test: Remove sstable_assertions::get_stats_metadata() test: Add sstable_assertions::operator->()	2025-05-05 09:33:45 +02:00
Raphael S. Carvalho	c77f710a0c	sstables: Fix quadratic space complexity in partitioned_sstable_set Interval map is very susceptible to quadratic space behavior when it's flooded with many entries overlapping all (or most of) intervals, since each such entry will have presence on all intervals it overlaps with. A trigger we observed was memtable flush storm, which creates many small "L0" sstables that span roughly the entire token range. Since we cannot rely on insertion order, solution will be about storing sstables with such wide ranges in a vector (unleveled). There should be no consequence for single-key reads, since upper layer applies an additional filtering based on token of key being queried. And for range scans, there can be an increase in memory usage, but not significant because the sstables span a wide range and would have been selected in the combined reader if the range of scan overlaps with them. Anyway, this is a protection against storm of memtable flushes and shouldn't be the common scenario. It works both with tablets and vnodes, by adjusting the token range spanned by compaction group accordingly. Fixes #23634. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00
Raphael S. Carvalho	21d1e78457	compaction: Wire table_state into make_sstable_set() This will be useful for feeding token range owned by compaction group into sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-04-29 15:47:33 -03:00


3902 Commits