scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-02 14:15:46 +00:00

Author	SHA1	Message	Date
Michael Litvak	cc94467097	test: test drop table during streaming Add a test that drops a table while tablet streaming is running for the table. The table is dropped after taking the storage snapshot and initializing streaming sources - after that streaming should be able to complete or abort correctly if the table is dropped. We want to verify there is no incorrect access to the destroyed table. The test tests both types of streaming in stream_blob - sstables and logstor segments.	2026-04-15 19:23:00 +02:00
Marcin Maliszkiewicz	8401e9cbbd	test: filter benign errors in tests that grep logs during shutdown Apply filter_errors() to grep_for_errors() results in test_split_stopped_on_shutdown and test_group0_apply_while_node_is_being_shutdown. Without filtering, benign RPC errors like 'connection dropped: Semaphore broken' that occur during graceful shutdown cause spurious test failures.	2026-04-13 18:33:41 +02:00
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Raphael S. Carvalho	b6ebbbf036	test/cluster/test_tablets2: Fix test_split_stopped_on_shutdown race with stale log messages The test was failing because the call to: await log.wait_for('Stopping.ongoing compactions') was missing the 'from_mark=log_mark' argument. The log mark was updated (line: log_mark = await log.mark()) immediately after detecting 'splitting_mutation_writer_switch_wait: waiting', and just before launching the shutdown task. However, the wait_for call on the following line was scanning from the beginning of the log, not from that mark. As a result, the search immediately matched old 'Stopping N tasks for N ongoing compactions for table system.X due to table removal' messages emitted during initial server bootstrap (for system.large_partitions, system.large_rows, system.large_cells), rather than waiting for the shutdown to actually stop the user-table split compaction. This caused the test to prematurely send the message to the 'splitting_mutation_writer_switch_wait' injection. The split compaction was unblocked before the shutdown had aborted it, so it completed successfully. Since the split succeeded, 'Failed to complete splitting of table' was never logged. Meanwhile, 'storage_service_drain_wait' was blocking do_drain() waiting for a message. With the split already done, the test was stuck waiting for the expected failure log that would never come (600s timeout). At the same time, after 60s the 'storage_service_drain_wait' injection timed out internally, triggering on_internal_error() which -- with --abort-on-internal-error=1 -- crashed the server (exit code -6). Fix: pass from_mark=log_mark to the wait_for('Stopping.ongoing compactions') call so it only matches messages that appear after the shutdown has started, ensuring the test correctly synchronizes with the shutdown aborting the user-table split compaction before releasing the injection. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1319. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29311	2026-04-03 06:28:51 +03:00
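The stale-match problem described in the commit above can be illustrated with a minimal sketch. `ToyLog` is a hypothetical in-memory stand-in, not the test.py log helper, and the regular expression and log lines are assumptions chosen for illustration; only the idea of searching from a mark comes from the commit message.

```python
import re


class ToyLog:
    """Hypothetical in-memory log; NOT the real test.py log-browsing class."""

    def __init__(self):
        self.lines = []

    def append(self, line):
        self.lines.append(line)

    def mark(self):
        # A mark is simply the position just past the last line written so far.
        return len(self.lines)

    def find(self, pattern, from_mark=0):
        # Return the index of the first matching line at or after from_mark.
        rx = re.compile(pattern)
        for idx in range(from_mark, len(self.lines)):
            if rx.search(self.lines[idx]):
                return idx
        return None


PATTERN = r"Stopping .* ongoing compactions"  # assumed pattern for this sketch

log = ToyLog()
# Message emitted during bootstrap, long before the shutdown under test.
log.append("Stopping 1 tasks for 1 ongoing compactions for table system.large_rows due to table removal")
log.append("splitting_mutation_writer_switch_wait: waiting")
log_mark = log.mark()

# Without a mark, the search matches the stale bootstrap message immediately.
stale_hit = log.find(PATTERN)
# With the mark, nothing matches until shutdown really stops the compaction.
fresh_hit = log.find(PATTERN, from_mark=log_mark)

log.append("Stopping 1 tasks for 1 ongoing compactions for table ks.test due to shutdown")
after_shutdown = log.find(PATTERN, from_mark=log_mark)
```

Here `stale_hit` points at the bootstrap-time line, while the marked search only succeeds once the shutdown-time line has been appended.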
Raphael S. Carvalho	3143134968	test: avoid split/major compaction deadlock in tablet split test Run keyspace compaction asynchronously in `test_tombstone_gc_correctness_during_tablet_split` and only await it after `split_sstable_rewrite` is disabled. The problem is that `keyspace_compaction()` starts with a flush, and that flush can take around five seconds. During that window the split compaction is stopped before major compaction is retried. The stop aborts the in-flight major compaction attempt, then the split proceeds far enough to enter the `split_sstable_rewrite` injection point. At that point the test used to wait synchronously for major compaction to finish, but major compaction cannot finish yet: when it retries, it needs the same semaphore that is still effectively tied up behind the blocked split rewrite. So the test waits for major compaction, while the split waits for the injection to be released, and the code that would release that injection never runs. Starting major compaction as a task breaks that cycle. The test can first disable `split_sstable_rewrite`, let the split get out of the way, and only then wait for major compaction to complete. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-827. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#29066	2026-03-19 11:12:21 +02:00
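The ordering fix in the commit above can be sketched with plain asyncio. This is a toy model under stated assumptions: the semaphore, the event, and both coroutines are hypothetical stand-ins for the compaction semaphore, the `split_sstable_rewrite` injection, and the real split and major compaction; none of it is ScyllaDB or test.py code.

```python
import asyncio


async def main():
    injection_released = asyncio.Event()  # stand-in for the split_sstable_rewrite injection
    sem = asyncio.Semaphore(1)            # stand-in for the shared compaction semaphore
    split_entered = asyncio.Event()
    order = []

    async def split():
        async with sem:
            split_entered.set()
            # Blocked inside the injection point while still holding the semaphore.
            await injection_released.wait()
            order.append("split")

    async def major_compaction():
        # Needs the semaphore that the blocked split is holding.
        async with sem:
            order.append("major")

    split_task = asyncio.create_task(split())
    await split_entered.wait()

    # Start major compaction as a task instead of awaiting it here:
    # awaiting it at this point would hang, because only the code below
    # releases the injection that lets the split give up the semaphore.
    major_task = asyncio.create_task(major_compaction())

    injection_released.set()  # let the split get out of the way first
    await major_task          # only now wait for major compaction
    await split_task
    return order


order = asyncio.run(main())
```

In this model the split finishes first and major compaction completes afterwards, which mirrors the sequence the fixed test relies on.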
Botond Dénes	ae17596c2a	Merge 'Demote log level on split failure during shutdown' from Raphael Raph Carvalho Since commit `509f2af8db`, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951. Fixes https://github.com/scylladb/scylladb/issues/24850. Only 2026.1 is affected. Closes scylladb/scylladb#29032 * github.com:scylladb/scylladb: replica: Demote log level on split failure during shutdown service: Demote log level on split failure during shutdown	2026-03-18 16:21:05 +02:00
Dawid Mędrek	a8dd13731f	Merge 'Improve debuggability of test/cluster/test_data_resurrection_in_memtable.py' from Botond Dénes This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed: * storage-service: add table name to mutation write failure error messages. * database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to the selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit. * test/test_data_resurrection_in_memtable.py: dump data from the table before the checks for expected data, so if checks fail, the data in the table is known. Refs: SCYLLADB-812 Refs: SCYLLADB-870 Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces) Backport: test related improvement, no backport Closes scylladb/scylladb#28899 * github.com:scylladb/scylladb: test/cluster/test_data_resurrection_in_memtable.py: dump rows before check replica/database: consolidate the two database_apply error injections service/storage_proxy: add name of table to error message for write errors	2026-03-17 13:35:19 +01:00
Raphael S. Carvalho	b508f3dd38	service: Demote log level on split failure during shutdown Since commit `509f2af8db`, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2026-03-16 11:52:00 -03:00
Artsiom Mishuta	755d528135	test.py: fix warnings Changes in this commit: 1) rename class from 'TestContext' to 'Context' so pytest will not consider this class as a test 2) extend pytest filterwarnings list to ignore warnings from external libs 3) use datetime.datetime.now(datetime.UTC) instead of datetime.datetime.utcnow() 4) use ResultSet.one() instead of ResultSet[0] Fixes SCYLLADB-904 Fixes SCYLLADB-908 Related SCYLLADB-902 Closes scylladb/scylladb#28956	2026-03-15 12:00:10 +02:00
Botond Dénes	f375aae257	replica/database: consolidate the two database_apply error injections Into a single database_apply one. Add three parameters: * ks_name and cf_name to filter the tables to be affected * what - what to do: throw or wait This leads to smaller footprint in the code and improved filtering for table names at the cost of some extra error injection params in the tests.	2026-03-05 11:44:02 +02:00
Petr Gusev	c785d242a7	tests: extract get_topology_version helper This is a refactoring commit. We need to load the cluster version for a host in several places, so extract a helper for this.	2026-02-16 08:57:42 +01:00
Petr Gusev	df73f723a6	storage_proxy: hold erms in replica handlers Add explicit erm-holding variables in all replica-side RPC handlers. This is required to ensure that tablet migration waits for in-flight replica requests even if a non-replica coordinator has been fenced out. Holding erms on the replica side may increase the global-barrier wait time, since the barrier must drain these requests. We believe this is acceptable because: * We already hold erms during replica-side request execution, but in an ad-hoc, non-systemic way in lower layers of storage_proxy (e.g. in sp::mutate_locally and do_query_tablets). * Replica requests are bounded by replica-side timeouts, so the global-barrier wait time cannot exceed the maximum of these timeouts. For Paxos verbs, we use token_metadata_guard, which wraps the ERM and automatically refreshes it when tablet migration does not affect the current token; see the token_metadata_guard comments for details. We use this guard only for Paxos verbs because regular reads and writes already hold raw erms in storage_proxy and on the coordinators. The erms must be held in all RPC handlers that support fencing — that is, those with a fencing_token parameter in storage_proxy.idl. Counter updates already hold erms in mutate_counter_on_leader_and_replicate. Fix test_tablets2::test_timed_out_reader_after_cleanup: the tablets barrier now waits for all nodes. As a result, the replica read is expected to finish, rather than fail due to the tablet having moved as it did previously. The test is renamed to test_tablets_barrier_waits_for_replica_erms to better reflect its purpose. Refs scylladb/scylladb#26864	2026-02-16 08:57:42 +01:00
Ferenc Szili	92dbde54a5	test: add test and reproducer for load_stats refresh exception This patch adds a test and reproducer for the issue where the load_stats refresh procedure throws exceptions if any of the tables have been dropped since load_stats was produced.	2026-01-30 15:11:29 +01:00
Andrei Chekun	cc5ac75d73	test.py: remove deprecated skip_mode decorator Finishing the deprecation of the skip_mode function in favor of pytest.mark.skip_mode. This PR only cleans up and migrates leftover tests that still use the old way of skip_mode. Closes scylladb/scylladb#28299	2026-01-25 18:17:27 +02:00
Tomasz Grabiec	5e6935f276	test: Use ManagerClient.{disable,enable}_tablet_balancing()	2026-01-13 00:38:00 +01:00
Tomasz Grabiec	6936704677	test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API If a test tries to move a tablet, it assumes the tablets are stable. This fixes flakiness exposed by size-based load-balancing and a later change to refresh stats sooner.	2026-01-13 00:38:00 +01:00
Andrei Chekun	c950c2e582	test.py: convert skip_mode function to pytest.mark Function skip_mode works only on functions and only in cluster tests. This is OK when we need to skip one test, but it's not possible to use it with pytestmark to automatically mark all tests in the file. The goal of this PR is to migrate skip_mode to be a dynamic pytest.mark that can be used as an ordinary mark. Closes scylladb/scylladb#27853 [avi: apply to test/cluster/test_tablets.py::test_table_creation_wakes_up_balancer]	2026-01-08 21:55:16 +02:00
Ferenc Szili	a51cb3dad9	test: fix flaky test_update_load_stats_after_migration Disable load balancing to avoid the balancer moving the tablet from a node with less to a node with more available disk space. Otherwise, the move_tablet API can fail (if the tablet is already in transition) or be a no-op (in case the tablet has already been migrated). Fixes: #27980 Closes scylladb/scylladb#27993	2026-01-06 11:57:35 +02:00
Ferenc Szili	10eb364821	load_balancer: implement size-based load balancing This change introduces tablet size based load balancing. It is an extension of capacity based balancing with the addition of actual tablet sizes. It computes the difference between the most and least loaded nodes in the DC and stops further balancing if this difference is below the config option size_based_balance_threshold_percentage. This config option does not apply to the absolute load, but instead to the percentage of how much the most loaded node is more loaded than the least loaded node: delta = (most_loaded - least_loaded) / most_loaded If this delta is smaller than the config threshold, the balancer will consider the nodes balanced.	2025-12-27 11:20:20 +01:00
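The balance criterion stated in the commit above can be written out as a small sketch. The function name, the handling of an empty cluster, and the assumption that the threshold is compared against `delta` expressed in percent are all illustrative; only the `delta` formula itself comes from the commit message.

```python
def is_balanced(most_loaded, least_loaded, threshold_percentage):
    """Hypothetical helper: apply the delta formula from the commit message.

    most_loaded / least_loaded are the loads of the most and least loaded
    nodes in the DC; threshold_percentage plays the role of
    size_based_balance_threshold_percentage (assumed to be given in percent).
    """
    if most_loaded <= 0:
        # Nothing is loaded, so there is nothing to balance (an assumption).
        return True
    delta = (most_loaded - least_loaded) / most_loaded
    return delta * 100 < threshold_percentage


# With loads 100 and 96, delta is 4%, which is under a 5% threshold.
close_enough = is_balanced(100, 96, 5)
# With loads 100 and 90, delta is 10%, so balancing would continue.
needs_balancing = not is_balanced(100, 90, 5)
```

Because `delta` is relative to the most loaded node, the same threshold behaves identically on small and large clusters.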
Raphael S. Carvalho	a0a7941eb1	test: Add reproducer for split vs intra-node migration race This is a problem caught after removing split from add_sstable_and_update_cache(), which was used by intra node migration when loading new sstables into the destination shard. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:18 -03:00
Raphael S. Carvalho	77a4f95eb8	test: Add reproducer for split vs incremental repair race condition Refs #26041. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:16 -03:00
Ferenc Szili	d883ff2317	test: fix flakiness caused by TRUNCATE retries The test test_truncate_during_topology_change tests TRUNCATE TABLE while bootstrapping a new node. With tablets enabled TRUNCATE is a global topology operation which needs to serialize with bootstrap. When TRUNCATE TABLE is issued, it first checks if there is an already queued truncate for the same table. This can happen if a previous TRUNCATE operation has timed out, and the client retried. The newly issued truncate will only join the queued one if it is waiting to be processed, and will fail immediately if the TRUNCATE is already being processed. In this test, TRUNCATE will be retried after a timeout (1 minute) due to the default retry policy, and will be retried up to 3 times, while the bootstrap is delayed by 2 minutes. This means that the test can validate the result of a truncate which was started after bootstrap was completed. Because of the way truncate joins existing truncate operations, we can also have the following scenario: - TRUNCATE times out after one minute because the new node is being bootstrapped - the client retries the TRUNCATE command which also times out after 1m - the third attempt is received during TRUNCATE being processed which fails the test This patch changes the retry policy of the TRUNCATE operation to FallthroughRetryPolicy which guarantees that TRUNCATE will not be retried on timeout. It also increases the timeout of the TRUNCATE from 1 to 4 minutes. This way the test will actually validate the performance of the TRUNCATE operation which was issued during bootstrap, instead of the subsequent, retried TRUNCATEs which could have been issued after the bootstrap was complete. Fixes: #26347 Closes scylladb/scylladb#27245	2025-12-08 14:13:26 +02:00
Ferenc Szili	39711920eb	test: add tests for tablet size migration during end_migration This change adds tests for the correctness of tablet size migration during the end_migrations stage. This size migration can happen for tablet migrations and for tablet rebuild.	2025-11-21 16:58:11 +01:00
Raphael S. Carvalho	74ecedfb5c	replica: Fail timed-out single-key read on cleaned up tablet replica Consider the following: 1) single-key read starts, blocks on replica e.g. waiting for memory. 2) the same replica is migrated away 3) single-key read expires, coordinator abandons it, releases erm. 4) migration advances to cleanup stage, barrier doesn't wait on timed-out read 5) compaction group of the replica is deallocated on cleanup 6) that single-key resumes, but doesn't find sstable set (post cleanup) 7) with abort-on-internal-error turned on, node crashes It's fine for abandoned (= timed out) reads to fail, since the coordinator is gone. For active reads (non timed out), the barrier will wait for them since their coordinator holds erm. This solution consists of failing reads which underlying tablet replica has been cleaned up, by just converting internal error to plain exception. Fixes #26229. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#27078	2025-11-20 11:44:03 +02:00
Raphael S. Carvalho	7f34366b9d	sstables_loader: Don't bypass synchronization with busy topology The patch `c543059f86` fixed the synchronization issue between tablet split and load-and-stream. The synchronization worked only with raft topology, and therefore was disabled with gossip. To do the check, storage_service::raft_topology_change_enabled() but the topology kind is only available/set on shard 0, so it caused the synchronization to be bypassed when load-and-stream runs on any shard other than 0. The reason the reproducer didn't catch it is that it was restricted to single cpu. It will now run with multi cpu and catch the problem observed. Fixes #22707 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#26730	2025-11-03 18:10:08 +01:00
Pavel Emelyanov	ac1d709709	Merge 'tablet_sstable_streamer: defer SSTable unlinking until fully streamed' from Taras Veretilnyk When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces unlink() with mark_for_deletion() deferring actual unlinking till sstable::close_files. test_tablets2::test_tablet_load_and_stream was enhanced to also verify that SSTables are removed after being streamed. Fixes #26606 Backport is not required, although it is a bug fix, but it isn't visible. This is more of a preparatory fix for https://github.com/scylladb/scylladb/pull/26444. Closes scylladb/scylladb#26622 * github.com:scylladb/scylladb: test_tablets2: verify SSTable cleanup after tablet load and stream tablet_sstable_streamer: replace unlink() call with mark_for_deletion()	2025-10-28 13:25:40 +03:00
Taras Veretilnyk	1361ae7a0a	test_tablets2: verify SSTable cleanup after tablet load and stream Modify existing test_tablet_load_and_stream testcase to verify that SSTable files are properly deleted from the upload directory after streaming.	2025-10-27 16:36:08 +01:00
Botond Dénes	c543059f86	Merge 'Synchronize tablet split and load-and-stream' from Raphael Raph Carvalho Load-and-stream is broken when running concurrently to the finalization step of tablet split. Consider this: 1) split starts 2) split finalization executes barrier and succeeds 3) load-and-stream runs now, starts writing sstable (pre-split) 4) split finalization publishes changes to tablet metadata 5) load-and-stream finishes writing sstable 6) sstable cannot be loaded since it spans two tablets Two possible fixes (maybe both): 1) load-and-stream waits for topology to quiesce 2) perform split compaction on sstable that spans both sibling tablets This patch implements # 1. By waiting for topology to quiesce, we guarantee that load-and-stream only starts when there's no chance the coordinator is handling some topology operation like split finalization. Fixes https://github.com/scylladb/scylladb/issues/26455. Closes scylladb/scylladb#26456 * github.com:scylladb/scylladb: test: Add reproducer for l-a-s and split synchronization issue sstables_loader: Synchronize tablet split and load-and-stream	2025-10-21 09:43:38 +03:00
Raphael S. Carvalho	4654cdc6fd	test: Add reproducer for l-a-s and split synchronization issue Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-10-20 19:17:25 -03:00
Tomasz Grabiec	c9f0a9d0eb	tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true Greatly improves performance of plan making, because we don't consider candidates in other racks, most of which will fail to be selected due to replication constraints (no rack overload). Also (but minor) reduces the overhead of candidate evaluation, as we don't have to evaluate rack load. Enabled only for rf_rack_valid_keyspaces because such setups guarantee that we will not need (because we must not) move tablets across racks, and we don't need to execute the general algorithm for the whole DC. Tested with perf-load-balancing, which performs a single scale-out operation on a cluster which initially has 10 nodes 88 shards each, 2 racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new nodes (same shard count). Time to rebalance the cluster (plan making only, sum of all iterations, no streaming): Before: 16 min 25 s After: 0 min 25 s Before, plan making cost (single incremental iteration) alternated between fast (0.1 [s]) and slow (14.1 [s]): Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0) Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0) The slow run chose min and max nodes in different racks, hence the fast path failed to find any candidates and we switched to exhaustive search of candidates in other nodes. After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls. Fixes #26016	2025-09-23 00:30:37 +02:00
Raphael S. Carvalho	68f23d54d8	replica: Fix split compaction when tablet boundaries change Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-09-07 05:20:23 -03:00
Michael Litvak	5c28cffdb4	test/pylib/rest_client: fix ScyllaMetrics filtering In the ScyllaMetrics `get` function, when requesting the value for a specific shard, it is expected to return the sum of all values of metrics for that shard that match the labels. However, it would return the value of the first matching line it finds instead of summing all matching lines. For example, if we have two lines for one shard like: some_metric{scheduling_group_name="compaction",shard="0"} 1 some_metric{scheduling_group_name="sl:default",shard="0"} 2 The result of this call would be 1 instead of 3: get('some_metric', shard="0") We fix this to sum all matching lines. The filtering of lines by labels is fixed to allow specifying only some of the labels. Previously, for the line to match the filter, either the filter needs to be empty, or all the labels in the metric line had to be specified in the filter parameter and match its value, which is unexpected, and breaks when more labels are added. We also simplify the function signature and the implementation - instead of having the shard as a separate parameter, it can be specified as a label, like any other label.	2025-08-10 10:16:00 +02:00
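The summing and partial-label matching described in the commit above can be sketched as follows. This is not the `ScyllaMetrics` implementation from test/pylib/rest_client: the function signature, the regular expressions, and the `None` return for no match are assumptions made for illustration.

```python
import re

# Simplified line format: name{label="value",...} number
LINE_RE = re.compile(
    r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)(?:\{(?P<labels>[^}]*)\})?\s+(?P<value>\S+)$'
)
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def get(metrics_text, name, **labels):
    """Hypothetical helper: sum every line of `name` whose labels include `labels`.

    Labels present on the line but absent from the filter are ignored, so the
    filter keeps working when new labels are added to a metric.
    """
    total = None
    for line in metrics_text.splitlines():
        m = LINE_RE.match(line.strip())
        if not m or m.group("name") != name:
            continue
        line_labels = dict(LABEL_RE.findall(m.group("labels") or ""))
        if all(line_labels.get(key) == value for key, value in labels.items()):
            total = (total or 0.0) + float(m.group("value"))
    return total


sample = '''
some_metric{scheduling_group_name="compaction",shard="0"} 1
some_metric{scheduling_group_name="sl:default",shard="0"} 2
'''

shard_total = get(sample, "some_metric", shard="0")
compaction_only = get(sample, "some_metric", scheduling_group_name="compaction", shard="0")
```

Treating the shard as an ordinary label, as the commit describes, means the same filter code handles every label uniformly.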
Artsiom Mishuta	4b975668f6	tiering (test.py): introduce tiering labels Introduce tiering marks: 1) “unstable” - for unstable tests that will continue running every night and generate up-to-date failure statistics without failing the “Main” verification path (scylla-ci, Next). 2) “nightly” - for tests that are quite old, stable, and test functionality that is unlikely to be changed or affected by other features, are partially covered by other tests, verify non-critical functionality, have not found any issues or regressions, are too long to run on every PR, and can be dropped from the CI run. Set 7 long tests (according to statistics in Elastic) as nightly (these 8 tests took 20% of the CI run, about 4 hours without parallelization); 1 test as unstable (as an example of marker usage). Closes scylladb/scylladb#24974	2025-08-04 15:38:16 +03:00
Taras Veretilnyk	1d6808aec4	topology_coordinator: Make tablet_load_stats_refresh_interval configurable This commit introduces a config option 'tablet_load_stats_refresh_interval_in_seconds' that allows overriding the default value without using error injection. Fixes scylladb/scylladb#24641 Closes scylladb/scylladb#24746	2025-07-31 14:31:55 +03:00
Tomasz Grabiec	a1d7722c6d	Merge 'api: repair_async: refuse repairing tablet keyspaces' from Aleksandra Martyniuk A tablet repair started with /storage_service/repair_async/ API bypasses tablet repair scheduler and repairs only the tablets that are owned by the requested node. Due to that, to safely repair the whole keyspace, we need to first disable tablet migrations and then start repair on all nodes. With the new API - /storage_service/tablets/repair - tailored to tablet repair requirements, we do not need additional preparation before repair. We may request it on one node in a cluster only and, thanks to tablet repair scheduler, a whole keyspace will be safely repaired. Both nodetool and Scylla Manager have already started using the new API to repair tablets. Refuse repairing tablet keyspaces with /storage_service/repair_async - 403 Forbidden is returned. repair_async should still be used to repair vnode keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/23008. Breaking change; no backport. Closes scylladb/scylladb#24678 * github.com:scylladb/scylladb: repair: remove unused code api: repair_async: forbid repairing tablet keyspaces	2025-07-27 09:25:42 +02:00
Ferenc Szili	7ce96345bf	test: remove test_tombstone_gc_disabled_on_pending_replica The test test_tombstone_gc_disabled_on_pending_replica was added when we fixed (#20788) the potential problem with data resurrection during file based streaming. The issue was occurring only in Enterprise, but we added the fix in OSS to limit code divergence. This test was added together with the fix in OSS with the idea to guard this change in OSS. The real reproducer and test for this fix was added later, after the fix was ported into Enterprise. It is in: test/cluster/test_resurrection.py Since Enterprise has been merged into OSS, there is no more need to keep the test test_tombstone_gc_disabled_on_pending_replica. Also, it is flaky with very low probability of failure, making it difficult to investigate the cause of failure. Fixes: #22182 Closes scylladb/scylladb#25134	2025-07-25 10:45:32 +03:00
Aleksandra Martyniuk	a0031ad05e	api: repair_async: forbid repairing tablet keyspaces Return 403 Forbidden if a user tries to repair tablet keyspace with /storage_service/repair_async/ API.	2025-07-24 11:11:09 +02:00
Michael Litvak	e01aae7871	test/pylib/tablets: common get_tablet_count api Introduce a common get_tablet_count test API instead of duplicating it in a few tests, and fix it to read the tablet count from the base table.	2025-07-01 13:20:19 +03:00
Dawid Mędrek	c4b32c38a3	test/cluster: Disable rf_rack_valid_keyspaces in problematic tests Some of the tests in the test suite have proven to be more problematic in adjusting to RF-rack-validity. Since we'd like to run as many tests as possible with the `rf_rack_valid_keyspaces` configuration option enabled, let's disable it in those. In the following commit, we'll enable it by default.	2025-05-10 16:30:49 +02:00
Dawid Mędrek	dbb8835fdf	test/cluster: Adjust simple tests to RF-rack-validity We adjust all of the simple cases of cluster tests so they work with `rf_rack_valid_keyspaces: true`. It boils down to assigning nodes to multiple racks. For most of the changes, we do that by: * Using `pytest.mark.prepare_3_racks_cluster` instead of `pytest.mark.prepare_3_nodes_cluster`. * Using an additional argument -- `auto_rack_dc` -- when calling `ManagerClient::servers_add()`. In some cases, we need to assign the racks manually, which may be less obvious, but in every such situation, the tests didn't rely on that assignment, so that doesn't affect them or what they verify.	2025-05-10 16:30:18 +02:00
Raphael S. Carvalho	434c2c4649	replica: Fix use-after-free with concurrent schema change and sstable set update When schema is changed, sstable set is updated according to the compaction strategy of the new schema (no changes to set are actually made, just the underlying set type is updated), but the problem is that it happens without a lock, causing a use-after-free when running concurrently to another set update. Example: 1) A: sstable set is being updated on compaction completion 2) B: schema change updates the set (it's non deferring, so it happens in one go) and frees the set used by A. 3) when A resumes, system will likely crash since the set is freed already. ASAN screams about it: SUMMARY: AddressSanitizer: heap-use-after-free sstables/sstable_set.cc ... Fix is about deferring update of the set on schema change to compaction, which is triggered after new schema is set. Only strategy state and backlog tracker are updated immediately, which is fine since strategy doesn't depend on any particular implementation of sstable set, since patch "sstables: Implement sstable_set_impl::all_sstable_runs()". Fixes #22040. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-05-06 10:06:55 -03:00
Lakshmipathi	42ed6a87bf	test: Test truncate during topology change Add a new node, during topology change issue truncate call and verify all nodes empty data after tablet migration. Fixes: https://github.com/scylladb/scylla-dtest/issues/5317 Signed-off-by: Lakshmipathi Ganapathi <lakshmipathi.ganapathi@scylladb.com> Closes scylladb/scylladb#22595	2025-04-16 09:10:22 +03:00
Raphael S. Carvalho	0f59deffaa	replica: Fix truncate and drop table after tablet migration happens When running those operations after a tablet replica is migrated away from a shard, an assert can fail resulting in a crash. Status quo (around the assert in truncate procedure): 1) Highest RP seen by table is saved in low_mark, and the current time in low_mark_at. 2) Then compaction is disabled in order to not mix data written before truncate, and data written later. 3) Then memtable is flushed in order for the data written before truncate to be available in sstables and then removed. 4) Now, current time is saved in truncated_at, which is supposedly the time of truncate to decide which sstables to remove. Note: truncated_at is likely above low_mark_at due to steps 2 and 3. The interesting part of the assert is: (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp) Note: RP in the assert above is the highest RP among all sstables generated before truncated_at. RP is retrieved by table::discard_sstables(). If truncated_at > low_mark_at, maybe newer data was written during steps 2 and 3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with RP > low_mark. So assert's 2nd condition is there to defend against the scenario above. truncated_at and low_mark_at uses millisecond granularity, so even if truncated_at == low_mark_at, data could have been written in steps 2 and 3 (during same MS window), failing the assert. This is fragile. Reproducer: To reproduce the problem, truncated_at must be > low_mark_at, which can easily happen with both drop table and truncate due to steps 2 and 3. If a shard has 2 or more tablets, the table's highest RP refer to just one tablet in that shard. If the tablet with the highest RP is migrated away, then the sstables in that shard will have lower RP than the recorded highest RP (it's a table wide state, which makes sense since CL is shared among tablets). 
So when either drop table or truncate runs, low_mark will be potentially bigger than highest RP retrieved from sstables. Proposed solution: The current assert is hacked to not fail if writes sneak in, during steps 2 and 3, but it's still fragile and seems not to serve its real purpose, since it's allowing for RP > low_mark. We should be able to say that low_mark >= RP, as a way of asserting we're not leaving data targeted by truncate behind (or that we're not removing the wrong data). But the problem is that we're saving low_mark in step 1, before preparation steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying all data written so far is targeted for removal. But as of today, low_mark refers to all data written up to step 1. So low_mark is now only one set before issuing flush, and also accounts for all potentially flushed data. Fixes #18059. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23560	2025-04-08 07:32:58 +03:00
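The failure described in the commit above can be illustrated with a minimal sketch. It is not the ScyllaDB code: plain integers stand in for RP values and timestamps, and the function names `old_assert_holds` and `proposed_assert_holds` are hypothetical. The first predicate is the one quoted in the commit message; the second is the simpler `low_mark >= RP` check that the message argues for.

```python
def old_assert_holds(truncated_at, low_mark_at, rp, low_mark):
    # Predicate quoted in the commit message:
    # (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)
    if truncated_at <= low_mark_at:
        return rp <= low_mark
    return low_mark <= rp


def proposed_assert_holds(rp, low_mark):
    # Check the commit message argues for: low_mark >= RP.
    return low_mark >= rp


# Illustrative values only (assumptions, not taken from a real run):
# low_mark was recorded from the tablet with the highest RP, that tablet
# was then migrated away, so the remaining sstables carry a lower RP.
low_mark = 100        # table-wide highest RP recorded at step 1
rp = 40               # highest RP among sstables left on this shard
low_mark_at = 1       # time recorded at step 1
truncated_at = 2      # time recorded at step 4, after steps 2 and 3

old_result = old_assert_holds(truncated_at, low_mark_at, rp, low_mark)
new_result = proposed_assert_holds(rp, low_mark)
```

With these values the old predicate takes its second branch and requires `low_mark <= rp`, which does not hold, while the `low_mark >= RP` form does.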
Benny Halevy	c62865df90	db/config: add tablets_mode_for_new_keyspaces option The new option deprecates the existing `enable_tablets` option. It will be extended in the next patch with a 3rd value: "enforced" which will enable tablets by default for new keyspaces but without the possibility to opt out using the `tablets = {'enabled': false}` keyspace schema option. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-03-24 14:54:45 +02:00
Raphael S. Carvalho	fedd838b9d	replica: Fix race of some operations like cleanup with snapshot There are two semaphores in table for synchronizing changes to sstable list: sstable_set_mutation_sem: used to serialize two concurrent operations updating the list, to prevent them from racing with each other. sstable_deletion_sem: A deletion guard, used to serialize deletion and iteration over the list, to prevent iteration from finding deleted files on disk. They're always taken in this order to avoid deadlocks: sstable_set_mutation_sem -> sstable_deletion_sem. Problem: A = tablet cleanup B = take_snapshot() 1) A acquires sstable_set_mutation_sem for updating list 2) A acquires sstable_deletion_sem, then deletes sstable before updating list 3) A releases sstable_deletion_sem, then yields 4) B acquires sstable_deletion_sem 5) B iterates through list and bumps into the sstable deleted in step 2 6) B fails since it cannot find the file on disk Initial reaction is to say that no procedure must delete sstable before updating the list, that's true. But we want an iteration, running concurrently to cleanup, to not find sstables being removed from the system. Otherwise, e.g. snapshot works with sstables of a tablet that was just cleaned up. That's achieved by serializing iteration with list update. Since sstable_deletion_sem is used within the scope of deletion only, it's useless for achieving this. Cleanup could acquire the deletion sem when preparing list updates, and then pass the "permit" to deletion function, but then sstable_deletion_sem would essentially become sstable_set_mutation_sem, which was created exactly to protect the list update. That being said, it makes sense to merge both semaphores. Also things become easier to reason about, and we don't have to worry about deadlocks anymore. The deletion goes through sstable_list_builder, which holds a permit throughout its lifetime, which guarantees that list updates and deletion are atomic to other concurrent operations. 
The interface becomes less error prone with that. It allowed us to find discard_sstables() was doing deletion without any permit, meaning another race could happen between truncate and snapshot. So we're fixing race of (truncate|cleanup) with take_snapshot, as far as we know. It's possible other unknown races are fixed as well. Fixes #23049. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#23117	2025-03-06 11:00:48 +02:00
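The idea of one permit covering both deletion and list update, as described in the commit above, can be sketched as follows. This is an illustration under assumptions, not the ScyllaDB implementation: it uses Python's `asyncio.Lock` in place of a Seastar semaphore, and the `Table` class, its fields, and its method bodies are hypothetical stand-ins.

```python
import asyncio


class Table:
    def __init__(self, sstables):
        self.sstables = list(sstables)    # in-memory sstable list
        self.on_disk = set(sstables)      # stand-in for files on disk
        self.list_lock = asyncio.Lock()   # single merged lock (assumption)

    async def cleanup(self, victim):
        # Deletion and list update happen under one permit, so other
        # operations see them as a single atomic step.
        async with self.list_lock:
            self.on_disk.discard(victim)  # delete the file first
            await asyncio.sleep(0)        # yield point, as in step 3
            self.sstables.remove(victim)  # then update the list

    async def take_snapshot(self):
        # Iteration takes the same lock, so it cannot observe a list
        # entry whose file has already been deleted.
        async with self.list_lock:
            missing = [s for s in self.sstables if s not in self.on_disk]
            if missing:
                raise FileNotFoundError(missing)
            return list(self.sstables)


async def demo():
    table = Table(["a", "b", "c"])
    _, snapshot = await asyncio.gather(
        table.cleanup("b"), table.take_snapshot()
    )
    return snapshot


result = asyncio.run(demo())
```

Even though cleanup yields between deleting the file and updating the list, the snapshot waits for the lock and only sees the list after both steps have finished.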
Artsiom Mishuta	d1198f8318	test.py: rename topology_custom folder to cluster rename topology_custom folder to cluster as it contains not only topology test cases	2025-03-04 10:32:44 +01:00

46 Commits