Modify test_batchlog_replay_failure_during_repair to also check
that there isn't data resurrection if flushing hints falls within
the repair cache timeout.
(cherry picked from commit e3dcb7e827)
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources acquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk of
deadlock, as is the case for repair. Here,
release_base_resources() acts as a manual eviction mechanism to prevent
permits from blocking each other from admission.
Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanisms: it marks its permits as
inactive and later also uses release_base_resources(). This practice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.
Improve release_base_resources():
* make _base_resources const
* move signal call into the if (_base_resources_consumed()) { }
* use reader_permit::impl::signal() instead of
reader_concurrency_semaphore::signal()
* all places where base resources are released now call
release_base_resources()
A reproducer unit test is added, which fails before and passes after the
fix.
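A minimal sketch of the idempotent release the patch establishes (simplified, with assumed member names; the real code signals through reader_permit::impl::signal() rather than reader_concurrency_semaphore::signal()):
```cpp
// Simplified stand-in for the permit/semaphore accounting. After the fix,
// every path that gives the base resources back (manual release and
// eviction alike) goes through release_base_resources(), which only
// signals the semaphore if the resources are still held.
struct permit_sketch {
    const int _base_resources;            // const after the fix
    bool _base_resources_consumed = true; // set at admission
    int& _semaphore_available;            // what the semaphore tracks

    void release_base_resources() {
        if (_base_resources_consumed) {   // signal moved inside this if
            _semaphore_available += _base_resources;
            _base_resources_consumed = false;
        }
        // a second call (e.g. on an already evicted permit) is now a no-op,
        // so the base resources can no longer be released twice
    }
};
```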
Fixes: #28083
Closes scylladb/scylladb#28155
(cherry picked from commit b7bc48e7b7)
Closes scylladb/scylladb#28238
It is observed that:
repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update
system.repair_history table of node d27de212-6f32-4649-ad76-a9ef1165fdcb:
seastar::rpc::remote_verb_error
(repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum
token,maximum token) is not in the format of (start, end])
This is because repair checks that the end of the range to be repaired
is inclusive. When small_table_optimization is enabled for regular
repair, a (minimum token, maximum token) range will be used.
To fix, we can relax the (start, end] check for the min/max range.
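A rough sketch of the relaxed check, with simplified stand-in types (the real code operates on ScyllaDB token ranges):
```cpp
// The full (minimum, maximum) range produced by small_table_optimization is
// exempt from the (start, end] requirement.
struct token { bool is_minimum = false; bool is_maximum = false; };
struct token_range {
    token start, end;
    bool start_inclusive = false, end_inclusive = true;
};

bool is_valid_repair_range(const token_range& r) {
    if (r.start.is_minimum && r.end.is_maximum) {
        return true; // whole-ring range: allow despite not being (start, end]
    }
    return !r.start_inclusive && r.end_inclusive; // (start, end]
}
```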
Fixes #27220
Backport to all active branches.
(cherry picked from commit e97a504)
Parent PR: #27357
Closes scylladb/scylladb#27458
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:
```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```
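The sizing mistake in isolation, as a hedged sketch with hypothetical helper names (in the real code the buffer is uninitialized rather than zero-filled):
```cpp
#include <algorithm>
#include <string>
#include <vector>

// Buggy: sizeof(s.size()) is sizeof(std::size_t) == 8 on 64-bit targets,
// independent of the string's actual length.
std::vector<char> make_key_buggy(const std::string& s) {
    std::vector<char> buf(sizeof(s.size())); // always 8 bytes
    std::copy_n(s.begin(), std::min(buf.size(), s.size()), buf.begin());
    return buf; // in the real code the trailing bytes stay uninitialized
}

// Fixed: size the buffer by the string length itself.
std::vector<char> make_key_fixed(const std::string& s) {
    return {s.begin(), s.end()};
}
```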
Fixes #28195.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closes scylladb/scylladb#28197
(cherry picked from commit 8aca7b0eb9)
Closes scylladb/scylladb#28207
Note: only the fix for replace with different IP has been backported to
2025.1. The IP-based gossiper in 2025.1 has made backporting the fix
for replace with the same IP too difficult.
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting an internal error in all operations that
require the IP of the replacing node (like client requests or REST API
requests coming from nodetool).
We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.
For replace with a different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine the IP of the replacing node on restart.
For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that the IP of a topology member is
always present in the address map or that a topology member is always
present in the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.
We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.
We take a different approach. We recover the IP of the replacing node on
restart, based on the state of the topology state machine and
`system.peers`, just after loading `system.peers`.
We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth complicating the
fix (e.g. by looking at the persistent topology state instead of the
in-memory state machine).
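As a very rough illustration of that recovery step (all names hypothetical; the real logic consults the topology state machine, not a plain map):
```cpp
#include <map>
#include <optional>
#include <string>

// After loading system.peers, derive the replacing node's IP. For replace
// with a different IP the mapping was persisted; for replace with the same
// IP, reuse the IP persisted for the node being replaced.
std::optional<std::string> recover_replacing_node_ip(
        const std::map<std::string, std::string>& peers, // host ID -> IP
        const std::string& replaced_id,
        const std::string& replacing_id) {
    if (auto it = peers.find(replacing_id); it != peers.end()) {
        return it->second; // persisted like a bootstrapping node
    }
    if (auto it = peers.find(replaced_id); it != peers.end()) {
        return it->second; // same-IP replace: the old row keeps the IP
    }
    return std::nullopt;
}
```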
Fixes #28057
Backport this PR to all branches as it fixes a problematic bug.
- (cherry picked from commit fc4c2df2ce)
- (cherry picked from commit 4526dd93b1)
- (cherry picked from commit 749b0278e5)
- (cherry picked from commit 0fed9f94f8)
Manually cherry-picked:
- 90b5b2c5f5
- 92b165b8c0
Parent PR: #27435
Closes scylladb/scylladb#28096
* github.com:scylladb/scylladb:
test: introduce test_full_shutdown_during_replace
utils: error_injection: allow aborting wait_for_message
raft topology: preserve IP -> ID mapping of a replacing node on restart
pylib/rest_client.py: encode injection name
utils/error_injection: allow to abort `injection_handler::wait_for_message()`
Consider this:
1) merge finishes, wakes up fiber to merge compaction groups
2) drop table happens, which in turn invokes truncate underneath
3) merge fiber stops old groups
4) truncate disables compaction on all groups, but the ones stopped
5) truncate performs a check that compaction has been disabled on
all groups, including the ones stopped
6) the check fails because groups being stopped didn't have compaction
explicitly disabled on them
To fix it, the check in step 6 will ignore groups that have been
stopped, since those are not eligible for having compaction explicitly
disabled on them. The compaction check exists so that ongoing compaction
will not propagate data being truncated, but here the truncation happens
in the context of a table drop, which doesn't leave anything behind.
Also, a stopped group is effectively equivalent to one with compaction
disabled, since the procedure that stops a group halts all ongoing
compaction and eventually removes its state from the compaction manager.
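A sketch of the relaxed check, with assumed field names:
```cpp
#include <vector>

// Stopped groups are skipped, since stopping a group already halts its
// ongoing compaction and removes its state from the compaction manager.
struct compaction_group {
    bool stopped = false;
    bool compaction_disabled = false;
};

bool truncate_precondition_holds(const std::vector<compaction_group>& groups) {
    for (const auto& g : groups) {
        if (g.stopped) {
            continue; // equivalent to compaction being disabled
        }
        if (!g.compaction_disabled) {
            return false; // a live group still has compaction enabled
        }
    }
    return true;
}
```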
Fixes #25551.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#25563
(cherry picked from commit 149f9d8448)
Closes scylladb/scylladb#25630
Fixes #27992
When doing a commit log oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how the
bookkeeping works here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).
Thus, if, after we grab the semaphore and execution returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.
The same problem applies later when we try to sanity-check the return of our permits.
The fix is trivial: just accept less-than-zero values, and take the same possible
less-than-zero value into account in the exit check (when returning units).
Added a whitebox unit test (using a special callback interface for sync) that provokes
the race condition explicitly (and reliably).
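A sketch of the relaxed sanity checks, with assumed names and a plain integer standing in for the semaphore:
```cpp
#include <cassert>
#include <cstdint>

// Some paths consume from _request_controller without waiting (waiting could
// deadlock), so after an oversized allocation grabs the full capacity, the
// available count can legitimately drop below zero.
struct request_controller_sketch {
    int64_t available;
    const int64_t max_capacity;

    void grab_all_for_oversized_write() {
        available -= max_capacity;
        assert(available <= 0); // before the fix: assert(available == 0)
    }
    void return_units(int64_t n) {
        available += n;
        assert(available <= max_capacity); // exit check tolerates the debt
    }
};
```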
Closes scylladb/scylladb#27998
(cherry picked from commit a7cdb602e1)
Closes scylladb/scylladb#28095
`test_schema_versioning_with_recovery` is currently flaky. It performs
a write with CL=ALL and then checks if the schema version is the same on
all nodes by calling `verify_table_versions_synced`. All nodes are expected
to sync their schema before handling the replica write. The node in
RECOVERY mode should do it through a schema pull, and other nodes should do
it through a group 0 read barrier.
The problem is in `verify_local_schema_versions_synced`, which compares the
schema versions in `system.local`. The node in RECOVERY mode updates the
schema version in `system.local` only after it acknowledges the replica
write as completed. Hence, the check can fail.
We fix the problem by making the function wait until the schema versions
match.
Note that RECOVERY mode is about to be retired together with the whole
gossip-based topology in 2026.2. So, this test is about to be deleted.
However, we still want to fix it, so that it doesn't bother us in older
branches.
Fixes #23803
Closes scylladb/scylladb#28114
(cherry picked from commit 6b5923c64e)
Closes scylladb/scylladb#28172
Sometimes it's convenient to use slashes in injection names,
for example my_component/my_method/my_condition. Without quote()
we get a 'handler not found' error from Scylla.
(cherry picked from commit 92b165b8c0)
Currently, database::truncate_table_on_all_shards calls table::can_flush only on the coordinator shard,
and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing memtables with dirty data rather than flushing them.
This change fixes that by making flush safe to call even if the memtable list is empty, and calling it on every shard that can flush (i.e. where seal_immediate_fn is engaged).
Also, change database_test::do_with_some_data to use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`.
Fixes #27639
* The issue has existed since forever and might cause data loss due to wrongly clearing the memtable, so it needs backporting to all live versions
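A sketch of the shard-awareness part of the fix, using a placeholder type rather than the real replica::table:
```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/sharded.hh>

// The flush decision is taken on every shard, not only on the coordinator.
struct table_shard {
    bool seal_fn_engaged = true;             // i.e. seal_immediate_fn is set
    seastar::future<> flush() { co_return; } // safe even with empty memtables
};

seastar::future<> truncate_flush_all_shards(seastar::sharded<table_shard>& t) {
    co_await t.invoke_on_all([] (table_shard& local) -> seastar::future<> {
        if (local.seal_fn_engaged) {
            co_await local.flush();
        }
    });
}
```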
- (cherry picked from commit ec4069246d)
- (cherry picked from commit 5be6b80936)
- (cherry picked from commit 0342a24ee0)
- (cherry picked from commit 02ee341a03)
- (cherry picked from commit 2a803d2261)
- (cherry picked from commit 93b827c185)
- (cherry picked from commit ebd667a8e0)
Parent PR: #27643
Closes scylladb/scylladb#28067
* https://github.com/scylladb/scylladb:
test: database_test: do_with_some_data: randomize keys
database: truncate_table_on_all_shards: drop outdated TODO comment
database: truncate_table_on_all_shards: consider can_flush on all shards
memtable_list: unify can_flush and may_flush
test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
replica: table, storage_group, compaction_group: add needs_flush
test: database_test: do_with_some_data_in_thread: accept void callback function
The driver must see server_c before we stop server_a, otherwise
there will be no live host in the pool when we attempt to drop
the keyspace:
```
    @pytest.mark.asyncio
    async def test_not_enough_token_owners(manager: ManagerClient):
        """
        Test that:
        - the first node in the cluster cannot be a zero-token node
        - removenode and decommission of the only token owner fail in the presence of zero-token nodes
        - removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token
          owners would fall below the RF of some keyspace using tablets
        """
        logging.info('Trying to add a zero-token server as the first server in the cluster')
        await manager.server_add(config={'join_ring': False},
                                 property_file={"dc": "dc1", "rack": "rz"},
                                 expected_error='Cannot start the first node in the cluster as zero-token')
        logging.info('Adding the first server')
        server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"})
        logging.info('Adding two zero-token servers')
        # The second server is needed only to preserve the Raft majority.
        server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0]
        logging.info(f'Trying to decommission the only token owner {server_a}')
        await manager.decommission_node(server_a.server_id,
                                        expected_error='Cannot decommission the last token-owning node in the cluster')
        logging.info(f'Stopping {server_a}')
        await manager.server_stop_gracefully(server_a.server_id)
        logging.info(f'Trying to remove the only token owner {server_a} by {server_b}')
        await manager.remove_node(server_b.server_id, server_a.server_id,
                                  expected_error='cannot be removed because it is the last token-owning node in the cluster')
        logging.info(f'Starting {server_a}')
        await manager.server_start(server_a.server_id)
        logging.info('Adding a normal server')
        await manager.server_add(property_file={"dc": "dc1", "rack": "r2"})
        cql = manager.get_cql()
        await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60)
>       async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name:

test/cluster/test_not_enough_token_owners.py:57:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib64/python3.14/contextlib.py:221: in __aexit__
    await anext(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830>
opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }"
host = None

    @asynccontextmanager
    async def new_test_keyspace(manager: ManagerClient, opts, host=None):
        """
        A utility function for creating a new temporary keyspace with given
        options. It can be used in a "async with", as:
            async with new_test_keyspace(ManagerClient, '...') as keyspace:
        """
        keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host)
        try:
            yield keyspace
        except:
            logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation")
            raise
        else:
>           await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host)
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')})

test/cluster/util.py:544: NoHostAvailable
```
Fixes #28011
Closes scylladb/scylladb#28040
(cherry picked from commit 34df158605)
Closes scylladb/scylladb#28063
The semaphore has detection and protection against regular resource
leaks, where some resources go unaccounted for and are not released by
the time the semaphore is destroyed. There is no detection or protection
against negative leaks, where resources are "made up" out of thin air.
This kind of leak looks benign at first sight: a few extra resources
won't hurt anyone as long as the amount is small. But it turns out that
even a single extra count resource can defeat a very important
anti-deadlock protection in can_admit_read(): the special case which
admits a new permit regardless of memory resources, when all original
count resources are available. This check uses ==, so if resources >
original, the protection is defeated indefinitely. Instead of just
changing == to >=, we add detection of such negative leaks to signal(),
via on_internal_error_noexcept().
At this time I still don't know how this negative leak happens (the code
doesn't confess); with this detection, hopefully we'll get a clue from
tests or the field. Note that on_internal_error_noexcept() will not
generate a coredump, unless ScyllaDB is explicitly configured to do so.
In production, it will just generate an error log with a backtrace.
The detection also clamps the _resources to _initial_resources, to
prevent any damage from the negative leak.
I just noticed that there is no unit test for the deadlock protection
described above, so one is added in this PR, even if it is only loosely
related to the rest of the patch.
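A sketch of the detection, with assumed member names and a plain struct standing in for the semaphore:
```cpp
#include <cstdio>

// signal() returns units to the semaphore; a negative leak means more units
// come back than were ever handed out, which would defeat the == check in
// can_admit_read().
struct semaphore_sketch {
    struct resources { int count; long memory; };
    const resources _initial_resources;
    resources _resources;

    void signal(resources r) {
        _resources.count += r.count;
        _resources.memory += r.memory;
        if (_resources.count > _initial_resources.count ||
            _resources.memory > _initial_resources.memory) {
            // stand-in for on_internal_error_noexcept(): error log with a
            // backtrace; no coredump unless explicitly configured
            std::fprintf(stderr, "negative resource leak detected\n");
            _resources = _initial_resources; // clamp to contain the damage
        }
    }
};
```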
Fixes: SCYLLADB-163
Closes scylladb/scylladb#27764
(cherry picked from commit e4da0afb8d)
Closes scylladb/scylladb#28000
With randomized keys, and since we're inserting only 2 keys,
it is possible that they would end up owned only by a single shard,
reproducing #27639 in snapshot_list_contains_dropped_tables.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ebd667a8e0)
Test that table::flush waits on outstanding flushes, even if the active memtable is empty
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0342a24ee0)
Many test cases already assume `func` is being called in a seastar
thread, and although the function they pass returns a (ready) future,
it serves no purpose other than to conform to the interface.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ec4069246d)
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
that is reverted
During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.
This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.
Fixes https://github.com/scylladb/scylladb/issues/26512
It's a pre-existing issue. Backport is required to all recent 2025.x versions.
- (cherry picked from commit 669286b1d6)
- (cherry picked from commit 67f1c6d36c)
- (cherry picked from commit 6163fedd2e)
Parent PR: #27413
Closes scylladb/scylladb#27425
* github.com:scylladb/scylladb:
topology_coordinator: Fix the indentation for the cleanup_target case
topology_coordinator: Add barrier to cleanup_target
test_node_failure_during_tablet_migration: Increase RF from 2 to 3
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if the batchlog isn't cleared, the repair
never learns about it and still updates the repair_time. If the GC mode is set
to "repair", this means that tombstones written before the repair_time (minus
propagation_delay) can be GC'd while not all batches were replayed.
Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
- batchlog replay fails;
- repair_time is updated;
- propagation_delay seconds passes and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!
Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair, the repair runs, but at the end the exception is passed
to the topology coordinator. Thanks to that, the repair_time isn't updated.
The repair request isn't removed either, so the repair will need
to rerun.
Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.
Fixes: https://github.com/scylladb/scylladb/issues/24415
Data resurrection fix; needs backport to all versions
- (cherry picked from commit 502b03dbc6)
- (cherry picked from commit 904183734f)
- (cherry picked from commit 7f20b66eff)
- (cherry picked from commit e1b2180092)
- (cherry picked from commit d436233209)
- (cherry picked from commit 1935268a87)
- (cherry picked from commit 6fc43f27d0)
Parent PR: #26319
Closes scylladb/scylladb#26752
* github.com:scylladb/scylladb:
repair: throw if flush failed in get_flush_time
db: fix indentation
test: add reproducer for data resurrection
repair: fail tablet repair if any batch wasn't sent successfully
db/batchlog_manager: fix making decision to skip batch replay
db: repair: throw if replay fails
db/batchlog_manager: delete batch with incorrect or unknown version
db/batchlog_manager: coroutinize replay_all_failed_batches
Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants.
The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value.
By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test.
The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry.
Fixes: scylladb/scylladb#27307
Backporting to active branches - this is a test-only fix (low risk) for a flaky test that exists in older branches (thus affects the CI of active branches).
- (cherry picked from commit 3af5183633)
- (cherry picked from commit 4ba3e90f33)
Parent PR: #28010
Closes scylladb/scylladb#28036
* https://github.com/scylladb/scylladb:
test/raft: use valid sentinel in liveness check to prevent digest errors
test/raft: improve debugging in randomized_nemesis_test
test/raft: improve reporting in the randomized_nemesis_test digest functions
Replace -1 with 0 for the liveness check operation to avoid triggering
digest validation failures. This prevents rare fatal errors when the
cluster is recovering and ensures the test does not violate append_seq
invariants.
The value -1 was causing invalid digest results in the append_seq
structure, leading to assertion failures. This could happen when the
sentinel value was the first (or only) element being appended, resulting
in a digest that did not match the expected value.
By using 0 instead, we ensure that the digest calculations remain valid
and consistent with the expected behavior of the test.
The specific value of the sentinel is not important, as long as it is
a valid elem_t that does not violate the invariants of the append_seq
structure. In particular, the sentinel value is typically used only
when no valid result is received from any server in the current loop
iteration, in which case the loop will retry.
Fixes: scylladb/scylladb#27307
(cherry picked from commit 4ba3e90f33)
Move the post-condition check before the assertion to ensure it is
always executed first. Before, the wrong value could be passed to the
digest_remove assertion, making the pre-check trigger there instead of
the post-check as expected.
Also, add a check in the append_seq constructor to ensure that the
digest value is valid when creating an append_seq object.
(cherry picked from commit 3af5183633)
The Boost ASSERTs in the digest functions of the randomized_nemesis_test
were not working well inside the state machine digest functions, leading
to unhelpful boost::execution_exception errors that terminated the apply
fiber without providing any useful information.
They are replaced by explicit checks with on_fatal_internal_error calls
that provide more context about the failure. Also added validation of the
digest value after appending or removing an element, which makes it
possible to determine which operation produced the wrong value.
This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282,
but adds improved error reporting.
Refs: scylladb/scylladb#27307
Refs: scylladb/scylladb#17030
(cherry picked from commit d60b908a8e)
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled, TRUNCATE is a global
topology operation which needs to serialize with bootstrap.
When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediately if the TRUNCATE is already being
processed.
In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can end up
validating the result of a truncate which was started after bootstrap
was completed.
Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received during TRUNCATE being processed which
fails the test
This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.
Fixes: #26347
Closes scylladb/scylladb#27245
(cherry picked from commit d883ff2317)
Closes scylladb/scylladb#27503
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.
Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.
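A sketch of the simplified wait (surrounding context assumed; seastar::sleep_abortable() owns the abort wiring that was previously done by hand):
```cpp
#include <chrono>
#include <seastar/core/abort_source.hh>
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>

// Instead of a manually managed promise hooked to an abort_source
// subscription, whose lifetime can end during coroutine unwinding and break
// the promise, let sleep_abortable() handle both the timer and the abort.
seastar::future<> ping_delay(std::chrono::milliseconds interval,
                             seastar::abort_source& as) {
    try {
        co_await seastar::sleep_abortable(interval, as);
    } catch (const seastar::sleep_aborted&) {
        // aborted: we're shutting down, just return
    }
}
```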
Fixes: scylladb/scylladb#27136
Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.
Closes scylladb/scylladb#27737
(cherry picked from commit 2a75b1374e)
Closes scylladb/scylladb#27778
Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.
If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause the data
resurrection.
(cherry picked from commit 1935268a87)
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
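A sketch of that contract, with assumed signatures and a stub replay body:
```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <stdexcept>

// The replay reports whether every eligible batch was sent; the repair-side
// handler refuses to proceed otherwise, so repair_time is not updated.
seastar::future<bool> replay_all_failed_batches() {
    co_return true; // stub: real code replays batches and tracks failures
}

seastar::future<> repair_flush_hints_batchlog_handler() {
    if (!co_await replay_all_failed_batches()) {
        throw std::runtime_error(
            "batchlog replay incomplete; repair_time not updated");
    }
}
```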
(cherry picked from commit 7f20b66eff)
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
that is reverted
During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.
This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.
Fixes https://github.com/scylladb/scylladb/issues/26512
The patch prepares the test for additional write workload to be
executed in parallel with node failures. With the original RF=2,
QUORUM is also 2, which causes writes to fail during node outage.
To address it, the third rack with a single node is added and the
replication factor is increased to 3.
Fixes #24346
When reading, we check for each entry and each chunk whether advancing
will hit EOF of the segment. However, iff the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ends
at _exactly_ the segment size (preset size, typically 32MB), we did not
check the position, and instead complained about not being able to read.
This has literally _never_ happened in an actual commitlog (that was
replayed, at least), but has apparently happened more and more in hints
replay.
The fix is simple: just check the file position against the size when
advancing said position, i.e. when reading (skipping already does).
v2:
* Added unit test
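A sketch of the added position check:
```cpp
#include <cstdint>
#include <stdexcept>

// Advancing the read position to exactly the segment size is a clean EOF
// (the last entry exactly filled the last chunk, which ended exactly at the
// preset segment size), not a failed read.
void advance_read_position(uint64_t& pos, uint64_t n, uint64_t segment_size) {
    if (pos + n > segment_size) {
        throw std::runtime_error("read past end of segment");
    }
    pos += n; // pos == segment_size is now accepted, as skipping already did
}
```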
Closes scylladb/scylladb#27236
(cherry picked from commit 59c87025d1)
Closes scylladb/scylladb#27336
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.
Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.
This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.
Fixes scylladb/scylladb#27304
Closes scylladb/scylladb#27312
(cherry picked from commit 97b7c03709)
Plan-making is invoked independently for different DCs (and in the
future, racks) and then plans are merged. It could be that the same
tablets are selected for migration in different DCs. Only one
migration will prevail and be committed to group0, so it's not a
correctness problem. Next cycle will recognize that the tablet is in
transition and will not be selected by plan-maker. But it makes
plan-making less efficient.
It may also surprise consumers of the plan, like we saw in #25912.
So we should make plan-maker be aware of already scheduled transitions
and not consider those tablets as candidates.
Fixes #26038
Closes scylladb/scylladb#26048
(cherry picked from commit 981592bca5)
Sometimes file::list_directory() returns entries without the type set. In
that case the lister calls file_type() on the entry name to get it. If
the call returns a disengaged type, the code assumes that some error
occurred and resolves into an exception.
That's not correct. The file_type() method returns a disengaged type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed between readdir and stat. In
that case it's not "some error happened"; the entry should just be
skipped. If "some error happened", then file_type() would resolve into an
exceptional future on its own.
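A sketch of the resulting caller-side logic (hypothetical helper; the real lister does this inline):
```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/seastar.hh>
#include <seastar/core/sstring.hh>

// A disengaged result from seastar::file_type() means the entry vanished
// between readdir and stat (ENOENT), so the caller should skip the entry;
// genuine errors already surface as an exceptional future from file_type().
seastar::future<bool> entry_still_exists(seastar::sstring name) {
    auto type = co_await seastar::file_type(name);
    co_return type.has_value(); // false: removed concurrently, just skip it
}
```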
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26595
(cherry picked from commit d9bfbeda9a)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26754
Load-and-stream is broken when running concurrently with the finalization step of tablet split.
Consider this:
1. split starts
2. split finalization executes barrier and succeed
3. load-and-stream runs now, starts writing sstable (pre-split)
4. split finalization publishes changes to tablet metadata
5. load-and-stream finishes writing sstable
6. sstable cannot be loaded since it spans two tablets
Two possible fixes (maybe both):
1. load-and-stream awaits for the topology to quiesce
2. perform split compaction on an sstable that spans both sibling tablets
This patch implements fix 1. By awaiting for the topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance the coordinator is handling some topology operation like
split finalization.
Fixes https://github.com/scylladb/scylladb/issues/26455.
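A very rough sketch of the waiting idea, with placeholder names rather than the real ScyllaDB API:
```cpp
#include <seastar/core/condition-variable.hh>
#include <seastar/core/coroutine.hh>

// Load-and-stream first waits for the topology to quiesce, i.e. for the
// coordinator to have no operation such as split finalization in flight.
struct topology_sketch {
    bool busy = false;
    seastar::condition_variable changed;

    seastar::future<> await_quiesced() {
        co_await changed.wait([this] { return !busy; });
    }
};
```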
(cherry picked from commit 3abc66da5a)
(cherry picked from commit 4654cdc6fd)
Parent PR: https://github.com/scylladb/scylladb/pull/26456
Closes scylladb/scylladb#27126
* https://github.com/scylladb/scylladb:
sstables_loader: Don't bypass synchronization with busy topology
test: Add reproducer for l-a-s and split synchronization issue
sstables_loader: Synchronize tablet split and load-and-stream
sstable_set: incremental_reader_selector: be more careful when filtering out already engaged sstables
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
To do the check, storage_service::raft_topology_change_enabled() was
used, but the topology kind is only available/set on shard 0, so it
caused the synchronization to be bypassed when load-and-stream ran on
any shard other than 0.
The reason the reproducer didn't catch it is that it was restricted
to a single cpu. It will now run with multiple cpus and catch the
problem observed.
Fixes #22707
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#26730
(cherry picked from commit 7f34366b9d)
(cherry picked from commit 4c466ace4f)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This series allows an operator to reset the 'cleanup needed' flag if they have already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to running cleanup on one node only (and resetting the 'cleanup needed' flag at the end), but the new '--global' option allows running cleanup on all nodes that need it simultaneously.
Fixes https://github.com/scylladb/scylladb/issues/26866
Backport to all supported versions, since the automatic cleanup behaviour as it is now may create load during cluster resizing that the operator does not expect.
- (cherry picked from commit e872f9cb4e)
- (cherry picked from commit 0f0ab11311)
Parent PR: #26868
Closes scylladb/scylladb#27089
* github.com:scylladb/scylladb:
cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
cleanup: Add RESTful API to allow reset cleanup needed flag
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.
scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.
Fixes #27055
Closes scylladb/scylladb#27075
(cherry picked from commit e35ba974ce)
Closes scylladb/scylladb#27107
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).
Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla will fail an
assertion and crash.
A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
Some accesses to `auth::service` are not affected and we do not modify
those.
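A sketch of the added branches, as a hypothetical helper:
```cpp
#include <optional>
#include <string>

// With no user, an anonymous role, or no registered auth::service (the
// maintenance-socket path), fall back to the default service level instead
// of querying auth::service.
std::string effective_service_level(const std::optional<std::string>& role,
                                    bool auth_service_registered) {
    if (!role || *role == "anonymous" || !auth_service_registered) {
        return "default"; // never touch auth::service here
    }
    // normal path: look up the service level attached to the role
    return "default"; // placeholder for the real lookup
}
```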
Fixes scylladb/scylladb#26816
Backport: yes. This is a fix of a regression.
- (cherry picked from commit c0f7622d12)
- (cherry picked from commit 222eab45f8)
- (cherry picked from commit 394207fd69)
- (cherry picked from commit b357c8278f)
Parent PR: #26856
Closes scylladb/scylladb#27029
* github.com:scylladb/scylladb:
test/cluster/test_maintenance_mode.py: Wait for initialization
test: Disable maintenance mode correctly in test_maintenance_mode.py
test: Fix keyspace in test_maintenance_mode.py
service/qos: Do not crash Scylla if auth_integration absent
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.
If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.
While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.
We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
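A sketch of the timestamp choice; the exact margin below is an assumption, not the value used by the patch:
```cpp
#include <chrono>
#include <cstdint>

// Date the CDC log column's drop a few seconds into the future, so that
// concurrent writes stamped with "now" stay below the drop timestamp.
int64_t cdc_column_drop_timestamp() {
    using namespace std::chrono;
    auto now_us = duration_cast<microseconds>(
            system_clock::now().time_since_epoch()).count();
    constexpr int64_t margin_us = 5'000'000; // assumed ~5 s safety margin
    return now_us + margin_us;
}
```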
Fixes https://github.com/scylladb/scylladb/issues/26340
the issue affects all previous releases, backport to improve stability
- (cherry picked from commit eefae4cc4e)
- (cherry picked from commit 48298e38ab)
- (cherry picked from commit 039323d889)
- (cherry picked from commit e85051068d)
Parent PR: #26533
Closes scylladb/scylladb#27025
* github.com:scylladb/scylladb:
test: test concurrent writes with column drop with cdc preimage
cdc: check if recreating a column too soon
cdc: set column drop timestamp in the future
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset the "cleanup needed" flag afterwards), and it adds a "nodetool
cluster cleanup" command that runs the cleanup on all dirty nodes in the
cluster.
(cherry picked from commit 0f0ab11311)
This patch fixes 2 issues in one go:
First, currently sstables::load clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.
Second, printing token values would generate invalid json,
as they are currently printed as binary bytes; they
should be printed simply as numbers, as we do elsewhere,
for example for the first and last keys.
Fixes #26982
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#26991
(cherry picked from commit f9ce98384a)
Closes scylladb/scylladb#27030
Add a test that writes to a table concurrently with dropping a column,
where the table has CDC enabled with preimage.
The test reproduces issue #26340, where this results in a malformed
sstable.
(cherry picked from commit e85051068d)
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.
To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.
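A sketch of the guard, with assumed names:
```cpp
#include <cstdint>
#include <stdexcept>

// Recreating a CDC log column before its future-dated drop timestamp has
// passed must fail with a clear error instead of corrupting history.
void check_cdc_column_recreation(int64_t creation_ts, int64_t drop_ts) {
    if (creation_ts <= drop_ts) {
        throw std::invalid_argument(
            "cannot recreate CDC log column before its previous drop "
            "timestamp has passed; retry later");
    }
}
```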
(cherry picked from commit 039323d889)
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:
```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```
To avoid that, we should wait until initialization is complete.
(cherry picked from commit b357c8278f)
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.
(cherry picked from commit 394207fd69)