scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-27 03:45:11 +00:00

Author	SHA1	Message	Date
Patryk Jędrzejczak	bd0f876fdb	test: introduce test_full_shutdown_during_replace (cherry picked from commit `749b0278e5`)	2026-01-13 12:07:18 +01:00
Patryk Jędrzejczak	e2d97bd1f4	utils: error_injection: allow aborting wait_for_message The test added in the following commit utilizes it. (cherry picked from commit `4526dd93b1`)	2026-01-13 12:07:18 +01:00
Patryk Jędrzejczak	ef95e1efeb	raft topology: preserve IP -> ID mapping of a replacing node on restart We currently do it only for a bootstrapping node, which is a bug. The missing IP can cause an internal error, for example, in the following scenario: - replace fails during streaming, - all live nodes are shut down before the rollback of replace completes, - all live nodes are restarted, - live nodes start hitting internal error in all operations that require IP of the replacing node (like client requests or REST API requests coming from nodetool). We fix the bug here, but we do it separately for replace with different IP and replace with the same IP. For replace with different IP, we persist the IP -> host ID mapping in `system.peers` just like for bootstrap. That's necessary, since there is no other way to determine IP of the replacing node on restart. For replace with the same IP, we can't do the same. This would require deleting the row corresponding to the node being replaced from `system.peers`. That's fine in theory, as that node is permanently banned, so its IP shouldn't be needed. Unfortunately, we have many places in the code where we assume that IP of a topology member is always present in the address map or that a topology member is always present in the gossiper endpoint set. Examples of such places: - nodetool operations, - REST API endpoints, - `db::hints::manager::store_hint`, - `group0_voter_handler::update_nodes`. We could fix all those places and verify that drivers work properly when they see a node in the token metadata, but not in `system.peers`. However, that would be too risky to backport. We take a different approach. We recover IP of the replacing node on restart based on the state of the topology state machine and `system.peers` just after loading `system.peers`. We rely on the fact that group 0 is set up at this point. The only case where this assumption is incorrect is a restart in the Raft-based recovery procedure. 
However, hitting this problem then seems improbable, and even if it happens, we can restart the node again after ensuring that no client and REST API requests come before replace is rolled back on the new topology coordinator. Hence, it's not worth complicating the fix (by e.g. looking at the persistent topology state instead of the in-memory state machine). (cherry picked from commit `fc4c2df2ce`)	2026-01-13 12:07:16 +01:00
Tomasz Grabiec	5cb3900d90	test: cluster: Fix NoHostAvailable error in test_not_enough_token_owners The driver must see server_c before we stop server_a, otherwise there will be no live host in the pool when we attempt to drop the keyspace: ``` @pytest.mark.asyncio async def test_not_enough_token_owners(manager: ManagerClient): """ Test that: - the first node in the cluster cannot be a zero-token node - removenode and decommission of the only token owner fail in the presence of zero-token nodes - removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token owners would fall below the RF of some keyspace using tablets """ logging.info('Trying to add a zero-token server as the first server in the cluster') await manager.server_add(config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}, expected_error='Cannot start the first node in the cluster as zero-token') logging.info('Adding the first server') server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"}) logging.info('Adding two zero-token servers') # The second server is needed only to preserve the Raft majority. 
server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0] logging.info(f'Trying to decommission the only token owner {server_a}') await manager.decommission_node(server_a.server_id, expected_error='Cannot decommission the last token-owning node in the cluster') logging.info(f'Stopping {server_a}') await manager.server_stop_gracefully(server_a.server_id) logging.info(f'Trying to remove the only token owner {server_a} by {server_b}') await manager.remove_node(server_b.server_id, server_a.server_id, expected_error='cannot be removed because it is the last token-owning node in the cluster') logging.info(f'Starting {server_a}') await manager.server_start(server_a.server_id) logging.info('Adding a normal server') await manager.server_add(property_file={"dc": "dc1", "rack": "r2"}) cql = manager.get_cql() await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60) > async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name: test/cluster/test_not_enough_token_owners.py:57: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ /usr/lib64/python3.14/contextlib.py:221: in __aexit__ await anext(self.gen) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830> opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }" host = None @asynccontextmanager async def new_test_keyspace(manager: ManagerClient, opts, host=None): """ A utility function for creating a new temporary keyspace with given options. 
It can be used in a "async with", as: async with new_test_keyspace(ManagerClient, '...') as keyspace: """ keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host) try: yield keyspace except: logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation") raise else: > await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host) E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')}) test/cluster/util.py:544: NoHostAvailable ``` Fixes #28011 Closes scylladb/scylladb#28040 (cherry picked from commit `34df158605`) Closes scylladb/scylladb#28073	2026-01-09 19:10:11 +01:00
Łukasz Paszkowski	6c8663b1ec	load_sketch: Allow populating load_sketch with normalized current load Currently, tablet allocation intentionally ignores current load (introduced by the commit #1e407ab), which could cause identical shard selection when allocating a small number of tablets in the same topology. When a tablet allocator is asked to allocate N tablets (where N is smaller than the number of shards on a node), it selects the first N lowest shards. If multiple such tables are created, each allocator run picks the same shards, leading to tablet imbalance across shards. This change initializes the load sketch with the current shard load, scaled into the [0,1] range, ensuring allocation still remains even while starting from globally least-loaded shards. Fixes https://github.com/scylladb/scylladb/issues/27620 Closes https://github.com/scylladb/scylladb/pull/27802 Closes scylladb/scylladb#28060	2026-01-09 18:42:03 +01:00
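The shard-selection effect described in the commit above can be illustrated with a minimal Python sketch. This is not ScyllaDB's `load_sketch` (which is C++): the names `normalized_load` and `allocate` are hypothetical, and dividing by the peak load is an assumption, since the message only says the load is scaled into the [0,1] range.

```python
import heapq

def normalized_load(current):
    """Scale per-shard tablet counts into [0, 1] (assumed: divide by the peak)."""
    peak = max(current)
    if peak == 0:
        return [0.0] * len(current)
    return [c / peak for c in current]

def allocate(n_tablets, initial_load):
    """Place tablets one by one on the least-loaded shard (ties go to the lowest shard id)."""
    heap = [(load, shard) for shard, load in enumerate(initial_load)]
    heapq.heapify(heap)
    chosen = []
    for _ in range(n_tablets):
        load, shard = heapq.heappop(heap)
        chosen.append(shard)
        heapq.heappush(heap, (load + 1, shard))
    return chosen

# Ignoring current load: two small tables both land on the same first shards.
empty = [0.0] * 4
first = allocate(2, empty)
second = allocate(2, empty)

# Seeding with the normalized current load: the second table avoids busy shards.
current = [1, 1, 0, 0]  # tablets per shard after the first table was created
seeded = allocate(2, normalized_load(current))
```

With four shards, `first` and `second` pick the same two shards, while `seeded` starts from the shards that are still empty.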
Piotr Dulikowski	1df6ef365e	Merge '[Backport 2025.4] service/storage_service: update service levels cache after upgrade to v2' from Scylladb[bot] Service levels cache is empty after upgrade to consistent topology if no mutations are committed to `system.service_levels_v2` or rolling restart is not done. To fix the bug, this patch adds service levels cache reloading after upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`. Fixes [SCYLLADB-90](https://scylladb.atlassian.net/browse/SCYLLADB-90) This fix should be backported to all versions containing service levels on Raft. [SCYLLADB-90]: https://scylladb.atlassian.net/browse/SCYLLADB-90?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ - (cherry picked from commit `53d0a2b5dc`) - (cherry picked from commit `be16e42cb0`) Parent PR: #27585 Closes scylladb/scylladb#28075 * github.com:scylladb/scylladb: service/storage_service: update service levels cache after upgrade to v2 service/storage_service: check if service levels were already upgraded before doing migration to raft	2026-01-09 17:51:49 +01:00
Anna Stuchlik	ec917fd5e7	doc: add the patch release upgrade procedure for version 2025.4 Adds the patch upgrade guide based on previous upgrade guides. Fixes https://github.com/scylladb/scylladb/issues/27982 Closes scylladb/scylladb#27985 (cherry picked from commit `f614482e66`) Closes scylladb/scylladb#28066	2026-01-09 16:23:27 +02:00
Michał Hudobski	133a92e86c	auth: add system table permissions to VECTOR_SEARCH_INDEXING Due to the recent changes in the vector store service, the service needs to read two of the system tables to function correctly. This was not accounted for when the new permission was added. This patch fixes that by allowing these tables (group0_history and versions) to be read with the VECTOR_SEARCH_INDEXING permission. We also add a test that validates this behavior. Fixes: SCYLLADB-73 Closes scylladb/scylladb#27546 (cherry picked from commit `ce3320a3ff`) Closes scylladb/scylladb#28042 Parent PR: #27546	2026-01-09 14:55:34 +01:00
Botond Dénes	1632853ebd	reader_concurrency_semaphore: add protection against negative count resource leaks The semaphore has detection and protection against regular resource leaks, where some resources go unaccounted for and are not released by the time the semaphore is destroyed. There is no detection or protection against negative leaks: where resources are "made up" out of thin air. This kind of leak looks benign at first sight: a few extra resources won't hurt anyone so long as this is a small amount. But it turns out that even a single extra count resource can defeat a very important anti-deadlock protection in can_admit_read(): the special case which admits a new permit regardless of memory resources, when all original count resources are available. This check uses ==, so if resource > original, the protection is defeated indefinitely. Instead of just changing == to >=, we add detection of such negative leaks to signal(), via on_internal_error_noexcept(). At this time I still don't know how this negative leak happens (the code doesn't confess); with this detection, hopefully we'll get a clue from tests or the field. Note that on_internal_error_noexcept() will not generate a coredump, unless ScyllaDB is explicitly configured to do so. In production, it will just generate an error log with a backtrace. The detection also clamps the _resources to _initial_resources, to prevent any damage from the negative leak. I just noticed that there is no unit test for the deadlock protection described above, so one is added in this PR, even if only loosely related to the rest of the patch. Fixes: SCYLLADB-163 Closes scylladb/scylladb#27764 (cherry picked from commit `e4da0afb8d`) Closes scylladb/scylladb#28004	2026-01-09 13:27:43 +02:00
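To make the negative-leak problem above concrete, here is a minimal Python sketch of a count-only semaphore. It is an illustration under assumptions, not the real `reader_concurrency_semaphore`: the class `CountSemaphore` and its members are hypothetical, and logging stands in for `on_internal_error_noexcept()`.

```python
import logging

class CountSemaphore:
    """Toy model: tracks only count resources, not memory."""

    def __init__(self, initial):
        self.initial = initial      # plays the role of _initial_resources
        self.available = initial    # plays the role of _resources
        self.negative_leaks = 0     # how many times a negative leak was detected

    def all_count_available(self):
        # Equality check, as described in the commit message: it would stay
        # false forever if `available` were allowed to exceed `initial`.
        return self.available == self.initial

    def consume(self, n=1):
        self.available -= n

    def signal(self, n=1):
        self.available += n
        if self.available > self.initial:
            # Negative leak: more was returned than was ever taken.
            self.negative_leaks += 1
            logging.error("negative resource leak: %d > %d",
                          self.available, self.initial)
            self.available = self.initial  # clamp to limit the damage

sem = CountSemaphore(2)
sem.consume()
sem.signal()
sem.signal()  # spurious extra signal: resources "made up" out of thin air
```

After the spurious `signal()`, the sketch records one detected leak and the equality-based check still holds because of the clamp.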
Jenkins Promoter	488be3c52d	Update ScyllaDB version to: 2025.4.2	2026-01-09 06:29:53 +02:00
Michał Jadwiszczak	29fc0d480c	service/storage_service: update service levels cache after upgrade to v2 Service levels cache is empty after upgrade to consistent topology if no mutations are committed to `system.service_levels_v2` or rolling restart is not done. To fix the bug, this commit adds service levels cache reloading after upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`. Fixes SCYLLADB-90 (cherry picked from commit `be16e42cb0`)	2026-01-08 22:46:36 +00:00
Michał Jadwiszczak	875b1ecacf	service/storage_service: check if service levels were already upgraded before doing migration to raft There is no need to call `service_level_controller::upgrade_to_v2()` on every topology state load, we only need to do it once. (cherry picked from commit `53d0a2b5dc`)	2026-01-08 22:46:36 +00:00
Dawid Mędrek	f5c6310c66	db/hints: Prevent draining hints before hint replay is allowed Context ------- The procedure of hint draining boils down to the following steps: 1. Drain a hint sender. That should get rid of all hints stored for the corresponding endpoint. 2. Remove the hint directory corresponding to that endpoint. Obviously, it gets more complex than this high-level perspective. Without blurring the view, the relevant information is that step 1 in the algorithm above may not be executed. Breaking it down, it consists of two calls to `hint_sender::send_hints_maybe()`. The function is responsible for sending out hints, but it's not unconditional and will not be performed if any of the following conditions holds: * `hint_sender::replay_allowed()` is not `true`. This can happen when hint replay hasn't been turned on yet. * `hint_sender::can_send()` is not `true`. This can happen if the corresponding endpoint is not alive AND it hasn't left the cluster AND it's still a normal token owner. There is one more relevant point: sending hints can be stopped if replaying hints fails and `hint_sender::send_hints_maybe()` returns `false`. However, that's not possible in the case of draining. In that case, if Scylla comes across any failure, it'll simply delete the corresponding hint segment. Because of that, we ignore it and only focus on the two bullets. --- Why is it a problem? -------------------- If a hint directory is not purged of all hint segments in it, any attempt to remove it will fail and we'll observe an error like this: ``` Exception when draining <host ID>: std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [<path>]) ``` The folder with the remaining hints will also stay on disk, which is, of course, undesired. --- When can it happen? 
------------------- As highlighted in the Context section of this commit message, the key part of the code that can lead to a dangerous situation like that is `hint_sender::send_hints_maybe()`. The function is called twice when draining a hint endpoint manager: once to purge all of the existing hints, and another time after flushing all hints stored in commitlog instances, but not listed by `hint_sender` yet. If any of those calls misbehaves, we may end up with a problem. That's why it's crucial to ensure that the function always goes through ALL of the hints. Dangerous situations: 1. We try to drain hints before hint replay is allowed. That will violate the first bullet above. 2. The node we're draining is dead, but it hasn't left the cluster, and it still possesses some tokens. --- How do we solve that? --------------------- Hint replay is turned on in `main.cc`. Once enabled, it cannot be disabled. So to address the first bullet above, it suffices to ensure that no draining occurs beforehand. It's perfectly fine to prevent it. Soon after hint replay is allowed, `main.cc` also asks the hint manager to drain all of the endpoint managers whose endpoints are no longer normal token owners (cf. `db::hints::manager::drain_left_nodes()`). The other bullet is more tricky. It's important here to know that draining is only initiated in three situations: 1. As part of the call to `storage_service::notify_left()`. 2. As part of the call to `storage_service::notify_released()`. 3. As part of the call to `db::hints::manager::drain_left_nodes()`. The last one is trivially non-problematic. The nodes that it'll try to drain are no longer normal token owners, so `can_send()` must always return `true`. The second situation is similar. 
As we read in the commit message of scylladb/scylladb@eb92f50413, which introduced the notion of released nodes, the nodes are no longer normal token owners: > In this patch we postpone the hint draining for the "left" nodes to > the time when we know that the target nodes no longer hold ownership > of any tokens - so they're no longer referenced in topology. I'm > calling such nodes "released". I suggest reading the full commit message there because the problems there are somewhat similar to the ones these changes try to solve. Finally, the first situation: unfortunately, it's more tricky. The same commit message says: > When a node is being replaced, it enters a "left" state while still > owning tokens. Before this patch, this is also the time when we start > draining hints targeted to this node, so the hints may get sent before > the token ownership gets migrated to another replica, and these hints > may get lost. This suggests that `storage_service::notify_left()` may be called when the corresponding node still has some tokens! That's something that may prevent properly draining hints. Fortunately, no hope is lost. We only drain hints via `notify_left()` when hinted handoff hasn't been upgraded to being host-ID-based yet. If it has, draining always happens via `notify_released()`. When I write this commit message, all of the supported versions of Scylla 2025.1+ use host-ID-based hinted handoff. That means that problems can only arise when upgrading from an older version of Scylla (2024.1 downwards). Because of that, we don't cover it. It would most likely require more extensive changes. --- Non-issues ---------- There are notions that are closely related to sending hints. One of them is the host filter that hinted handoff uses. It decides which endpoints are eligible for receiving hints, and which are not. Fortunately, all endpoints rejected by the host filter lose their hint endpoint managers -- they're stopped as part of that procedure. 
What's more, draining hints and changing the host filter cannot be happening at the same time, so it cannot lead to any problems. The solution ------------ To solve the described issue, we simply prevent draining hints before hint replay is allowed. No reproducer test is attached because it's not feasible to write one. Fixes scylladb/scylladb#27693 Closes scylladb/scylladb#27713 (cherry picked from commit `77a934e5b9`) Closes scylladb/scylladb#27972	2026-01-08 17:52:28 +02:00
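The guard described in the commit above can be sketched in a few lines of Python. This is a toy model under stated assumptions: `HintManager`, `allow_replay`, and `drain` are hypothetical names, and skipping an early drain by returning `False` is only one plausible way to "prevent" it; the real C++ code may differ.

```python
class HintManager:
    """Toy model of per-endpoint hint storage with a one-way replay switch."""

    def __init__(self):
        self.replay_allowed = False
        self.segments = {}  # endpoint -> list of pending hint segments

    def allow_replay(self):
        # Once enabled, replay is never disabled again.
        self.replay_allowed = True

    def drain(self, endpoint):
        if not self.replay_allowed:
            # Draining now could not send the hints, so the directory would
            # stay non-empty and its removal would fail. Skip instead.
            return False
        self.segments.pop(endpoint, None)  # "send" all hints, then drop the directory
        return True

manager = HintManager()
manager.segments["node-a"] = ["segment-1", "segment-2"]
early = manager.drain("node-a")   # before replay is allowed
kept = "node-a" in manager.segments
manager.allow_replay()
late = manager.drain("node-a")    # after replay is allowed
```

In this model the early attempt leaves the hints untouched, and the later attempt removes them.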
Anna Stuchlik	f68b032ce9	doc: remove cassandra-stress from installation instructions The cassandra-stress tool is no longer part of the default package and cannot be run in the way described. This commit removes the instruction to run cassandra-stress. Fixes https://github.com/scylladb/scylladb/issues/24994 Closes scylladb/scylladb#27726 (cherry picked from commit `624869de86`) Closes scylladb/scylladb#27951	2026-01-08 16:43:29 +02:00
Benny Halevy	c6593b3e8f	db: system_keyspace: get_group0_history: unfreeze_gently Prevent stall when the group0 history is too long using unfreeze_gently rather than the synchronous unfreeze() function Fixes #27872 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27873 (cherry picked from commit `f60033db63`) Closes scylladb/scylladb#27909	2026-01-08 16:43:06 +02:00
Avi Kivity	85f42de3f4	test: sstable_validation_test: actually test `ms` version sstable_validation_test tests the `scylla sstable validate` command by passing it intentionally corrupted sstables. It uses an sstable cache to avoid re-creating the same sstables. However, the cache does not consider the sstable version, so if called twice with the same inputs for different versions, it will return an sstable with the original version for both calls. As a result, `ms` sstables were not tested. Fix this bug by adding the sstable version (and the schema for good measure) to the cache key. An additional bug, hidden by the first, was that we corrupted the sstable by overwriting its Index.db component. But `ms` sstables don't have an Index.db component, they have a Partitions.db component. Adjust the corrupting code to take that into account. With these two fixes, test_scylla_sstable_validate_mismatching_partition_large fails on `ms` sstables. Disable it for that version. Since it was previously practically untested, we're not losing any coverage. Fixing this test unblocks further work on making pytest take charge of running the tests. pytest exposed this problem, likely by running it on different runners (and thus reducing the effectiveness of the cache). Fixes #27822. Closes scylladb/scylladb#27825 (cherry picked from commit `fc81983d42`) Closes scylladb/scylladb#27863	2026-01-08 16:42:41 +02:00
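The cache-key bug described above is easy to reproduce in isolation. The following Python sketch is hypothetical (`SSTableCache` and `build` are illustration-only names, and the version label `"other"` is a placeholder), but it shows why omitting the version from the key silently returns the first version's sstable.

```python
class SSTableCache:
    """Toy cache; `include_version` toggles between the buggy and fixed key."""

    def __init__(self, include_version):
        self.include_version = include_version
        self.entries = {}

    def get(self, inputs, version, build):
        key = (inputs, version) if self.include_version else (inputs,)
        if key not in self.entries:
            self.entries[key] = build(inputs, version)
        return self.entries[key]

def build(inputs, version):
    # Stand-in for generating a corrupted sstable of the given version.
    return f"{version}:{inputs}"

versions = ("other", "ms")  # "other" is a placeholder for any non-`ms` version

buggy = SSTableCache(include_version=False)
buggy_results = [buggy.get("corrupt-input", v, build) for v in versions]

fixed = SSTableCache(include_version=True)
fixed_results = [fixed.get("corrupt-input", v, build) for v in versions]
```

With the buggy key, both lookups return the sstable built for the first version, so the `ms` path is never exercised.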
Michał Jadwiszczak	99842b30e3	test/cluster/test_view_building_coordinator: fix flakiness in test_file_streaming The test generates a staging sstable on a node and verifies whether the view is correctly populated. However, view updates generated by a staging sstable (`view_update_generator::generate_and_propagate_view_updates()`) aren't awaited by the sstable consumer. It's possible that the view building coordinator may see the task as finished (so the staging sstable was processed) but not all view updates were written yet. This patch fixes the flakiness by waiting until `scylla_database_view_update_backlog` drops down to 0 on all shards. Fixes scylladb/scylladb#26683 Closes scylladb/scylladb#27389 (cherry picked from commit `74ab5addd3`) Closes scylladb/scylladb#27739	2026-01-08 16:42:08 +02:00
Lakshmi Narayanan Sreethar	d27849b1d3	sstables: prevent oversized allocation when parsing summary positions During sstable summary parsing, the entire header was read into a single buffer upfront and then parsed to obtain the positions. If the header was too large, it could trigger oversized allocation warnings. This commit updates the parse method to read one position at a time from the input stream instead of reading the entire header at once. Since `random_access_reader` already maintains an internal buffer of 128 KB, there is no need to pre-read the entire header upfront. Fixes #24428 Fixes #27590 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26846 (cherry picked from commit `3eba90041f`) Closes scylladb/scylladb#27638	2026-01-08 16:41:30 +02:00
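The change from "read the whole header, then parse" to "read one position at a time" can be sketched as follows. This Python example is only an analogy: `read_positions` is a hypothetical helper, and the 4-byte little-endian encoding is an assumption made for the illustration, not a statement about the sstable summary format.

```python
import io
import struct

def read_positions(stream, count):
    """Yield `count` positions, reading 4 bytes at a time from `stream`.

    No buffer proportional to the header size is ever allocated; a buffered
    stream already amortizes the small reads.
    """
    for _ in range(count):
        raw = stream.read(4)
        if len(raw) != 4:
            raise EOFError("truncated summary header")
        # Assumed encoding for this sketch: unsigned 32-bit little-endian.
        yield struct.unpack("<I", raw)[0]

header = struct.pack("<3I", 0, 17, 42)
positions = list(read_positions(io.BytesIO(header), 3))
```

The generator form also lets the caller stop early without having paid for the rest of the header.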
Benny Halevy	67cce6434b	utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout When waiting for the condition variable times out we call on_internal_error, but unfortunately, the backtrace it generates is obfuscated by `coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`. To make the log more useful, print the error injection name and the caller's source_location in the timeout error message. Fixes #27531 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27532 (cherry picked from commit `5f13880a91`) Closes scylladb/scylladb#27584	2026-01-08 16:40:58 +02:00
Botond Dénes	6fb37ae77f	Merge '[Backport 2025.4] alternator: fix batch writes during intranode tablet migrations' from Scylladb[bot] Scylla implements `LWT` in the `storage_proxy::cas` method. This method expects to be called on a specific shard, represented by the `cas_shard` parameter. Clients must create this object before calling `storage_proxy::cas`, check its `this_shard()` method, and jump to `cas_shard.shard()` if it returns false. The nuance is that by the time the request reaches the destination shard, the tablet may have already advanced in its migration state machine. For example, a client may acquire a `cas_shard` at the `streaming` tablet state, then submit a request to another shard via `smp::submit_to(cas_shard.shard())`. However, the new `cas_shard` created on that other shard might already be in the `write_both_read_new` state, and its `cas_shard.shard()` would not be equal to `this_shard_id()`. Such a broken invariant results in an `on_internal_error` in `storage_proxy::cas`. Clients of `storage_proxy::cas` are expected to check `cas_shard.this_shard()` and recursively jump to another shard if it returns false. Most calls to `storage_proxy::cas` already implement this logic. The only exception is `executor::do_batch_write`, which currently checks `cas_shard.this_shard()` only once. This can break the invariant if the tablet state changes more than once during the operation. This PR fixes the issue by implementing recursive `cas_shard.this_shard()` checks in `executor::do_batch_write`. It also adds a test that reproduces the problem. 
Fixes: scylladb/scylladb#27353 backport: need to be backported to 2025.4 - (cherry picked from commit `e60bcd0011`) - (cherry picked from commit `74bf24a4a7`) - (cherry picked from commit `9bef142328`) - (cherry picked from commit `c6eec4eeef`) - (cherry picked from commit `3a865fe991`) - (cherry picked from commit `0bcc2977bb`) - (cherry picked from commit `608eee0357`) Parent PR: #27396 Closes scylladb/scylladb#27529 * github.com:scylladb/scylladb: alternator/executor.cc: eliminate redundant dk copy alternator/executor.cc: release cas_shard on the original shard alternator/executor.cc: move shard check into cas_write alternator/executor.cc: make cas_write a private method alternator/executor.cc: make do_batch_write a private method alternator/executor.cc: fix indent test_alternator: add test_alternator_invalid_shard_for_lwt alternator/executor.cc: avoid cross-shard free	2026-01-08 16:40:20 +02:00
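The "check, jump, re-check" pattern described in the merge above can be modelled with a short loop. This Python sketch is an analogy, not the Alternator code: `run_on_owner_shard`, `owner_of`, and `do_cas` are hypothetical, and `max_hops` is an added safety bound that the source does not mention.

```python
def run_on_owner_shard(current_shard, owner_of, do_cas, max_hops=8):
    """Re-resolve the owning shard after every jump until it matches."""
    for _ in range(max_hops):
        owner = owner_of()           # like creating a fresh cas_shard here
        if owner == current_shard:   # like cas_shard.this_shard() being true
            return do_cas(current_shard)
        current_shard = owner        # like smp::submit_to(cas_shard.shard())
    raise RuntimeError("tablet kept migrating; giving up")

# The tablet moves twice while the request is in flight: a single check
# would have jumped to shard 1 and run there, although shard 2 owns it.
owners = iter([1, 2, 2])
result = run_on_owner_shard(0, lambda: next(owners), lambda s: f"cas on shard {s}")
```

Checking only once corresponds to running `do_cas` right after the first jump, which is the case that breaks the invariant when the tablet state changes again.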
Dawid Mędrek	088be56347	test/cluster/mv: Rewrite test_view_building_scheduling_group We rewrite the test to avoid flakiness. Instead of looking at the metrics, we make a trade-off and start depending on a less reliable mechanism -- logs. We grep all relevant messages printed by Scylla in TRACE mode and make sure that they were all printed from a context using the streaming scheduling group. Although it's a "less proper" way of testing, it should be much more dependable and avoid flakiness. Fixes scylladb/scylladb#25957 Closes scylladb/scylladb#26656 (cherry picked from commit `58dc414912`) Closes scylladb/scylladb#27504	2026-01-08 16:39:46 +02:00
Asias He	4bbdee8089	repair: Allow min max range to be updated for repair history It is observed that: repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb: seastar::rpc::remote_verb_error (repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum token,maximum token) is not in the format of (start, end]) This is because repair checks that the end of the range to be repaired is inclusive. When small_table_optimization is enabled for regular repair, a (minimum token,maximum token) range will be used. To fix, we can relax the check of (start, end] for the min max range. Fixes #27220 Closes scylladb/scylladb#27357 (cherry picked from commit `e97a504775`) Closes scylladb/scylladb#27461	2026-01-08 16:39:10 +02:00
Calle Wilund	ddc22a349c	encryption::gcp_host: Add exponential retry for server errors Fixes #27242 Similar to AWS, Google services may at times simply return a 503, more or less meaning "busy, please retry". In most cases we rely on higher-up layers to handle said retry, but we cannot fully do so, both because we sometimes reach this code through paths that do no such thing, and because it would be slightly inefficient, since we'd like to for example control the back-off for auth etc. This simply changes the existing retry loop in gcp_host to be a little more forgiving, special case 503 errors and extend the retry to the auth part, as well as re-use the exponential_backoff_retry primitive. v2: * Avoid backoff if refreshing credentials. Should not add latency due to this. * Only allow re-auth once per (non-service-failure-backoff) try. * Add abort source to both request and retry v3: * Include timeout and other server errors in retry-backoff v4: * Reorder error code handling correctly Closes scylladb/scylladb#27267 (cherry picked from commit `4169bdb7a6`) Closes scylladb/scylladb#27443	2026-01-08 16:38:32 +02:00
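A retry loop with exponential back-off for server errors, as discussed in the commit above, might look like the following Python sketch. It is not the `gcp_host` implementation: `call_with_backoff` is a hypothetical helper, treating every 5xx status as retryable is an assumption, and the re-authentication and abort-source handling mentioned in the message are left out.

```python
import time

def call_with_backoff(request, max_retries=5, base_delay=1.0, sleep=time.sleep):
    """Call `request()` until it returns a non-5xx status or retries run out.

    `request` returns a (status, body) tuple. The delay doubles after every
    failed attempt; `sleep` is injectable so tests need not actually wait.
    """
    delay = base_delay
    status, body = request()
    for _ in range(max_retries):
        if status < 500:          # assumed: only server errors are retried
            break
        sleep(delay)
        delay *= 2
        status, body = request()
    return status, body

# Two "busy" answers followed by a success.
responses = iter([(503, ""), (503, ""), (200, "ok")])
delays = []
outcome = call_with_backoff(lambda: next(responses), sleep=delays.append)
```

Passing `delays.append` as `sleep` records the back-off schedule instead of waiting, which keeps the example fast to run.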
Botond Dénes	6308304c9b	Merge '[Backport 2025.4] topology_coordinator: Add barrier to cleanup_target' from Scylladb[bot] Consider the following scenario: 1. A table has RF=3 and writes use CL=QUORUM 2. One node is down 3. There is a pending tablet migration from the unavailable node that is reverted During the revert, there can be a time window where the pending replica being cleaned up still accepts writes. This leads to write failures, as only two nodes (out of four) are able to acknowledge writes. This patch fixes the issue by adding a barrier to the cleanup_target tablet transition state, ensuring that the coordinator switches back to the previous replica set before cleanup is triggered. Fixes https://github.com/scylladb/scylladb/issues/26512 It's a pre-existing issue. Backport is required to all recent 2025.x versions. - (cherry picked from commit `669286b1d6`) - (cherry picked from commit `67f1c6d36c`) - (cherry picked from commit `6163fedd2e`) Parent PR: #27413 Closes scylladb/scylladb#27428 * github.com:scylladb/scylladb: topology_coordinator: Fix the indentation for the cleanup_target case topology_coordinator: Add barrier to cleanup_target test_node_failure_during_tablet_migration: Increase RF from 2 to 3	2026-01-08 16:37:52 +02:00
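The arithmetic behind the write failures in the scenario above can be checked with a small Python sketch. It assumes that a QUORUM write must be acknowledged by a quorum of the regular replicas plus every pending replica; the commit message does not spell this rule out, so treat it as an assumption, and the helper names are hypothetical.

```python
def quorum(rf):
    """Majority of `rf` replicas."""
    return rf // 2 + 1

def required_acks(rf, pending_replicas):
    # Assumed rule: pending replicas add to the number of required acks.
    return quorum(rf) + pending_replicas

rf = 3
able_to_ack = 2  # one replica is down, and the pending one is being cleaned up

# Window during the revert: the coordinator still counts the pending replica.
during_revert = required_acks(rf, pending_replicas=1)
write_ok_during_revert = able_to_ack >= during_revert

# After the barrier: the coordinator is back on the previous replica set.
after_barrier = required_acks(rf, pending_replicas=0)
write_ok_after_barrier = able_to_ack >= after_barrier
```

Under that assumption, three acknowledgements are needed during the window but only two nodes can answer, which matches the "two out of four" failure described in the message.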
Calle Wilund	5b8d6e21f1	commitlog::read_log_file: Check for eof position on all data reads Fixes #24346 When reading, we check for each entry and each chunk, if advancing there will hit EOF of the segment. However, IFF the last chunk being read has the last entry _exactly_ matching the chunk size, and the chunk ending at _exactly_ segment size (preset size, typically 32Mb), we did not check the position, and instead complained about not being able to read. This has literally _never_ happened in actual commitlog (that was replayed at least), but has apparently happened more and more in hints replay. Fix is simple, just check the file position against size when advancing said position, i.e. when reading (skipping already does). v2: * Added unit test Closes scylladb/scylladb#27236 (cherry picked from commit `59c87025d1`) Closes scylladb/scylladb#27346	2026-01-08 16:37:22 +02:00
Aleksandra Martyniuk	50e2a1a9b0	replica: database: change type of tables_metadata::_ks_cf_to_uuid If there are a lot of tables, a node reports oversized allocation in _ks_cf_to_uuid of type flat_hash_map. Change the type to std::unordered_map to prevent oversized allocations. Fixes: https://github.com/scylladb/scylladb/issues/26787. Closes scylladb/scylladb#27165 (cherry picked from commit `19a7d8e248`) Closes scylladb/scylladb#27200	2026-01-08 16:36:34 +02:00
Botond Dénes	bd58857680	Merge '[Backport 2025.4] db: batchlog_manager: update _last_replay only if all batches were re…' from Scylladb[bot] …played Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk data resurrection. Update _last_replay only if all batches were replayed. Fixes: https://github.com/scylladb/scylladb/issues/24415. Needs backport to all live versions. - (cherry picked from commit `4d0de1126f`) - (cherry picked from commit `e3dcb7e827`) Parent PR: #26793 Closes scylladb/scylladb#27094 * github.com:scylladb/scylladb: test: extend test_batchlog_replay_failure_during_repair db: batchlog_manager: update _last_replay only if all batches were replayed	2026-01-08 16:36:04 +02:00
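The rule "advance `_last_replay` only if every batch was replayed" can be sketched in Python as follows. `BatchlogManager` here is a hypothetical toy class, and `try_replay` stands in for whatever decides that a single batch was replayed successfully.

```python
class BatchlogManager:
    """Toy model that tracks the time of the last fully successful replay."""

    def __init__(self):
        self.last_replay = 0

    def replay(self, now, batches, try_replay):
        # Attempt every batch, remembering whether any of them failed.
        results = [try_replay(batch) for batch in batches]
        all_replayed = all(results)
        if all_replayed:
            # Only a complete replay may move the timestamp forward; otherwise
            # a later reader of last_replay would assume too much was replayed.
            self.last_replay = now
        return all_replayed

manager = BatchlogManager()
partial = manager.replay(10, ["a", "b"], lambda batch: batch != "b")
after_partial = manager.last_replay
complete = manager.replay(20, ["a", "b"], lambda batch: True)
after_complete = manager.last_replay
```

In the sketch the partial replay leaves the timestamp untouched, and only the complete one advances it.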
Patryk Jędrzejczak	ae5eebc04a	Merge '[Backport 2025.4] test/raft: use valid sentinel in liveness check to prevent digest errors' from Scylladb[bot] Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants. The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value. By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test. The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry. Fixes: scylladb/scylladb#27307 Backporting to active branches - this is a test-only fix (low risk) for a flaky test that exists in older branches (thus affects the CI of active branches). - (cherry picked from commit `3af5183633`) - (cherry picked from commit `4ba3e90f33`) Parent PR: #28010 Closes scylladb/scylladb#28038 * https://github.com/scylladb/scylladb: test/raft: use valid sentinel in liveness check to prevent digest errors test/raft: improve debugging in randomized_nemesis_test test/raft: improve reporting in the randomized_nemesis_test digest functions	2026-01-08 15:34:00 +01:00
Anna Stuchlik	eea09f8565	doc: remove references to ScyllaDB versions 4.3 and 4.4 We should never refer to the no longer supported OSS versions. This is a leftover - other mentions were removed long time ago. Fixes https://github.com/scylladb/scylladb/issues/19569 Closes scylladb/scylladb#27656 (cherry picked from commit `ea6f2a21c6`) Closes scylladb/scylladb#27683	2026-01-08 15:14:50 +01:00
Anna Stuchlik	a003f47def	doc: fix the syntax of internal links Some internal links had the wrong syntax: they were formatted as external links. As a result, they redirected the user to the outdated Open Source documentation. This commit fixes that bug. Fixes https://github.com/scylladb/scylladb/issues/25899 Closes scylladb/scylladb#27905 (cherry picked from commit `375479d96c`) Closes scylladb/scylladb#28003	2026-01-08 14:57:39 +02:00
Emil Maskovsky	b1dcbc2199	test/raft: use valid sentinel in liveness check to prevent digest errors Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants. The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value. By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test. The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry. Fixes: scylladb/scylladb#27307 (cherry picked from commit `4ba3e90f33`)	2026-01-08 11:53:19 +01:00
Emil Maskovsky	404fee4568	test/raft: improve debugging in randomized_nemesis_test Move the post-condition check before the assertion to ensure it is always executed first. Before, the wrong value could be passed to the digest_remove assertion, making the pre-check trigger there instead of the post-check as expected. Also, add a check in the append_seq constructor to ensure that the digest value is valid when creating an append_seq object. (cherry picked from commit `3af5183633`)	2026-01-08 11:53:15 +01:00
Emil Maskovsky	0fe860910a	test/raft: improve reporting in the randomized_nemesis_test digest functions The Boost ASSERTs in the digest functions of the randomized_nemesis_test were not working well inside the state machine digest functions, leading to unhelpful boost::execution_exception errors that terminated the apply fiber, and didn't provide any helpful information. Replaced by explicit checks with on_fatal_internal_error calls that provide more context about the failure. Also added validation of the digest value after appending or removing an element, which allows to determine which operation resulted in causing the wrong value. This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282, but adds improved error reporting. Refs: scylladb/scylladb#27307 Refs: scylladb/scylladb#17030 (cherry picked from commit `d60b908a8e`)	2026-01-08 11:53:07 +01:00
Jenkins Promoter	812750f770	Update pgo profiles - x86_64	2026-01-04 01:40:49 -05:00
Asias He	392c65b83f	topology_coordinator: Ensure repair_update_compaction_ctrl is executed Consider this: - n1 is a coordinator and schedules tablet repair - n1 detects tablet repair failed, so it schedules tablet transition to end_repair state - n1 loses leadership and n2 becomes the new topology coordinator - n2 runs end_repair on the tablet with session_id=00000000-0000-0000-0000-000000000000 - when a new tablet repair is scheduled, it hangs since the lock is already taken because it was not removed in previous step To fix, we use the global_tablet_id to index the lock instead of the session id. In addition, we retry the repair_update_compaction_ctrl verb in case of error to ensure the verb is eventually executed. The verb handler is also updated to check if it is still in end_repair stage. Fixes #26346 Closes scylladb/scylladb#27740 (cherry picked from commit `3abda7d15e`) Closes scylladb/scylladb#27940 scylla-2025.4.1 scylla-2025.4.1-candidate-20260101051132	2026-01-01 14:08:29 +02:00
Jenkins Promoter	8dcbe011e6	Update pgo profiles - aarch64	2026-01-01 04:53:35 +02:00
Botond Dénes	c697c6633b	Merge 'Remove noexcept from storage_group and table functions to allow exception propagation' from Tomasz Grabiec Fixed a critical bug where `storage_group::for_each_compaction_group()` was incorrectly marked `noexcept`, causing `std::terminate` when actions threw exceptions (e.g., `utils::memory_limit_reached` during memory-constrained reader creation). Changes made: 1. Removed `noexcept` from `storage_group::for_each_compaction_group()` declaration and implementation 2. Removed `noexcept` from `storage_group::compaction_groups()` overloads (they call for_each_compaction_group) 3. Removed `noexcept` from `storage_group::live_disk_space_used()` and `memtable_count()` (they call compaction_groups()) 4. Kept `noexcept` on `storage_group::flush()` - it's a coroutine that automatically captures exceptions and returns them as exceptional futures 5. Removed `noexcept` from `table_load_stats()` functions in base class, table, and storage group managers Rationale: There's no reason to kill the server if these functions throw. For coroutines returning futures, `noexcept` is appropriate because Seastar automatically captures exceptions and returns them as exceptional futures. For other functions, proper exception handling allows the system to recover gracefully instead of terminating. Fixes #27475 Closes scylladb/scylladb#27476 * github.com:scylladb/scylladb: replica: Remove unnecessary noexcept replica: Remove noexcept from compaction_groups() functions replica: Remove noexcept from storage_group::for_each_compaction_group (cherry picked from commit `730eca5dac`) Closes scylladb/scylladb#27914	2025-12-30 14:23:30 +01:00
Gleb Natapov	9e205cc3a6	raft topology: Notify that a node was removed only once Raft topology goes over all nodes in a 'left' state and triggers 'remove node' notification in case id/ip mapping is available (meaning the node left recently), but the problem is that, since the mapping is not removed immediately, when multiple nodes are removed in succession a notification for the same node can be sent several times. Fix that by sending notification only if the node still exists in the peers table. It will be removed by the first notification and following notification will not be sent. Closes scylladb/scylladb#27743 (cherry picked from commit `4a5292e815`) Closes scylladb/scylladb#27913	2025-12-30 11:17:41 +01:00
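The "notify only once" rule described in the commit above can be sketched in a few lines. This is an illustration only, not ScyllaDB code: `notify_left_nodes`, the `peers` dictionary and the `on_removed` callback are hypothetical names, and the sketch assumes that handling the first notification is what deletes the node from the peers table.

```python
def notify_left_nodes(left_nodes, peers, on_removed):
    """Send a 'node removed' notification only for left nodes still in peers."""
    for host_id in left_nodes:
        if host_id not in peers:
            # Already handled by an earlier pass; do not notify again.
            continue
        on_removed(host_id)
        # Assumption: handling the notification drops the row from peers.
        del peers[host_id]


# Hypothetical state: two nodes in the 'left' state, both still in peers.
peers = {"node-a": "10.0.0.1", "node-b": "10.0.0.2"}
sent = []

# Two passes model two removals in quick succession.
notify_left_nodes(["node-a"], peers, sent.append)
notify_left_nodes(["node-a", "node-b"], peers, sent.append)
```

After both passes, `sent` holds each node exactly once, even though `node-a` was still listed as left during the second pass.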
Dario Mirovic	fa3146e76f	test: dtest: audit_test.py: fix audit error log detection `test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py` has an insert statement that is expected to fail. Dtest environment uses `FlakyRetryPolicy`, which has `max_retries = 5`. 1 initial fail and 5 retry fails means we expect 6 error audit logs. The test failed because `create keyspace ks` failed once, then succeeded on retry. It allowed the test to proceed properly, but the last part of the test that expects exactly 6 failed queries actually had 7. The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed queries, counting only the query expected to fail. If other queries fail with successful retry, it's fine. If other queries fail without successful retry, the test will fail, as it should in such situations. They are not related to this expected failed insert statement. Fixes #27322 Closes scylladb/scylladb#27378 (cherry picked from commit `f545ed37bc`) Closes scylladb/scylladb#27582	2025-12-29 18:12:45 +02:00
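The counting rule from the audit_test.py fix above (exactly 1 + `max_retries` failures, counting only the statement that is expected to fail) can be sketched as follows. This is a minimal illustration, not the dtest code: `count_expected_failures`, the `(statement, is_error)` entry shape and the sample statements are hypothetical; only `max_retries = 5` comes from the commit message.

```python
MAX_RETRIES = 5  # value quoted in the commit message for FlakyRetryPolicy


def count_expected_failures(audit_entries, expected_statement):
    """Count error entries for the one statement that is expected to fail."""
    return sum(
        1
        for statement, is_error in audit_entries
        if is_error and statement == expected_statement
    )


# Hypothetical audit log as (statement, is_error) pairs.
insert = "INSERT INTO ks.t (k) VALUES (1)"
entries = [
    ("CREATE KEYSPACE ks", True),   # unrelated failure ...
    ("CREATE KEYSPACE ks", False),  # ... that succeeded on retry
]
entries += [(insert, True)] * (1 + MAX_RETRIES)  # 1 initial failure + 5 retries

total_errors = sum(1 for _, is_error in entries if is_error)
expected_errors = count_expected_failures(entries, insert)
```

Counting every error entry gives 7 here, which is the situation that broke the test; filtering on the expected statement gives the intended 6.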
Nadav Har'El	bc87366b32	Merge '[Backport 2025.4] test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions thr…' from Scylladb[bot] …eshold The initial problem: Some of the tests in test_protocol_exceptions.py started failing. The failure is on the condition that no more than `cpp_exception_threshold` exceptions happened. Test logic: These tests assert that specific code paths do not throw an exception anymore. The initial implementation ran a code path once and asserted there were 0 exceptions. Sometimes one or several exceptions can occur that are not directly related to the code paths the tests check, but those would fail the tests. The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen the `run_count = 100` and `cpp_exception_threshold = 10` values. Note that the exceptions are counted per shard, not per code path. The new problem: The occasional exceptions thrown by some parts of the server are now thrown a bit more often than before. Based on the logs linked on the issues, the count is usually 12. There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter exception logs only for wanted exceptions. However, if a new, different exception is introduced, it might not be counted. Another approach is to just increase the threshold a bit. The issue of throwing more exceptions than before in some other server modules should be addressed by a set of tests for that module, just like these tests check protocol exceptions, not caring who used protocol check code paths. For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. 
It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run). Still, to make the occurrence of "background exceptions" a bit more normalized, `run_count` is doubled too, from `100` to `200`. At first glance this looks like nothing has changed, but doubling both the run count and the exception threshold here implies that the burst does not scale as much as the run count; it is just that the "jitter" is bigger than the old threshold. Also, this patch series enables debug logging for the `exception` logger. This will allow us to inspect which exceptions happened if a protocol exceptions test fails again. Fixes #27247 Fixes #27325 Issue observed on master and branch-2025.4. The tests, in the same form, exist on master, branch-2025.4, branch-2025.3, branch-2025.2, and branch-2025.1. The code change is simple, and no issue is expected with backport automation. Thus, backports for all the aforementioned versions are requested. - (cherry picked from commit `807fc68dc5`) - (cherry picked from commit `c30b326033`) Parent PR: #27412 Closes scylladb/scylladb#27555 * github.com:scylladb/scylladb: test: cqlpy: test_protocol_exceptions.py: enable debug exception logging test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold	2025-12-29 11:27:16 +02:00
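The reasoning behind the threshold in the merge above can be checked with a small sketch. It is not the code from test_protocol_exceptions.py: `passes` is a hypothetical helper and the noise figure is only an example. The values `200` and `20` are the new `run_count` and `cpp_exception_threshold` from the commit message, and `12` is the background count it reports from the linked logs.

```python
RUN_COUNT = 200                 # new run_count from the commit message
CPP_EXCEPTION_THRESHOLD = 20    # new cpp_exception_threshold


def passes(exception_count, threshold=CPP_EXCEPTION_THRESHOLD):
    """The test passes if no more than `threshold` exceptions were observed."""
    return exception_count <= threshold


# Example background noise, using the figure reported in the linked logs.
background_noise = 12

# No regression: only unrelated background exceptions are counted.
healthy_count = background_noise

# Regression: the checked code path throws once per run, on top of the noise.
regressed_count = RUN_COUNT + background_noise

healthy_ok = passes(healthy_count)
regressed_ok = passes(regressed_count)
old_threshold_ok = passes(healthy_count, threshold=10)
```

A regression adds at least `run_count` exceptions, so it stays far above the threshold, while a background count of 12 fails the old threshold of 10 and passes the new one.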
Gleb Natapov	1f8c2744a4	topology coordinator: set session id for streaming at the correct time Commit `d3efb3ab6f` added a streaming session for rebuild, but it set the session at request submission time. The session should be set when the request starts execution, so this patch moves it to the correct place. Closes scylladb/scylladb#27757 (cherry picked from commit `04976875cc`) Closes scylladb/scylladb#27867	2025-12-28 13:32:44 +02:00
Ferenc Szili	c08b2290dc	test: fix flakiness caused by TRUNCATE retries The test test_truncate_during_topology_change tests TRUNCATE TABLE while bootstrapping a new node. With tablets enabled, TRUNCATE is a global topology operation which needs to serialize with bootstrap. When TRUNCATE TABLE is issued, it first checks if there is an already queued truncate for the same table. This can happen if a previous TRUNCATE operation has timed out, and the client retried. The newly issued truncate will only join the queued one if it is waiting to be processed, and will fail immediately if the TRUNCATE is already being processed. In this test, TRUNCATE will be retried after a timeout (1 minute) due to the default retry policy, and will be retried up to 3 times, while the bootstrap is delayed by 2 minutes. This means that the test can validate the result of a truncate which was started after bootstrap was completed. Because of the way truncate joins existing truncate operations, we can also have the following scenario: - TRUNCATE times out after one minute because the new node is being bootstrapped - the client retries the TRUNCATE command which also times out after 1m - the third attempt is received during TRUNCATE being processed which fails the test This patch changes the retry policy of the TRUNCATE operation to FallthroughRetryPolicy which guarantees that TRUNCATE will not be retried on timeout. It also increases the timeout of the TRUNCATE from 1 to 4 minutes. This way the test will actually validate the performance of the TRUNCATE operation which was issued during bootstrap, instead of the subsequent, retried TRUNCATEs which could have been issued after the bootstrap was complete. Fixes: #26347 Closes scylladb/scylladb#27245 (cherry picked from commit `d883ff2317`) Closes scylladb/scylladb#27507	2025-12-23 17:06:48 +02:00
Anna Stuchlik	bfff9ebe15	doc: document support for i8g and i8ge instances Fixes https://github.com/scylladb/scylladb/issues/27703 Closes scylladb/scylladb#27754 (cherry picked from commit `4c247a5d08`) Closes scylladb/scylladb#27827	2025-12-23 10:47:34 +02:00
Anna Stuchlik	0ed82c1877	doc: add a Vector Search page under Features This commit adds a page with an overview of Vector Search under the Features section. It includes a link to the VS documentation in ScyllaDB Cloud, as the feature is only available in ScyllaDB Cloud. The purpose of the page is to raise awareness of the feature. Fixes https://scylladb.atlassian.net/browse/VECTOR-215 Closes scylladb/scylladb#27787 (cherry picked from commit `9793a45288`) Closes scylladb/scylladb#27826	2025-12-23 10:15:23 +02:00
Karol Nowacki	1c0891d577	vector_search: test: Fix flaky DNS resolution test The `vector_store_client_test_dns_resolving_repeated` test had race conditions causing it to be flaky. Two main issues were identified: 1. Race between initial refresh and manual trigger: The test assumes a specific resolution sequence, but timing variations between the initial DNS refresh (on client creation) and the first manual trigger (in the test loop) can cause unexpected delayed scheduling. 2. Extra triggers from resolve_hostname fiber: During the client refresh phase, the background DNS fiber clears the client list. If resolve_hostname executes in the window after clearing but before the update completes, pending triggers are processed, incrementing the resolution count unexpectedly. At count 6, the mock resolver returns a valid address (count % 3 == 0), causing the test to fail. The fix relaxes test assertions to verify retry behavior and client clearing on DNS address loss, rather than enforcing exact resolution counts. Fixes: #27074 Closes scylladb/scylladb#27685 (cherry picked from commit `addac8b3f7`) Closes scylladb/scylladb#27799	2025-12-23 09:13:22 +02:00
Aleksandra Martyniuk	fc9aac0a58	test: extend test_batchlog_replay_failure_during_repair Modify test_batchlog_replay_failure_during_repair to also check that there isn't data resurrection if flushing hints falls within the repair cache timeout. (cherry picked from commit `e3dcb7e827`)	2025-12-22 14:45:08 +01:00
Aleksandra Martyniuk	9f339ec3e0	db: batchlog_manager: update _last_replay only if all batches were replayed Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection. Update _last_replay only if all batches were replayed. Fixes: https://github.com/scylladb/scylladb/issues/24415. (cherry picked from commit `4d0de1126f`)	2025-12-22 14:44:52 +01:00
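The batchlog fix above comes down to one guard: advance the last-replay timestamp only when every batch was replayed. The sketch below illustrates that rule under stated assumptions; `BatchlogSketch`, `replay_one` and the integer timestamps are hypothetical and do not mirror the real `batchlog_manager` interface.

```python
class BatchlogSketch:
    """Toy model of a batchlog manager that tracks its last full replay."""

    def __init__(self):
        self.last_replay = 0  # timestamp of the last fully successful replay

    def replay(self, batches, now, replay_one):
        """Replay all batches; advance last_replay only if none failed."""
        all_replayed = True
        for batch in batches:
            if not replay_one(batch):
                all_replayed = False  # keep going, but remember the failure
        if all_replayed:
            self.last_replay = now
        return all_replayed


manager = BatchlogSketch()

# First pass at t=10: batch "b" fails, so last_replay must not move.
partial = manager.replay(["a", "b"], now=10, replay_one=lambda b: b != "b")
after_partial = manager.last_replay

# Second pass at t=20: everything succeeds, so last_replay advances.
full = manager.replay(["a", "b"], now=20, replay_one=lambda b: True)
after_full = manager.last_replay
```

Because a consumer such as the hints flush path reads `last_replay` as "everything up to this point was replayed", leaving it unchanged after a partial replay is what avoids the data resurrection risk the commit describes.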
Michał Hudobski	af14df5459	vector_search: throw an error when we restrict primary in vector search We currently allow restrictions on single column primary key, but we ignore the restriction and return all results. This can confuse the users. We change it so such a restriction will throw an error and add a test to validate it. Fixes: VECTOR-331 Closes scylladb/scylladb#27668	2025-12-21 19:29:03 +02:00
Emil Maskovsky	49306c76f0	test/raft: fix race condition in failure_detector_test The test had a sporadic failure due to a broken promise exception. The issue was in `test_pinger::ping()` which captured the promise by move into the subscription lambda, causing the promise to be destroyed when the lambda was destroyed during coroutine unwinding. Simplify `test_pinger::ping()` by replacing manual abort_source/promise logic with `seastar::sleep_abortable()`. This removes the risk of promise lifetime/race issues and makes the code simpler and more robust. Fixes: scylladb/scylladb#27136 Backport to active branches: This fixes a CI test issue, so it is beneficial to backport the fix. As this is a test-only fix, it is a low risk change. Closes scylladb/scylladb#27737 (cherry picked from commit `2a75b1374e`) Closes scylladb/scylladb#27784	2025-12-21 19:26:20 +02:00
Anna Stuchlik	4888f5b008	doc: remove the links to the Download Center This commit removes the remaining links to the Download Center on the website. We no longer use it for installation, and we don't want users to infer that something like that still exists. Fixes https://github.com/scylladb/scylladb/issues/27753 Closes scylladb/scylladb#27756 (cherry picked from commit `f65db4e8eb`) Closes scylladb/scylladb#27785	2025-12-21 19:22:51 +02:00


50232 Commits