scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-25 19:10:42 +00:00

Author	SHA1	Message	Date
Yaniv Kaul	f1c9eda49e	Potential fix for code scanning alert no. 144: Workflow does not contain permissions Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27809	2026-01-12 12:21:35 +02:00
Marcin Maliszkiewicz	3c9f52e709	Merge 'doc: update the Web Installer instructions' from Anna Stuchlik This PR: - Replaces a fixed version name with the variable for the current version in the instructions for installing a non-default version with Web Installer. This will make using the installer more user-friendly. - Removes the instruction for Open Source from the Web Installer docs. Fixes https://github.com/scylladb/scylladb/issues/28005 Fixes https://github.com/scylladb/scylladb/issues/28079 Closes scylladb/scylladb#28046 * github.com:scylladb/scylladb: doc: remove the instruction for Open Source from the Web Installer docs doc: add the version variable to the Web Installer instructions	2026-01-12 11:10:04 +01:00
Petr Gusev	889d7782ed	treewide: use coroutine::maybe_yield in coroutines It's more efficient since coroutine::maybe_yield returns a lightweight struct (awaitable), not the future. Closes scylladb/scylladb#28101	2026-01-12 10:38:47 +01:00
Marcin Maliszkiewicz	09af3828ab	auth: remove confusing deprecation msg from hash_with_salt Closes scylladb/scylladb#27705	2026-01-12 10:12:54 +01:00
Alex	e430065c92	db: views: serialize create/drop view operations via shard 0 Create and drop view operations are currently performed on all shards, and their execution is not fully serialized. On slower processors this can lead to interleavings that leave stale entries in `system.scylla_views_builds_in_progress`. A problematic sequence looks like this: * `on_create_view()` runs on shard 0 → entries for shard 0 and shard 1 are created * `on_drop_view()` runs on shard 0 → entry for shard 0 is removed * `on_create_view()` runs on shard 1 → entries for shard 0 and shard 1 are created again * `on_drop_view()` runs on shard 1 → entry for shard 1 is removed, while the shard 0 entry remains This results in a leftover row in `system.scylla_views_builds_in_progress`, causing `view_build_test.cc` to get stuck indefinitely in an eventual state and eventually be terminated by CI. This patch fixes the issue by fully serializing all view create and drop operations through shard 0. Shard 0 becomes the single execution point and notifies other shards to perform their work in order. Requests originating. new process: - view_builder::on_create_view(...) runs only on shard 0 and kicks off dispatch_create_view(...) in the background. - dispatch_create_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed. - dispatch_create_view(...) calls handle_seed_view_build_progress(...) on shard 0. That: - writes the global “build progress” row across all shards via _sys_ks.register_view_for_building_for_all_shards(...). - After seeding, dispatch_create_view(...) broadcasts to all shards with container().invoke_on_all(...). - Each shard runs handle_create_view_local(...), which: - waits for pending base writes/streams, flushes the base, - resets the reader to the current token and adds the new view, - handles errors and triggers _build_step to continue processing. Drop view - view_builder::on_drop_view(...) runs only on shard 0 and kicks off dispatch_drop_view(...) 
in the background. - dispatch_drop_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed. - It broadcasts handle_drop_view_local(...) to all shards with invoke_on_all(...). - Each shard runs handle_drop_view_local(...), which: - removes the view from local build state (_base_to_build_step and _built_views) by scanning existing steps, - ignores missing keyspace cases. - After all shards finish local cleanup, shard 0 runs handle_drop_view_global_cleanup(...), which: - removes global build progress, built‑view state, and view build status in system tables, Shutdown - drain() waits on _view_notification_sem before _sem so in‑flight dispatches finish before bookkeeping is halted. In addition, the test is adjusted to remove the long eventual wait (596.52s / 30 iterations) and instead rely on the default wait of 17 iterations (~4.37 minutes), eliminating unnecessary delays while preserving correctness. Fixes: https://github.com/scylladb/scylladb/issues/27898 Backport: not required as the problem happens on master Closes scylladb/scylladb#27929	2026-01-12 09:23:22 +02:00
Michał Hudobski	92c988514c	vector_search: allow all where clauses in vector search queries To prepare for implementation of filtering we skip validation of where clauses in vector search queries. All queries that would be blocked by the lack of ALLOW FILTERING now will pass through. Fixes: VECTOR-410 Closes scylladb/scylladb#27758	2026-01-11 12:56:44 +02:00
Marcin Maliszkiewicz	03e0dd0841	Merge 'test/alternator: fix most tests to run on DynamoDB' from Nadav Har'El We can run Alternator's tests against DynamoDB with `test/alternator/run --aws`, and our intention is that all except a few specially marked should pass on DynamoDB - indicating that the test itself is correct and checks compatibility with DynamoDB and not with some misunderstood spec. Before this patch series, almost two dozen Alternator's tests failed on DynamoDB. This series fixes most of them. Refs #26079 (it fixes almost all the problems but probably not all of them so let's keep the issue open for a while longer) Closes scylladb/scylladb#27995 * github.com:scylladb/scylladb: test/alternator: fix some expected error messages to fit DynamoDB test/alternator: fix compressed request test on non-us-east1 test/alternator: fix test's expected error message on DynamoDB test/alternator: mark Alternator-only test scylla_only test/alternator: fix test on DynamoDB test/alternator: increase wait_for_gsi() timeout test/alternator: fix test passing a spurious parameter	2026-01-09 18:05:20 +01:00
Botond Dénes	7e1c8776b7	docs: remove sstabledump and sstablemetadata These tools are deprecated and no longer shipped by ScyllaDB packages. They no longer support the latest SSTable versions and ScyllaDB-only features, like encryption and dictionary based compression. Remove them from the documentation. Closes scylladb/scylladb#27608	2026-01-09 17:31:54 +01:00
Dawid Mędrek	2385afa1c7	scripts/pull_github_pr.sh: Update instructions for creating token The interface of Jenkins has changed, and the instructions for creating a token are out-of-date. This commit updates them. Closes scylladb/scylladb#28054	2026-01-09 17:45:00 +02:00
Ferenc Szili	0ede8d154b	docs: add docs for size based load balancing This patch updates the documentation for size based load balancing. Closes scylladb/scylladb#27616	2026-01-09 16:25:25 +02:00
Yaniv Michael Kaul	af8eaa9ea5	scripts: fixes flagged by CodeQL/PyLens Unused imports, unused variables and such. Initially, there were no functional changes, just to get rid of some standard CodeQL warnings. I then broke the CI, as apparently there's an install-time(!?) Python script creation for the sole purpose of product naming. I changed it - we have it in etcdir, as SCYLLA-PRODUCT-FILE. So I added (copied from a different script) a get_product() helper function in scylla_util.py and used it instead. While at it, I also fixed the too-broad import from scylla_util, which 'forced' me to also fix other specific imports (such as shutil). Improvement - no need to backport. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#27883	2026-01-09 15:13:12 +02:00
Anna Stuchlik	396093ff60	doc: remove the instruction for Open Source from the Web Installer docs Fixes https://github.com/scylladb/scylladb/issues/28079	2026-01-09 14:07:32 +01:00
Botond Dénes	af6cb0d0a4	Merge 'raft topology: preserve IP -> ID mapping of a replacing node on restart' from Patryk Jędrzejczak We currently do it only for a bootstrapping node, which is a bug. The missing IP can cause an internal error, for example, in the following scenario: - replace fails during streaming, - all live nodes are shut down before the rollback of replace completes, - all live nodes are restarted, - live nodes start hitting internal error in all operations that require IP of the replacing node (like client requests or REST API requests coming from nodetool). We fix the bug here, but we do it separately for replace with different IP and replace with the same IP. For replace with different IP, we persist the IP -> host ID mapping in `system.peers` just like for bootstrap. That's necessary, since there is no other way to determine IP of the replacing node on restart. For replace with the same IP, we can't do the same. This would require deleting the row corresponding to the node being replaced from `system.peers`. That's fine in theory, as that node is permanently banned, so its IP shouldn't be needed. Unfortunately, we have many places in the code where we assume that IP of a topology member is always present in the address map or that a topology member is always present in the gossiper endpoint set. Examples of such places: - nodetool operations, - REST API endpoints, - `db::hints::manager::store_hint`, - `group0_voter_handler::update_nodes`. We could fix all those places and verify that drivers work properly when they see a node in the token metadata, but not in `system.peers`. However, that would be too risky to backport. We take a different approach. We recover IP of the replacing node on restart based on the state of the topology state machine and `system.peers` just after loading `system.peers`. We rely on the fact that group 0 is set up at this point. 
The only case where this assumption is incorrect is a restart in the Raft-based recovery procedure. However, hitting this problem then seems improbable, and even if it happens, we can restart the node again after ensuring that no client and REST API requests come before replace is rolled back on the new topology coordinator. Hence, it's not worth complicating the fix (e.g. by looking at the persistent topology state instead of the in-memory state machine). Fixes #28057 Backport this PR to all branches as it fixes a problematic bug. Closes scylladb/scylladb#27435 * github.com:scylladb/scylladb: gossiper: add_saved_endpoint: make generations of excluded nodes negative test: introduce test_full_shutdown_during_replace utils: error_injection: allow aborting wait_for_message raft topology: preserve IP -> ID mapping of a replacing node on restart	2026-01-09 14:56:16 +02:00
Calle Wilund	a7cdb602e1	db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc Fixes #27992 When doing a commit log oversized allocation, we lock out all other writers by grabbing the _request_controller semaphore fully (max capacity). We thereafter assert that the semaphore is in fact zero. However, due to how the bookkeeping works here, the semaphore can in fact become negative (some paths will not actually wait for the semaphore, because this could deadlock). Thus, if, after we grab the semaphore and execution actually returns to us (task schedule), new_buffer via segment::allocate is called (due to a non-fully-full segment), we might in fact grab the segment overhead from zero, resulting in a negative semaphore. The same problem applies later when we try to sanity check the return of our permits. The fix is trivial: just accept less-than-zero values, and take the same possible less-than-zero value into account in the exit check (returning units). Added a whitebox (special callback interface for sync) unit test that provokes/creates the race condition explicitly (and reliably). Closes scylladb/scylladb#27998	2026-01-09 14:06:58 +02:00
Łukasz Paszkowski	7bf26ece4d	test_user_writes_rejection: Fix test flakiness caused by typo and non-local CL=ONE reads The current code: ``` try: cql.execute(f"INSERT INTO {cf} (pk, t) VALUES (-1, 'x')", host=host[0], execution_profile=cl_one_profile).result() except Exception: pass ``` contains a typo: `host=host[0]`, which throws an exception because a Host object is not subscriptable. The test does not fail because the except block is too broad and suppresses all exceptions. Fixing the typo alone is insufficient. The write still succeeds because the remaining nodes are UP and the query uses CL=ONE, so no failure should be expected. Another source of flakiness is data verification: ``` SELECT * FROM {cf} WHERE pk = 0; ``` Even when a coordinator is explicitly provided, using CL=ONE does not guarantee a local read. The coordinator may forward the read request to another replica, causing the verification to fail nondeterministically. This patch rewrites the tests to address these issues: - Fix the typo: `host[0]` to `hosts[0]` - Verify data using `MUTATION_FRAGMENTS({cf})` which guarantees a local read on the coordinator node - Reconnect the driver after node restart Fixes https://github.com/scylladb/scylladb/issues/27933 Closes scylladb/scylladb#27934	2026-01-09 13:42:05 +02:00
Botond Dénes	60570d7114	Merge 'topology coordinator: restrict node join/remove to preserve RF-rack validity' from Michael Litvak Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations: * Altering a keyspace's RF if it would make the keyspace RF-rack-invalid * Adding a node in a new rack * Removing / Decommissioning the last node in a rack Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views. The restrictions are relevant only for keyspaces with numerical RF. Keyspaces with rack-list-based RF are always RF-rack-valid. Fixes scylladb/scylladb#23345 Fixes https://github.com/scylladb/scylladb/issues/26820 backport to relevant versions for materialized views with tablets since it depends on rf-rack validity Closes scylladb/scylladb#26354 * github.com:scylladb/scylladb: docs: update RF-rack restrictions cql3: don't apply RF-rack restrictions on vector indexes cql3: add warning when creating mv/index with tablets about rf-rack service/tablet_allocator: always allow tablet merge of tables with views locator: extend rf-rack validation for rack lists test: test rf-rack validity when creating keyspace during node ops locator: fix rf-rack validation during node join/remove test: test topology restrictions for views with tablets test: add test_topology_ops_with_rf_rack_valid topology coordinator: restrict node join/remove to preserve RF-rack validity topology coordinator: add validation to node remove locator: extend rf-rack validation functions view: change validate_view_keyspace to allow MVs if RF=Racks db: enforce rf-rack-validity for keyspaces with views replica/db: add enforce_rf_rack_validity_for_keyspace helper db: remove enforce parameter from check_rf_rack_validity test: adjust test 
to not break rf-rack validity	2026-01-09 10:01:23 +02:00
Patryk Jędrzejczak	eee2b6c7af	Merge 'tablets: Make balancing disabling RPC preempt tablet transitions' from Tomasz Grabiec Disabling of balancing waits for topology state machine to become idle, to guarantee that no migrations are happening or will happen after the call returns. But it doesn't interrupt the scheduler, which means the call can take arbitrary amount of time. It may wait for tablet repair to be finished, which can take many hours. We should do it via topology request, which will interrupt the tablet scheduler. Enabling of balancing can be immediate. Fixes https://github.com/scylladb/scylladb/issues/27647 Fixes #27210 Closes scylladb/scylladb#27736 * https://github.com/scylladb/scylladb: test: Verify that repair doesn't block disabling of tablet load balancing tablets: Make balancing disabling call preempt tablet transitions	2026-01-08 21:55:19 +02:00
Piotr Dulikowski	8e3e39a64a	Merge 'service/storage_service: update service levels cache after upgrade to v2' from Michał Jadwiszczak Service levels cache is empty after upgrade to consistent topology if no mutations are committed to `system.service_levels_v2` or rolling restart is not done. To fix the bug, this patch adds service levels cache reloading after upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`. Fixes [SCYLLADB-90](https://scylladb.atlassian.net/browse/SCYLLADB-90) This fix should be backported to all versions containing service levels on Raft. [SCYLLADB-90]: https://scylladb.atlassian.net/browse/SCYLLADB-90?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#27585 * github.com:scylladb/scylladb: service/storage_service: update service levels cache after upgrade to v2 service/storage_service: check if service levels were already upgraded before doing migration to raft	2026-01-08 21:55:19 +02:00
Michał Hudobski	e2e479f20d	auth: fix cdc vector search indexing permission bug VECTOR_SEARCH_INDEXING permission didn't work on cdc tables as we mistakenly checked for vector indexes on the cdc table instead of the base table. This patch fixes that and adds a test that validates this behavior. Fixes: VECTOR-476 Closes scylladb/scylladb#28050	2026-01-08 21:55:19 +02:00
Ernest Zaslavsky	19fe630c0e	Update seastar submodule seastar 4dcd4df..dd46b6fe ``` dd46b6fe net: expose DNS TTL via net::hostent b94f81b0 test: Extend statat() test to check ENOENT exception reporting ``` Closes scylladb/scylladb#28006	2026-01-08 21:55:19 +02:00
Michael Litvak	8f15c7a874	db/view/view_update_generator: move discover_staging_sstables to start Call discover_staging_sstables in view_update_generator::start() instead of in the constructor, because the constructor is called during initialization before sstables are loaded. The initialization order was changed in `5d1f74b86a` and caused this regression. It means the view update generator won't discover staging sstables on startup and view updates won't be generated for them. It also causes issues in sstable cleanup. view_update_generator::start() is called in a later stage of the initialization, after sstable loading, so do the discovery of staging sstables there. Fixes scylladb/scylladb#27956 Closes scylladb/scylladb#27970	2026-01-08 21:55:19 +02:00
Botond Dénes	8c72dcc1ec	Merge 'database: truncate_table_on_all_shards: consider can_flush on all shards' from Benny Halevy Currently, database::truncate_table_on_all_shards calls table::can_flush only on the coordinator shard, and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them. This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged). Also, change database_test::do_with_some_data to use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`. Fixes #27639 * The issue has always existed and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions Closes scylladb/scylladb#27643 * github.com:scylladb/scylladb: test: database_test: do_with_some_data: randomize keys database: truncate_table_on_all_shards: drop outdated TODO comment database: truncate_table_on_all_shards: consider can_flush on all shards memtable_list: unify can_flush and may_flush test: database_test: add test_flush_empty_table_waits_on_outstanding_flush replica: table, storage_group, compaction_group: add needs_flush test: database_test: do_with_some_data_in_thread: accept void callback function	2026-01-08 21:55:19 +02:00
Avi Kivity	633e6e0037	build: update toolchain generation procedure for optimized clang Explain where to pick up existing clang archives, and how to upload new ones. Closes scylladb/scylladb#27690	2026-01-08 21:55:18 +02:00
Evgeniy Naydanov	a9da14be19	test: dtest: reproducer for parallel rebuild failure 2-DC cluster parallel non-RBNO rebuild failure when expanding RF in DC2. Steps to reproduce: 1. Provision a cluster with 2 datacenters and at least 2 nodes in the second datacenter. 2. Let’s assume datacenter names are "dc1" and "dc2". 3. Create a keyspace ("keyspace1") with RF=0 in dc2. 4. Populate some data into dc1. 5. Change keyspace1 replication in dc2 to 2. 6. On 2 nodes in dc2 run the following command in parallel: nodetool rebuild --source-dc dc1 Parallel execution of rebuilds is not possible with RBNO enabled. This test is the repro for #27804 Closes scylladb/scylladb#27747	2026-01-08 21:55:18 +02:00
Botond Dénes	9b4a7f1d14	Merge 'test: cluster: object_store: test_backup: modernize do_abort_restore' from Benny Halevy Currently the function uses a regular expression to check the system log for a specific message. This is tangential to the ability to cleanly abort the restore task, plus the regular expression has a syntax error: ``` test/cluster/object_store/test_backup.py:534 /home/bhalevy/dev/scylla/test/cluster/object_store/test_backup.py:534: SyntaxWarning: "\(" is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\("? A raw string is also an option. await wait_for_first_completed([l.wait_for("Failed to handle STREAM_MUTATION_FRAGMENTS \(receive and distribute phase\) for .+: Streaming aborted", timeout=10) for l in logs]) ``` This change modernizes the implementation by: - using auto_dc_rack for manager.servers_add - using new_test_keyspace to generate and auto delete the keyspace - using asyncio.gather and a prepared statement to insert the data - simplifying the keys and values by NOT using os.urandom (that is notoriously slow) - inserting fewer keys in debug mode - removing the log check With that, the test can be reenabled in all modes. * No backport needed since the test was disabled Closes scylladb/scylladb#27892 * github.com:scylladb/scylladb: test_backup: do_abort_restore: reduce data footprint test_backup: do_abort_restore: use error injection test_backup: do_abort_restore: use asyncio for cql test_backup: do_abort_restore: use new_test_keyspace test_backup: do_abort_restore: use logger rather than print test_backup: do_abort_restore: pass auto_rack_dc to servers_add	2026-01-08 21:55:18 +02:00
Anna Stuchlik	f614482e66	doc: add the patch release upgrade procedure for version 2025.4 Adds the patch upgrade guide based on previous upgrade guides. Fixes https://github.com/scylladb/scylladb/issues/27982 Closes scylladb/scylladb#27985	2026-01-08 21:55:18 +02:00
Asias He	4f77dd058d	repair: Add tablet repair progress report support This patch adds tablet repair progress report support so that the user could use the /task_manager/task_status API to query the progress. In order to support this, a new system table is introduced to record the user request related info, i.e, start of the request and end of the request. The progress is accurate when tablet split or merge happens in the middle of the request, since the tokens of the tablet are recorded when the request is started and when repair of each tablet is finished. The original tablet repair is considered as finished when the finished ranges cover the original tablet token ranges. After this patch, the /task_manager/task_status API will report correct progress_total and progress_completed. Fixes #22564 Fixes #26896 Closes scylladb/scylladb#27679	2026-01-08 21:55:18 +02:00
Anna Stuchlik	3f1c7c70f5	doc: remove the link to the Download Center ... from the OS support page. Fixes https://github.com/scylladb/scylladb/issues/28047 Closes scylladb/scylladb#28048	2026-01-08 21:55:18 +02:00
Asias He	0aabf51380	repair: Fix sstable_list_to_mark_as_repaired with multishard writer It was observed: ``` test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to segfault. backtrace pointed to a failure when allocating an object from the chain of freed objects, which indicates memory corruption. (gdb) bt at ./seastar/include/seastar/core/shared_ptr.hh:275 at ./seastar/include/seastar/core/shared_ptr.hh:430 Usual suspect is use-after-free, so ran the reproducer in the sanitize mode, which indicated shared ptr was being copied into another cpu through the multi shard writer: seastar - shared_ptr accessed on non-owner cpu, at: ... -------- seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer... ``` The multishard writer itself was fine; the problem was in the streaming consumer for repair copying a shared ptr. It could work fine with the same smp setting, since there would be only 1 shard in the consumer path, from the rpc handler all the way to the consumer. But with a mixed smp setting, the ptr would be copied into the cpus involved, and since the shared ptr is not cpu safe, the refcount change can go wrong, causing double free or use-after-free. To fix, we pass a generic incremental repair handler to the streaming consumer. The handler is safe to be copied to different shards. It will be a no-op if incremental repair is not enabled or on a different shard. A reproducer test is added. The test could reproduce the crash consistently before the fix and works well after the fix. Fixes #27666 Closes scylladb/scylladb#27870	2026-01-08 21:55:18 +02:00
Radosław Cybulski	5f48ab3875	storage_proxy: fix invalid assert Change invalid `assert(true)` into `SCYLLA_ASSERT(false)`, as the latter was clearly meant. Closes scylladb/scylladb#27900	2026-01-08 21:55:18 +02:00
Andrei Chekun	c950c2e582	test.py: convert skip_mode function to pytest.mark The skip_mode function works only on functions and only in cluster tests. This is OK when we need to skip one test, but it's not possible to use it with pytestmark to automatically mark all tests in the file. The goal of this PR is to migrate skip_mode to be a dynamic pytest.mark that can be used as an ordinary mark. Closes scylladb/scylladb#27853 [avi: apply to test/cluster/test_tablets.py::test_table_creation_wakes_up_balancer]	2026-01-08 21:55:16 +02:00
Tomasz Grabiec	a52de4ecdc	test: cluster: test_topology_ops[_encrypted]: Fix failures due to background migrations fencing out writes The test is flaky, with failures in: for server in servers: > await check_node_log_for_failed_mutations(manager, server) test/cluster/test_topology_ops_encrypted.py:84: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ manager = <test.pylib.manager_client.ManagerClient object at 0xffff602e8590> server = ServerInfo(server_id=1769, ip_addr='127.82.127.43', rpc_address='127.82.127.43', datacenter='DEFAULT_DC', rack='DEFAULT_RACK', pid=186578) async def check_node_log_for_failed_mutations(manager: ManagerClient, server: ServerInfo): logging.info(f"Checking that node {server} had no failed mutations") log = await manager.server_open_log(server.server_id) occurrences = await log.grep(expr="Failed to apply mutation from", filter_expr="(TRACE\|DEBUG\|INFO)") > assert len(occurrences) == 0 E AssertionError test/cluster/util.py:319: AssertionError As diagnosed by Gleb in https://github.com/scylladb/scylladb/issues/27942#issuecomment-3710013625: "The fencing errors here look legit given that we do not wait for all requests to complete while shutting down the storage proxy. The scenario is this: Test does writes to rf=3 keyspace with cl=one. One node is shutting down while there is a tablet migration. Tablet migration executes barrier and drain which fails on a node that is been shutdown. The topology coordinator proceeds fencing the old topology, but there still can be un-handled mutation requests from the shutting down node on other nodes and they will generate fencing errors like they should. 
They way to avoid it (though it is benign) is to wait for all outgoing storage proxy requests to complete during shutdown, but even then the error may still happen since a request may timeout before it is processed by the other side, so it may be completed by a storage proxy coordinator side, but still not handled by replica side. This what we have fencing for in the first place." Fix by disabling background tablet migrations, so that we have no topology barriers concurrent with node shutdown. Fixes #27942 Closes scylladb/scylladb#28034	2026-01-08 21:53:47 +02:00
Tomasz Grabiec	34df158605	test: cluster: Fix NoHostAvailable error in test_not_enough_token_owners The driver must see server_c before we stop server_a, otherwise there will be no live host in the pool when we attempt to drop the keyspace: ``` @pytest.mark.asyncio async def test_not_enough_token_owners(manager: ManagerClient): """ Test that: - the first node in the cluster cannot be a zero-token node - removenode and decommission of the only token owner fail in the presence of zero-token nodes - removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token owners would fall below the RF of some keyspace using tablets """ logging.info('Trying to add a zero-token server as the first server in the cluster') await manager.server_add(config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}, expected_error='Cannot start the first node in the cluster as zero-token') logging.info('Adding the first server') server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"}) logging.info('Adding two zero-token servers') # The second server is needed only to preserve the Raft majority. 
server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0] logging.info(f'Trying to decommission the only token owner {server_a}') await manager.decommission_node(server_a.server_id, expected_error='Cannot decommission the last token-owning node in the cluster') logging.info(f'Stopping {server_a}') await manager.server_stop_gracefully(server_a.server_id) logging.info(f'Trying to remove the only token owner {server_a} by {server_b}') await manager.remove_node(server_b.server_id, server_a.server_id, expected_error='cannot be removed because it is the last token-owning node in the cluster') logging.info(f'Starting {server_a}') await manager.server_start(server_a.server_id) logging.info('Adding a normal server') await manager.server_add(property_file={"dc": "dc1", "rack": "r2"}) cql = manager.get_cql() await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60) > async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name: test/cluster/test_not_enough_token_owners.py:57: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ /usr/lib64/python3.14/contextlib.py:221: in __aexit__ await anext(self.gen) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830> opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }" host = None @asynccontextmanager async def new_test_keyspace(manager: ManagerClient, opts, host=None): """ A utility function for creating a new temporary keyspace with given options. 
It can be used in a "async with", as: async with new_test_keyspace(ManagerClient, '...') as keyspace: """ keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host) try: yield keyspace except: logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation") raise else: > await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host) E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')}) test/cluster/util.py:544: NoHostAvailable ``` Fixes #28011 Closes scylladb/scylladb#28040	2026-01-08 21:53:47 +02:00
Andrei Chekun	ee0bf35615	test.py: add custom exit code for pytest in case maxfail is reached This PR adds a custom exit code for the case when maxfail is reached. This makes it easier to detect why pytest failed in CI. Closes scylladb/scylladb#28018	2026-01-08 21:53:47 +02:00
Anna Stuchlik	1b653166f1	doc: add the version variable to the Web Installer instructions This commit replaces a fixed version name with the variable for the current version in the instructions for installing a non-default version with Web Installer. This will make using the installer more user-friendly. Fixes https://github.com/scylladb/scylladb/issues/28005	2026-01-08 10:12:21 +01:00
Benny Halevy	ebd667a8e0	test: database_test: do_with_some_data: randomize keys With randomized keys, and since we're inserting only 2 keys, it is possible that they would end up owned only by a single shard, reproducing #27639 in snapshot_list_contains_dropped_tables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:46 +02:00
Benny Halevy	93b827c185	database: truncate_table_on_all_shards: drop outdated TODO comment The comment was added in `83323e155e`. Since then, table::seal_active_memtable was improved to guarantee waiting on outstanding flushes on success (see `d55a2ac762`), so we can remove this TODO comment (it is also not covered by any issue, so nobody is planning to work on it). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:46 +02:00
Benny Halevy	2a803d2261	database: truncate_table_on_all_shards: consider can_flush on all shards can_flush might return a different value for each shard, so check it right before deciding whether to flush or clear a memtable shard. Note that under normal conditions can_flush would always return true, now that it checks only the presence of the seal memtable function rather than checking memtable_list::empty(). Fixes #27639 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:46 +02:00
Benny Halevy	02ee341a03	memtable_list: unify can_flush and may_flush Now that we have a unit test proving that it's safe to flush an empty memtable list there is no need to distinguish between may_flush and can_flush. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:46 +02:00
Benny Halevy	0342a24ee0	test: database_test: add test_flush_empty_table_waits_on_outstanding_flush Test that table::flush waits on outstanding flushes, even if the active memtable is empty Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:45 +02:00
Benny Halevy	5be6b80936	replica: table, storage_group, compaction_group: add needs_flush Table needs flush if not all its memtable lists are empty. To be used in the next patch for a unit test. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:41:22 +02:00
Benny Halevy	ec4069246d	test: database_test: do_with_some_data_in_thread: accept void callback function Many test cases already assume `func` is being called in a seastar thread, and although the function they pass returns a (ready) future, it serves no purpose other than to conform to the interface. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:41:22 +02:00
Patryk Jędrzejczak	946a2bb988	storage_service: do not call raft_topology_update_ip for left nodes This `raft_topology_update_ip` call always returns after `t.find(raft_id)` returns `nullptr`, so it effectively does nothing. It's not a bug, since there is no reason to update `system.peers` for left nodes anyway. We delete the rows corresponding to left nodes in `process_left_node` (called just above). Closes scylladb/scylladb#27899	2026-01-07 16:52:13 +01:00
Michał Jadwiszczak	be16e42cb0	service/storage_service: update service levels cache after upgrade to v2 Service levels cache is empty after upgrade to consistent topology if no mutations are committed to `system.service_levels_v2` or a rolling restart is not done. To fix the bug, this commit adds service levels cache reloading after upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`. Fixes SCYLLADB-90	2026-01-07 14:06:13 +01:00
Michał Jadwiszczak	53d0a2b5dc	service/storage_service: check if service levels were already upgraded before doing migration to raft There is no need to call `service_level_controller::upgrade_to_v2()` on every topology state load, we only need to do it once.	2026-01-07 14:06:13 +01:00
Nadav Har'El	f7eae50d98	test/alternator: fix some expected error messages to fit DynamoDB All tests I am fixing in this patch do pass for me on DynamoDB, but other developers report that they fail because some DynamoDB servers apparently use slightly different error messages, with less detail about the cause of an error. For example, some of our tests currently expect an error message that looks like: An error occurred (ValidationException) when calling the Query operation: Invalid operator used in KeyConditionExpression: attribute_exists But some servers don't report the ": attribute_exists" at the end, so we can't use the word "attribute_exists" in the test to recognize the correct error, and need to use a different word (which both versions of DynamoDB and Alternator print). As another example, the good old DynamoDB error: An error occurred (ValidationException) when calling the Query operation: 1 validation error detected: Value 'DOG' at 'conditionalOperator' failed to satisfy constraint: Member must satisfy enum value set: [OR, AND] Got replaced by the following less informative message: An error occurred (ValidationException) when calling the Query operation: Failed to satisfy constraint: Member must satisfy enum value set: [ALL, OR]' So we need to fix the test to allow it too. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-01-07 14:06:33 +02:00
Nadav Har'El	e97fbc2d65	test/alternator: fix compressed request test on non-us-east1 The test test_compressed_request.py::test_compressed_request coerces boto3 to send a compressed request, and wrongly used region_name=us-east-1 to set up the connection. Theoretically, this doesn't matter because we also set the correct URL (for either Alternator or the desired region in AWS). But in fact it does matter, because region name is part of the request's signature, and DynamoDB refuses the request if it comes to a different region than it is signed for. So this test fails when run on DynamoDB on any other region except us-east-1. The fix is simple - don't use the constant "us-east-1", but pick up the correct region name from the original connection. The functions new_dynamodb_session(), new_dynamodb() and new_dynamodb_stremas() had the same bug and we fix it too, but it didn't break any test because the only tests using these functions were Scylla-only so the AWS region problem didn't apply to them.	2026-01-07 13:33:46 +02:00
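The fix described in the commit above (take the region from the original connection instead of hard-coding "us-east-1") can be sketched as follows. This is a minimal illustration, not the actual test code: the helper name `connection_kwargs` is hypothetical, and a stand-in object is used in place of a real boto3 resource so the sketch runs without boto3. It relies only on the `meta.client.meta.region_name` and `meta.client.meta.endpoint_url` attributes that boto3 resources expose.

```python
from types import SimpleNamespace


def connection_kwargs(dynamodb):
    """Return keyword arguments for a second connection that targets the
    same endpoint and region as an existing boto3 DynamoDB resource.

    Hypothetical helper: the region is read from the original connection
    so that the request signature matches the region it is sent to.
    """
    meta = dynamodb.meta.client.meta
    return {"endpoint_url": meta.endpoint_url, "region_name": meta.region_name}


# Stand-in mimicking the attribute layout of a boto3 resource (assumption:
# a real resource would come from boto3.resource('dynamodb', ...)).
fake_dynamodb = SimpleNamespace(
    meta=SimpleNamespace(
        client=SimpleNamespace(
            meta=SimpleNamespace(
                endpoint_url="http://localhost:8000",
                region_name="eu-west-1",
            )
        )
    )
)

kwargs = connection_kwargs(fake_dynamodb)
```

With a real connection, the returned dictionary could be passed on as `boto3.resource('dynamodb', **kwargs)`, possibly together with whatever credentials and client configuration the test needs.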
Patryk Jędrzejczak	f0d159abb0	Merge 'test/raft: use valid sentinel in liveness check to prevent digest errors' from Emil Maskovsky Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants. The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value. By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test. The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry. Fixes: scylladb/scylladb#27307 Backporting to active branches - this is a test-only fix (low risk) for a flaky test that exists in older branches (thus affects the CI of active branches). Closes scylladb/scylladb#28010 * https://github.com/scylladb/scylladb: test/raft: use valid sentinel in liveness check to prevent digest errors test/raft: improve debugging in randomized_nemesis_test	2026-01-07 12:31:21 +01:00
Nadav Har'El	2c02e463ff	test/alternator: fix test's expected error message on DynamoDB The Alternator test test_tag.py::test_tag_lsi_gsi expects to see an error - it's not allowed to set a tag on a GSI or LSI - but the error message that DynamoDB prints recently changed - instead of saying "ResourceArn" the new error message says "resource arn". Change the test to allow both forms, so it will pass on both Alternator (which still uses the word ResourceArn - which is the name of the parameter) and on DynamoDB (which uses "resource arn"). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-01-07 12:51:10 +02:00
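The "allow both forms" check described in the commit above can be sketched with a regular expression. This is a minimal illustration, not the actual test code: the helper name `mentions_resource_arn` and the sample messages used to exercise it are hypothetical.

```python
import re

# Accept either wording: "ResourceArn" (the parameter name, which the commit
# says Alternator uses) or "resource arn" (the newer DynamoDB wording).
RESOURCE_ARN_RE = re.compile(r"ResourceArn|resource arn")


def mentions_resource_arn(message: str) -> bool:
    """Return True if the error message uses either accepted wording."""
    return RESOURCE_ARN_RE.search(message) is not None
```

In a pytest-based test, the same pattern could be given as the `match=` argument of `pytest.raises`, which searches the string form of the raised exception with `re.search`.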
Nadav Har'El	4f3150c282	test/alternator: mark Alternator-only test scylla_only The test test_batch.py::test_batch_write_item_large_broken_connection failed on DynamoDB (Refs #26079). It turns out this test has many problems: 1. This test wrongly assumes a batch write needs to complete in one attempt - and this fails on DynamoDB with low WCU capacity where the batch needs to be resumed in multiple requests. Using boto3's batch_writer() fixes this problem. 2. This test has NOTHING to do with batches - so is mis-named and mis-placed. The batch write is just a way to prepare some data in the table, and the real test is about Query'ing the data back and observing the long response and reproducing issue #14454. I did not rename or move the test, but left a comment explaining the situation. 3. This test is written to assume the Query's response uses HTTP chunked encoding. Which isn't actually true for DynamoDB, at least not at the time of this writing. So the test fails on DynamoDB. For the last reason, I made this test scylla_only. This test can't really be run on DynamoDB without rewriting it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-01-07 12:51:10 +02:00


51364 Commits