scylladb

Author	SHA1	Message	Date
Robert Bindar	c575bbf1e8	test_refresh_deletes_uploaded_sstables should wait for sstables to get deleted SSTable unlinking is async, so in some cases it may happen that the upload dir is not empty immediately after refresh is done. This patch adjusts test_refresh_deletes_uploaded_sstables so it waits with a timeout till the upload dir becomes empty instead of just assuming the API will sync on sstables being gone. Fixes SCYLLADB-1190 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#29215	2026-03-26 08:43:14 +03:00
Marcin Maliszkiewicz	7fdd650009	Merge 'test: audit: clean up test helper class naming' from Dario Mirovic Remove unused `pytest.mark.single_node` marker from `TestCQLAudit`. Rename `TestCQLAudit` to `CQLAuditTester` to reflect that it is a test helper, not a test class. This avoids accidental pytest collection and subsequent warning about `__init__`. Logs before the fixes: ``` test/cluster/test_audit.py:514: 14 warnings /home/dario/dev/scylladb/test/cluster/test_audit.py:514: PytestCollectionWarning: cannot collect test class 'TestCQLAudit' because it has a __init__ constructor (from: cluster/test_audit.py) @pytest.mark.single_node ``` Fixes SCYLLADB-1237 This is an addition to the latest master code. No backport needed. Closes scylladb/scylladb#29237 * github.com:scylladb/scylladb: test: audit: rename TestCQLAudit to CQLAuditTester test: audit: remove unused pytest.mark.single_node	2026-03-25 15:30:16 +01:00
Dario Mirovic	552a2d0995	test: audit: rename TestCQLAudit to CQLAuditTester pytest tries to collect tests for execution in several ways. One is to pick all classes that start with 'Test'. Those classes must not have a custom '__init__' constructor. TestCQLAudit does. After migration from test/cluster/dtest, TestCQLAudit is not a test class anymore, but rather a helper class. There are two ways to fix this: 1. Add __test__ = False to the TestCQLAudit class 2. Rename it to not start with 'Test' Option 2 feels better because the new name itself does not convey the wrong message about its role. Fixes SCYLLADB-1237	2026-03-25 13:21:08 +01:00
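The two options from the commit above can be sketched as follows. This is a minimal illustration, not the code from test/cluster/test_audit.py: the class names `TestLikeHelper` and `CQLAuditTesterSketch` and the `manager` argument are hypothetical. The only pytest behaviour relied on is that classes matching the default `Test*` pattern are candidates for collection, and that setting `__test__ = False` on a class opts it out.

```python
class TestLikeHelper:
    """Option 1: keep the 'Test' prefix but opt out of collection explicitly."""

    # pytest skips classes (and functions) whose __test__ attribute is False.
    __test__ = False

    def __init__(self, manager):
        # A custom constructor is what triggers PytestCollectionWarning
        # when pytest tries to collect a 'Test*' class.
        self.manager = manager


class CQLAuditTesterSketch:
    """Option 2: a name that does not match the default 'Test*' pattern."""

    def __init__(self, manager):
        self.manager = manager
```

Option 2 needs no marker attribute at all, which matches the commit's reasoning that the name itself should say what the class is for.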
Dario Mirovic	73de865ca3	test: audit: remove unused pytest.mark.single_node Remove unused pytest.mark.single_node in TestCQLAudit class. This is a leftover from audit tests migration from test/cluster/dtest to test/cluster. Refs SCYLLADB-1237	2026-03-25 13:18:37 +01:00
Marcin Maliszkiewicz	f988ec18cb	test/lib: fix port in-use detection in start_docker_service Previously, the result of when_all was discarded. when_all stores exceptions in the returned futures rather than throwing, so the outer catch(in_use&) could never trigger. Now we capture the when_all result and inspect each future individually to properly detect in_use from either stream. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1216 Closes scylladb/scylladb#29219	2026-03-25 11:45:53 +02:00
Artsiom Mishuta	cd1679934c	test/pylib: use exponential backoff in wait_for() Change wait_for() defaults from period=1s/no backoff to period=0.1s with 1.5x backoff capped at 1.0s. This catches fast conditions in 100ms instead of 1000ms, benefiting ~100 call sites automatically. Add completion logging with elapsed time and iteration count. Tested locally with test/cluster/test_fencing.py::test_fence_hints (dev mode), log output: wait_for(at_least_one_hint_failed) completed in 0.83s (4 iterations) wait_for(exactly_one_hint_sent) completed in 1.34s (5 iterations) Fixes SCYLLADB-738 Closes scylladb/scylladb#29173	2026-03-24 23:49:49 +02:00
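The polling schedule described above (start at 0.1s, multiply by 1.5, cap at 1.0s) can be sketched roughly as below. This is an assumption-laden illustration, not the test/pylib implementation: the names `backoff_periods` and `wait_for_sketch`, the `timeout` parameter, and the convention that the predicate returns `None` while the condition is not yet met are all hypothetical.

```python
import asyncio
import time


def backoff_periods(period=0.1, factor=1.5, cap=1.0):
    """Yield sleep intervals: start at `period`, grow by `factor`, never exceed `cap`."""
    current = min(period, cap)
    while True:
        yield current
        current = min(current * factor, cap)


async def wait_for_sketch(pred, timeout, period=0.1, factor=1.5, cap=1.0):
    """Poll the async `pred` until it returns a non-None value or `timeout` seconds pass.

    Returns (value, iterations); raises TimeoutError if the time budget runs out.
    """
    start = time.monotonic()
    for iteration, delay in enumerate(backoff_periods(period, factor, cap), start=1):
        value = await pred()
        if value is not None:
            elapsed = time.monotonic() - start
            # Completion logging with elapsed time and iteration count.
            print(f"wait_for({pred.__name__}) completed in {elapsed:.2f}s ({iteration} iterations)")
            return value, iteration
        if time.monotonic() - start + delay > timeout:
            raise TimeoutError(f"wait_for({pred.__name__}) timed out after {timeout}s")
        await asyncio.sleep(delay)
```

With these defaults the sleep intervals are roughly 0.1, 0.15, 0.225, 0.34, 0.51, 0.76 and then 1.0 seconds, so a fast condition is noticed early while a slow one is still polled about once per second.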
Botond Dénes	d52fbf7ada	Merge 'test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces' from Dawid Mędrek The test was flaky. The scenario looked like this: 1. Stop server 1. 2. Set its rf_rack_valid_keyspaces configuration option to true. 3. Create an RF-rack-invalid keyspace. 4. Start server 1 and expect a failure during start-up. It was wrong. We cannot predict when the Raft mutation corresponding to the newly created keyspace will arrive at the node or when it will be processed. If the check of the RF-rack-valid keyspaces we perform at start-up was done before that, it won't include the keyspace. This will lead to a test failure. Unfortunately, it's not feasible to perform a read barrier during start-up. What's more, although it would help the test, it wouldn't be useful otherwise. Because of that, we simply fix the test, at least for now. The new scenario looks like this: 1. Disable the rf_rack_valid_keyspaces configuration option on server 1. 2. Start the server. 3. Create an RF-rack-invalid keyspace. 4. Perform a read barrier on server 1. This will ensure that it has observed all Raft mutations, and we won't run into the same problem. 5. Stop the node. 6. Set its rf_rack_valid_keyspaces configuration option to true. 7. Try to start the node and observe a failure. This will make the test perform consistently. --- I ran the test (in dev mode, on my local machine) three times before these changes, and three times with them. I include the time results below. Before: ``` real 0m47.570s user 0m41.631s sys 0m8.634s real 0m50.495s user 0m42.499s sys 0m8.607s real 0m50.375s user 0m41.832s sys 0m8.789s ``` After: ``` real 0m50.509s user 0m43.535s sys 0m9.715s real 0m50.857s user 0m44.185s sys 0m9.811s real 0m50.873s user 0m44.289s sys 0m9.737s ``` Fixes SCYLLADB-1137 Backport: The test is present on all supported branches, and so we should backport these changes to them. Closes scylladb/scylladb#29218 * github.com:scylladb/scylladb: test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py	2026-03-24 21:09:19 +02:00
Patryk Jędrzejczak	141aa2d696	Merge 'test/cluster/test_incremental_repair.py: fix typo + enable compaction DEBUG logs' from Botond Dénes This PR contains two small improvements to `test_incremental_repair.py` motivated by the sporadic failure of `test_tablet_incremental_repair_and_scrubsstables_abort`. The test fails with `assert 3 == 2` on `len(sst_add)` in the second repair round. The extra SSTable has `repaired_at=0`, meaning scrub unexpectedly produced more unrepaired SSTables than anticipated. Since scrub (and compaction in general) logs at DEBUG level and the test did not enable debug logging, the existing logs do not contain enough information to determine the root cause. Commit 1 fixes a long-standing typo in the helper function name (`preapre` -> `prepare`). Commit 2 enables `compaction=debug` for the Scylla nodes started by `do_tablet_incremental_repair_and_ops`, which covers all `test_tablet_incremental_repair_and_` variants. This will capture full compaction/scrub activity on the next reproduction, making the failure diagnosable. Refs: SCYLLADB-1086 Backport: test improvement, no backport Closes scylladb/scylladb#29175 https://github.com/scylladb/scylladb: test/cluster/test_incremental_repair.py: enable compaction DEBUG logs in do_tablet_incremental_repair_and_ops test/cluster/test_incremental_repair.py: fix typo preapre -> prepare	2026-03-24 16:27:01 +01:00
Ernest Zaslavsky	c670183be8	cmake: fix precompiled header (PCH) creation Two issues prevented the precompiled header from compiling successfully when using CMake directly (rather than the configure.py + ninja build system): a) Propagate build flags to Rust binding targets reusing the PCH. The wasmtime_bindings and inc targets reuse the PCH from scylla-precompiled-header, which is compiled with Seastar's flags (including sanitizer flags in Debug/Sanitize modes). Without matching compile options, the compiler rejects the PCH due to flag mismatch (e.g., -fsanitize=address). Link these targets against Seastar::seastar to inherit the required compile options. Closes scylladb/scylladb#28941	2026-03-24 15:53:40 +02:00
Dawid Mędrek	e639dcda0b	test: cluster: Deflake test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces The test was flaky. The scenario looked like this: 1. Stop server 1. 2. Set its rf_rack_valid_keyspaces configuration option to true. 3. Create an RF-rack-invalid keyspace. 4. Start server 1 and expect a failure during start-up. It was wrong. We cannot predict when the Raft mutation corresponding to the newly created keyspace will arrive at the node or when it will be processed. If the check of the RF-rack-valid keyspaces we perform at start-up was done before that, it won't include the keyspace. This will lead to a test failure. Unfortunately, it's not feasible to perform a read barrier during start-up. What's more, although it would help the test, it wouldn't be useful otherwise. Because of that, we simply fix the test, at least for now. The new scenario looks like this: 1. Disable the rf_rack_valid_keyspaces configuration option on server 1. 2. Start the server. 3. Create an RF-rack-invalid keyspace. 4. Perform a read barrier on server 1. This will ensure that it has observed all Raft mutations, and we won't run into the same problem. 5. Stop the node. 6. Set its rf_rack_valid_keyspaces configuration option to true. 7. Try to start the node and observe a failure. This will make the test perform consistently. --- I ran the test (in dev mode, on my local machine) three times before these changes, and three times with them. I include the time results below. Before: ``` real 0m47.570s user 0m41.631s sys 0m8.634s real 0m50.495s user 0m42.499s sys 0m8.607s real 0m50.375s user 0m41.832s sys 0m8.789s ``` After: ``` real 0m50.509s user 0m43.535s sys 0m9.715s real 0m50.857s user 0m44.185s sys 0m9.811s real 0m50.873s user 0m44.289s sys 0m9.737s ``` Fixes SCYLLADB-1137	2026-03-24 14:27:36 +01:00
Patryk Jędrzejczak	503a6e2d7e	locator: everywhere_replication_strategy: fix sanity_check_read_replicas when read_new is true ERMs created in `calculate_vnode_effective_replication_map` have RF computed based on the old token metadata during a topology change. The reading replicas, however, are computed based on the new token metadata (`target_token_metadata`) when `read_new` is true. That can create a mismatch for EverywhereStrategy during some topology changes - RF can be equal to the number of reading replicas +-1. During bootstrap, this can cause the `everywhere_replication_strategy::sanity_check_read_replicas` check to fail in debug mode. We fix the check in this commit by allowing one more reading replica when `read_new` is true. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1147 Closes scylladb/scylladb#29150	2026-03-24 13:43:39 +01:00
Jenkins Promoter	0f02c0d6fa	Update pgo profiles - x86_64	2026-03-24 14:11:38 +02:00
Dawid Mędrek	4fead4baae	test: cluster: Mark test with @pytest.mark.asyncio in test_multidc.py One of the tests, test_startup_with_keyspaces_violating_rf_rack_valid_keyspaces, didn't have the marker. Let's add it now.	2026-03-24 12:52:00 +01:00
Botond Dénes	ffd58ca1f0	Merge 'test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints' from Dawid Mędrek Before these changes, we would send mutations to the node and immediately query the metrics to see how many hints had been written. However, that could lead to random failures of the test: even if the mutations have finished executing, hints are stored asynchronously, so we don't have a guarantee they have already been processed. To prevent such failures, we rewrite the check: we will perform multiple checks against the metrics until we have confirmed that the hints have indeed been written or we hit the timeout. We're generous with the timeout: we give the test 60 seconds. That should be enough time to avoid flakiness even on super slow machines, and if the test does fail, we will know something is really wrong. As a bonus, we improve the test in general too. We explicitly express the preconditions we rely on, as well as bump the log level. If the test fails in the future, it might be very difficult to debug it without this additional information. Fixes SCYLLADB-1133 Backport: The test is present on all supported branches. To avoid running into more failures, we should backport these changes to them. Closes scylladb/scylladb#29191 * github.com:scylladb/scylladb: test: cluster: Increase log level in test_write_cl_any_to_dead_node_generates_hints test: cluster: Await all mutations concurrently in test_write_cl_any_to_dead_node_generates_hints test: cluster: Specify min_tablet_count in test_write_cl_any_to_dead_node_generates_hints test: cluster: Use new_test_table in test_write_cl_any_to_dead_node_generates_hints test: cluster: Introduce auxiliary function keyspace_has_tablets test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints	2026-03-24 13:39:56 +02:00
Andrei Chekun	f6fd3bbea0	test.py: reduce timeout for one test Reduce the timeout for one test to 60 minutes. The longest test we had so far was ~10-15 minutes. So reducing this timeout is pretty safe and should help with hanging tests. Closes scylladb/scylladb#29212	2026-03-24 12:50:10 +02:00
Marcin Maliszkiewicz	66be0f4577	Merge 'test: cluster: audit test suite optimization' from Dario Mirovic Migrate audit tests from test/cluster/dtest to test/cluster. Optimize their execution time through cluster reuse. The audit test suite is heavy. There are more than 70 test executions. Environment preparation is a significant part of each test case execution time. This PR: 1. Copies audit tests from test/cluster/dtest to test/cluster, refactoring and enabling them 2. Groups tests functions by non-live cluster configuration variations to enable cluster reuse between them - Execution time reduced from 4m 29s to 2m 47s, which is ~38% execution time decrease 3. Removes the old audit tests from test/cluster/dtest Includes two supporting changes: - Allow specifying `AuthProvider` in `ManagerClient.get_cql_exclusive` - Fix server log file handling for clean clusters Refs [SCYLLADB-573](https://scylladb.atlassian.net/browse/SCYLLADB-573) This PR is an improvement and does not require a backport. [SCYLLADB-573]: https://scylladb.atlassian.net/browse/SCYLLADB-573?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#28650 * github.com:scylladb/scylladb: test: cluster: fix log clear race condition in test_audit.py test: pylib: shut down exclusive cql connections in ManagerClient test: cluster: fix multinode audit entry comparison in test_audit.py test: cluster: dtest: remove old audit tests test: cluster: group migrated audit tests for cluster reuse test: cluster: enable migrated audit tests and make them work test: pylib: manager_client: specify AuthProvider in get_cql_exclusive test: pylib: scylla cluster after_test log fix test: audit: copy audit test from dtest	2026-03-24 09:29:52 +01:00
Dario Mirovic	120f381a9d	pgo: fix maintenance socket path too long Maintenance socket path used for PGO is in the node workdir. When the node workdir path is too long, the maintenance socket path (workdir/cql.m) can exceed the Unix domain socket sun_path limit and fail the PGO training pipeline. To prevent this: - pass an explicit --maintenance-socket override pointing to a short deterministic path in /tmp derived from the MD5 hash of the workdir maintenance socket path - update maintenance_socket_path to return the matching short path so that exec_cql.py connects to the right socket The short path socket files are cleaned up after the cluster stops. The path is using MD5 hash of the workdir path, so it is deterministic. Fixes SCYLLADB-1070 Closes scylladb/scylladb#29149	2026-03-24 09:17:10 +01:00
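The idea of deriving a short, deterministic socket path from a hash can be sketched as follows. This is only an illustration under stated assumptions: the helper name `short_maintenance_socket_path`, the `scylla-pgo-` file name prefix, and the choice to hash the full `workdir/cql.m` path are hypothetical, and the 108-byte figure is the Linux size of `sun_path` (including the terminating NUL), which may differ on other systems.

```python
import hashlib
import os

# Linux: sizeof(sockaddr_un.sun_path), including the terminating NUL byte.
SUN_PATH_MAX = 108


def short_maintenance_socket_path(workdir: str) -> str:
    """Map a possibly very long workdir to a short, deterministic socket path in /tmp."""
    long_path = os.path.join(workdir, "cql.m")
    # MD5 is used only to get a fixed-length, reproducible name, not for security.
    digest = hashlib.md5(long_path.encode("utf-8")).hexdigest()
    return os.path.join("/tmp", f"scylla-pgo-{digest}.m")  # hypothetical naming scheme


def fits_in_sun_path(path: str) -> bool:
    """True if the path (plus NUL terminator) fits in sockaddr_un.sun_path on Linux."""
    return len(os.fsencode(path)) < SUN_PATH_MAX
```

Because the result depends only on the workdir, the process that starts the node and the client that connects later can both compute the same path independently.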
Pavel Emelyanov	f112e42ddd	raft: Fix split mutations freeze Commit `faa0ee9844` accidentally broke the way split snapshot mutation was frozen -- instead of appending the sub-mutation `m` the commit kept the old variable name of `mut` which in the new code corresponds to "old" non-split mutation Fixes #29051 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29052	2026-03-24 08:53:50 +02:00
Botond Dénes	56c375b1f3	Merge 'table: don't close a disengaged querier in query()' from Pavel Emelyanov There's a flaw in table::query() -- calling querier_opt->close() can dereference a disengaged std::optional. The fix is pretty simple. Once fixed, there are two if-s checking for querier_opt being engaged or not that are worth being merged. The problem doesn't really show itself because table::query() is not called with null saved_querier, so the de-facto if is always correct. However, better to be on the safe side. The problem doesn't show itself for real, not worth backporting Closes scylladb/scylladb#29142 * github.com:scylladb/scylladb: table: merge adjacent querier_opt checks in query() table: don't close a disengaged querier in query()	2026-03-24 08:47:35 +02:00
Yaniv Kaul	e59a21752d	.github/workflows/trigger_jenkins.yaml: add workflow permissions Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/147. To fix the problem, add an explicit `permissions:` block to the workflow (either at the top level or inside the `trigger-jenkins` job) that constrains the `GITHUB_TOKEN` to the minimal necessary privileges. This codifies least-privilege in the workflow itself instead of relying on repository or organization defaults. The best minimal, non‑breaking change is to define a root‑level `permissions:` block with read‑only contents access because the job does not perform any write operations to the repository, nor does it interact with issues, pull requests, or other GitHub resources. A conservative, widely accepted baseline is `contents: read`. If later steps require more permissions, they can be added explicitly, but for this snippet, no such need is visible. Concretely, in `.github/workflows/trigger_jenkins.yaml`, insert: ```yaml permissions: contents: read ``` between the `name:` block and the `on:` block (e.g., after line 2). No additional methods, imports, or definitions are needed since this is a pure YAML configuration change and does not alter runtime behavior of the existing shell steps. Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27815	2026-03-24 08:40:30 +02:00
Yaniv Kaul	85a531819b	.github/workflows/trigger-scylla-ci.yaml: add permissions to workflow Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/169. In general, the fix is to add an explicit `permissions:` block to the workflow (at the root level or per job) so that the `GITHUB_TOKEN` has only the minimal scopes needed. Since this job only reads event data and uses secrets to talk to Jenkins, we can restrict `GITHUB_TOKEN` to read‑only repository contents. The single best fix here is to add a top‑level `permissions:` block right under the `name:` (and before `on:`) in `.github/workflows/trigger-scylla-ci.yaml`, setting `contents: read`. This applies to all jobs in the workflow, including `trigger-jenkins`, and does not alter any existing steps or logic. No additional imports or methods are needed, as this is purely a YAML configuration change for GitHub Actions. Concretely, edit `.github/workflows/trigger-scylla-ci.yaml` to insert: ```yaml permissions: contents: read ``` after line 1. No other lines in the file need to change. Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27812	2026-03-24 08:37:49 +02:00
Dawid Mędrek	148217bed6	test: cluster: Increase log level in test_write_cl_any_to_dead_node_generates_hints We increase the log level of `hints_manager` to TRACE in the test. If it fails, it may be incredibly difficult to debug it without any additional information.	2026-03-23 19:19:17 +01:00
Dawid Mędrek	2b472fe7fd	test: cluster: Await all mutations concurrently in test_write_cl_any_to_dead_node_generates_hints	2026-03-23 19:19:17 +01:00
Dawid Mędrek	ae12c712ce	test: cluster: Specify min_tablet_count in test_write_cl_any_to_dead_node_generates_hints The test relies on the assumption that mutations will be distributed more or less uniformly over the nodes. Although in practice this should not happen, it's theoretically possible that there's only one tablet allocated for the table. To clearly indicate this precondition, we explicitly set the property `min_tablet_count` when creating the table. This way, we have a guarantee that the table has multiple tablets. The load balancer should now take care of distributing them over the nodes equally. Thanks to that, `servers[1]` will have some tablets, and so it'll be the target for some of the mutations we perform.	2026-03-23 19:19:14 +01:00
Dawid Mędrek	dd446aa442	test: cluster: Use new_test_table in test_write_cl_any_to_dead_node_generates_hints The context manager is the de-facto standard in the test suite. It will also allow for a prettier way to conditionally enable per-table tablet options in the following commit.	2026-03-23 19:07:01 +01:00
Dawid Mędrek	dea79b09a9	test: cluster: Introduce auxiliary function keyspace_has_tablets The function is adapted from its counterpart in the cqlpy test suite: cqlpy/util.py::keyspace_has_tablets. We will use it in a commit in this series to conditionally set tablet properties when creating a table. It might also be useful in general.	2026-03-23 19:07:01 +01:00
Dawid Mędrek	3d04fd1d13	test: cluster: Deflake test_write_cl_any_to_dead_node_generates_hints Before these changes, we would send mutations to the node and immediately query the metrics to see how many hints had been written. However, that could lead to random failures of the test: even if the mutations have finished executing, hints are stored asynchronously, so we don't have a guarantee they have already been processed. To prevent such failures, we rewrite the check: we will perform multiple checks against the metrics until we have confirmed that the hints have indeed been written or we hit the timeout. We're generous with the timeout: we give the test 60 seconds. That should be enough time to avoid flakiness even on super slow machines, and if the test does fail, we will know something is really wrong. Fixes SCYLLADB-1133	2026-03-23 19:06:57 +01:00
Botond Dénes	772b32d9f7	test/scylla_gdb: fix flakiness by preparing objects at test time Fixtures previously ran GDB once (module scope) to find live objects (sstables, tasks, schemas) and stored their addresses. Tests then reused those addresses in separate GDB invocations. Sometimes these addresses would become stale and the test would step on use-after-free (e.g. sstables compacted away between invocations). Fix by dropping the fixtures. The helper functions used by the fixtures to obtain the required objects are converted to gdb convenience functions, which can be used in the same expression as the test command invocation. Thus, the object is acquired on-demand at the moment it is used, so it is guaranteed to be fresh and relevant. Fixes: SCYLLADB-1020 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#28999	2026-03-23 16:54:03 +02:00
Piotr Dulikowski	60fb5270a9	logstor: fix fmt::format use with std::filesystem::path The version of fmt installed on my machine refuses to work with `std::filesystem::path` directly. Add `.string()` calls in places that attempt to print paths directly in order to make them work. Closes scylladb/scylladb#29148	2026-03-23 15:15:52 +01:00
Pavel Emelyanov	3b9398dfc8	Merge 'encryption: fix deadlock in encrypted_data_source::get()' from Ernest Zaslavsky When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS. In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call. Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely. A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128 Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side Closes scylladb/scylladb#29110 * github.com:scylladb/scylladb: encryption: fix deadlock in encrypted_data_source::get() test_lib: mark `limiting_data_source_impl` as not `final` Fix formatting after previous patch Fix indentation after previous patch test_lib: make limiting_data_source_impl available to tests	2026-03-23 17:12:44 +03:00
Botond Dénes	f5438e0587	test/cluster/test_incremental_repair.py: enable compaction DEBUG logs in do_tablet_incremental_repair_and_ops The test sporadically fails because scrub produces an unexpected number of SSTables. Compaction logs are needed to diagnose why, but were not captured since scrub runs at DEBUG level. Enable compaction=debug for the servers started by do_tablet_incremental_repair_and_ops so the next reproduction provides enough information to root-cause the issue. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-23 15:48:26 +02:00
Botond Dénes	f6ab576ed9	test/cluster/test_incremental_repair.py: fix typo preapre -> prepare Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-23 15:48:12 +02:00
Piotr Dulikowski	df68d0c0f7	directories: add missing seastar/util/closeable.hh include Without this include the file would not compile on its own. The issue was most likely masked by the use of precompiled headers in our CI. Closes scylladb/scylladb#29170	2026-03-23 15:46:56 +03:00
Yaniv Michael Kaul	051107f5bc	scylla-gdb: fix sstable-summary crash on ms-format sstables The 'scylla sstable-summary' GDB command crashes with 'ValueError: Argument "count" should be greater than zero' when inspecting ms-format (trie-based) sstables. This happens because ms-format sstables don't populate the traditional summary structure, leaving all fields zeroed out, which causes gdb.read_memory() to be called with a zero count. Fix by: - Adding zero-length guards to sstring.to_hex() and sstring.as_bytes() to return early when the data length is zero, consistent with the existing guard in managed_bytes.get(). - Adding the same guard to scylla_sstable_summary.to_hex(). - Detecting ms-format sstables (version == 5) early in scylla_sstable_summary.invoke() and printing an informative message instead of attempting to read the unpopulated summary. Fixes: SCYLLADB-1180 Closes scylladb/scylladb#29162	2026-03-23 12:44:47 +02:00
Piotr Szymaniak	c8e7e20c5c	test/cluster: retry create_table on transient schema agreement timeout In test_index_requires_rf_rack_valid_keyspace, the create_table call for a plain tablet-based table can fail with 'Unable to reach schema agreement' after the server's 10s timeout is exceeded. This happens when schema gossip propagation across the 4-node cluster takes longer than expected after a sequence of rapid schema changes earlier in the test. Add a retry (up to 2 attempts) on schema agreement errors for this specific create_table call rather than increasing the server-side timeout. Fixes: SCYLLADB-1135 Closes scylladb/scylladb#29132	2026-03-23 10:45:30 +02:00
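A bounded retry of the kind described above might look like the sketch below. It is a generic illustration, not the test's actual code: the helper name `retry_on_schema_agreement`, the `operation` callable, and matching on the error message text are assumptions; only the message 'Unable to reach schema agreement' and the limit of 2 attempts come from the commit.

```python
import asyncio

SCHEMA_AGREEMENT_ERROR = "Unable to reach schema agreement"


async def retry_on_schema_agreement(operation, attempts=2):
    """Await `operation()` and retry only on schema agreement errors.

    Any other exception, or a schema agreement error on the last attempt,
    is re-raised unchanged so real failures are not hidden.
    """
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except Exception as exc:
            transient = SCHEMA_AGREEMENT_ERROR in str(exc)
            if not transient or attempt == attempts:
                raise
```

Keeping the retry scoped to one call and one error message leaves the server-side timeout untouched, in line with the commit's stated choice.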
Yaniv Kaul	fb1f995d6b	.github/workflows/backport-pr-fixes-validation.yaml: workflow does not contain permissions (Potential fix for code scanning alert no. 139) Potential fix for https://github.com/scylladb/scylladb/security/code-scanning/139, To fix the problem, explicitly restrict the `GITHUB_TOKEN` permissions for this workflow/job so it has only what is needed. The script reads PR data and repository info (which is covered by `contents: read`/default read scopes) and posts a comment via `github.rest.issues.createComment`, which requires `issues: write`. No other write scopes (e.g., `contents: write`, `pull-requests: write`) are necessary. The best fix without changing functionality is to add a `permissions` block scoped to this job (or at the workflow root). Since we only see a single job here, we’ll add it under `check-fixes-prefix`. Concretely, in `.github/workflows/backport-pr-fixes-validation.yaml`, between the `runs-on: ubuntu-latest` line (line 10) and `steps:` (line 11), add: ```yaml permissions: contents: read issues: write ``` This keeps the token minimally privileged while still allowing the script to create issue/PR comments. Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27810	2026-03-23 10:30:01 +02:00
Piotr Smaron	32225797cd	dtest: fix flaky test_writes_schema_recreated_while_node_down `read_barrier(session2)` was supposed to ensure `node2` has caught up on schema before a CL=ALL write. But `patient_cql_connection(node2)` creates a cluster-aware driver session `(TokenAwarePolicy(DCAwareRoundRobinPolicy()))` that can route the barrier CQL statement to any node — not necessarily `node2`. If the barrier runs on `node1` or `node3` (which already have the new schema), it's a no-op, and `node2` remains stale, thus the observed `WriteFailure`. The fix is to switch to `patient_exclusive_cql_connection(node2)`, which uses `WhiteListRoundRobinPolicy([node2_ip])` to pin all CQL to `node2`. This is already the established pattern used by other tests in the same file. Fixes: SCYLLADB-1139 No need to backport yet, appeared only on master. Closes scylladb/scylladb#29151	2026-03-23 10:25:54 +02:00
Michał Chojnowski	f29525f3a6	test/boost/cache_algorithm_test: disable sstable compression to avoid giant index pages The test intentionally creates huge index pages. But since `5e7fb08bf3`, the index reader allocates a block of memory for a whole index page, instead of incrementally allocating small pieces during index parsing. This giant allocation causes the test to fail spuriously in CI sometimes. Fix this by disabling sstable compression on the test table, which puts a hard cap of 2000 keys per index page. Fixes: SCYLLADB-1152 Closes scylladb/scylladb#29152	2026-03-23 09:57:11 +02:00
Raphael S. Carvalho	05b11a3b82	sstables_loader: use new sstable add path Use add_new_sstable_and_update_cache() when attaching SSTables downloaded by the node-scoped local loader. This is the correct variant for new SSTables: it can unlink the SSTable on failure to add it, and it can split the SSTable if a tablet split is in progress. The older add_sstable_and_update_cache() helper is intended for preexisting SSTables that are already stable on disk. Additionally, downloaded SSTables are now left unsealed (TemporaryTOC) until they are successfully added to the table's SSTable set. The download path (download_fully_contained_sstables) passes leave_unsealed=true to create_stream_sink, and attach_sstable opens the SSTable with unsealed_sstable=true and seals it only inside the on_add callback — matching the pattern used by stream_blob.cc and storage_service.cc for tablet streaming. This prevents a data-resurrection hazard: previously, if the process crashed between download and attach_sstable, or if attach_sstable failed mid-loop, sealed (TOC) SSTables would remain in the table directory and be reloaded by distributed_loader on restart. With TemporaryTOC, sstable_directory automatically cleans them up on restart instead. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1085. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#29072	2026-03-23 10:33:04 +03:00
Piotr Szymaniak	f511264831	alternator/test: fix test_ttl_with_load_and_decommission flaky Connection refused error The native Scylla nodetool reports ECONNREFUSED as 'Connection refused', not as 'ConnectException' (which is the Java nodetool format). Add 'Connection refused' to the valid_errors list so that transient connection failures during concurrent decommission/bootstrap topology changes are properly tolerated. Fixes SCYLLADB-1167 Closes scylladb/scylladb#29156	2026-03-22 11:01:45 +02:00
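A minimal sketch of the tolerance check described in the commit above, assuming the test simply looks for known substrings in the nodetool error output. The names `VALID_ERRORS` and `is_tolerated` are hypothetical and are not taken from the test file; only the two error strings come from the commit message.

```python
# Hypothetical helper: decide whether a nodetool failure is a transient
# connection error that the test may ignore during topology changes.
VALID_ERRORS = (
    "ConnectException",     # format reported by the Java nodetool
    "Connection refused",   # format reported by the native Scylla nodetool
)


def is_tolerated(stderr: str, valid_errors=VALID_ERRORS) -> bool:
    """Return True if stderr contains any of the known transient errors."""
    return any(err in stderr for err in valid_errors)
```

Any other error text still fails the test, so the list stays an explicit allow-list rather than a blanket retry.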
Pavel Emelyanov	7dce43363e	table: merge adjacent querier_opt checks in query() After the previous fix both guarding if-s start with 'if (querier_opt &&'. Merge them into a single outer 'if (querier_opt)' block to avoid the redundant check and make the structure easier to follow. No functional change. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 14:48:08 +03:00
Piotr Dulikowski	cc695bc3f7	Merge 'vector_search: fix race condition on connection timeout' from Karol Nowacki When a `with_connect` operation timed out, the underlying connection attempt continued to run in the reactor. This could lead to a crash if the connection was established/rejected after the client object had already been destroyed. This issue was observed during the teardown phase of an upcoming high-availability test case. This commit fixes the race condition by ensuring the connection attempt is properly canceled on timeout. Additionally, the explicit TLS handshake previously forced during the connection is now deferred to the first I/O operation, which is the default and preferred behavior. Fixes: SCYLLADB-832 Backports to 2026.1 and 2025.4 are required, as this issue also exists on those branches and is causing CI flakiness. Closes scylladb/scylladb#29031 * github.com:scylladb/scylladb: vector_search: test: fix flaky test vector_search: fix race condition on connection timeout	2026-03-20 11:12:04 +01:00
Petr Gusev	4bfcd035ae	test_fencing: add missing await-s Fixes SCYLLADB-1099 Closes scylladb/scylladb#29133	2026-03-20 10:55:35 +01:00
Pavel Emelyanov	9c1c41df03	table: don't close a disengaged querier in query() The condition guarding querier_opt->close() was: When saved_querier is null the short-circuit makes the whole condition true regardless of whether querier_opt is engaged. If partition_ranges is empty, query_state::done() is true before the while-loop body ever runs, so querier_opt is never created. Calling querier_opt->close() then dereferences a disengaged std::optional — undefined behaviour. Fix by checking querier_opt first: This preserves all existing semantics (close when not saving, or when saving wouldn't be useful) while making the no-querier path safe. Why this doesn't surface today: the sole production call site, database::query(), in practice. The API header documents nullptr as valid ("Pass nullptr when queriers are not saved"), so the bug is real but latent. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 12:25:13 +03:00
Pavel Emelyanov	c4a0f6f2e6	object_store: Don't leave dangling objects by iterating moved-from names vector The code in upload_file std::move()-s vector of names into merge_objects() method, then iterates over this vector to delete objects. The iteration is apparently a no-op on moved-from vector. The fix is to make merge_objects() helper get vector of names by const reference -- the method doesn't modify the names collection, the caller keeps one in stable storage. Fixes #29060 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29061	2026-03-20 10:09:30 +02:00
Pavel Emelyanov	712ba5a31f	utils: Use yielding directory_lister in owner verification Switch directories::do_verify_owner_and_mode() from lister::scan_dir() to utils::directory_lister while preserving the previous hidden-entry behavior. Make do_verify_subpath use lister::filter_type directly so the verification helper can pass it straight into directory_lister, and keep a single yielding iteration loop for directory traversal. One fewer scan_dir user, a step towards removing scan_dir from the code. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29064	2026-03-20 10:08:38 +02:00
Pavel Emelyanov	961fc9e041	s3: Don't rearm credential timers when credentials are not refreshed The update_credentials_and_rearm() may get "empty" credentials from _creds_provider_chain.get_aws_credentials() -- it doesn't throw, but returns default-initialized value. In that case the expires_at will be set to time_point::min, and it's probably not a good idea to arm the refresh timer and, even worse idea, to subtract 1h from it. Fixes #29056 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29057	2026-03-20 10:07:01 +02:00
Pavel Emelyanov	0a8dc4532b	s3: Fix missing upload ID in copy_part trace log The format string had two {} placeholders but three arguments, the _upload_id one is skipped from formatting Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29053	2026-03-20 10:05:44 +02:00
Botond Dénes	bb5c328a16	Merge 'Squash two primary-replica restoration tests together' from Pavel Emelyanov The test_restore_primary_replica_same_domain and test_restore_primary_replica_different_domain tests have very much in common. Previously both tests were also split into two each, so there were four tests; now there are two, which can be squashed as well. The lines-of-code savings are still worth it. This is the continuation of #28569 Tests improvement, not backporting Closes scylladb/scylladb#28994 * github.com:scylladb/scylladb: test: Replace a bunch of ternary operators with an if-else block test: Squash test_restore_primary_replica_same\|different_domain tests test: Use the same regexp in test_restore_primary_replica_different\|same_domain-s	2026-03-20 10:05:16 +02:00
Pavel Emelyanov	ea2a214959	test/backup: Use unique_name() for backup prefix instead of cf_dir The do_test_backup_abort() fetched the node's workdir and resolved cf_dir solely to construct a unique-ish backup prefix: prefix = f'{cf_dir}/backup' The comment already acknowledged this was only "unique(ish)" — relying on the UUID-derived cf_dir name as a uniqueness source is roundabout. unique_name() is already imported and used for exactly this purpose elsewhere in the file. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29030	2026-03-20 10:04:22 +02:00
Pavel Emelyanov	65032877d4	api: Move /storage_service/toppartitions from storage_service.cc to column_family.cc The endpoint URL remains intact. Having it next to another toppartitions endpoint (the /column_family/toppartitions one) is natural. This endpoint only needs sharded<replica::database>&, grabs it from http_context and doesn't use any other service. In column_family.cc the database reference is already available as a parameter. One more user of http_context.db is gone. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#28996	2026-03-20 09:52:33 +02:00
Botond Dénes	de0bdf1a65	Merge 'Decouple test_refresh_deletes_uploaded_sstables from backup test-suite' from Pavel Emelyanov The test in question uses several helpers from the backup suite, but it doesn't really need them -- the operations it wants to perform can be performed with standard pylib methods. "While at it" also collect some dangling effectively unused local variables from this test (these were apparently left from backup tests this one was copied-and-reworked from) Enhancing tests, not backporting Closes scylladb/scylladb#29130 * github.com:scylladb/scylladb: test/refresh: Simplify refresh invocation test/refresh: Remove r_servers alias for servers test/refresh: Replace check_mutation_replicas with a plain CQL SELECT test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests test/refresh: Remove unused wait_for_cql_and_get_hosts import	2026-03-20 09:29:15 +02:00
Botond Dénes	97430e2df5	Merge 'Fix object storage lister entries walking loop' from Pavel Emelyanov Two issues found in the lister returned by gs_client_wrapper::make_object_lister() Lister can report EOF too early in case filter is active, another one is potential vector out-of-bounds access Fixes #29058 The code appeared in 2026.1, worth fixing it there as well Closes scylladb/scylladb#29059 * github.com:scylladb/scylladb: sstables: Fix object storage lister not resetting position in batch vector sstables: Fix object storage lister skipping entries when filter is active	2026-03-20 09:12:42 +02:00
Botond Dénes	5573c3b18e	Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec When it deadlocks, groups stop merging and compaction group merge backlog will run-away. Also, graceful shutdown will be blocked on it. Found by flaky unit test test_merge_chooses_best_replica_with_odd_count, which timed-out in 1 in 100 runs. Reason for deadlock: When storage groups are merged, the main compaction group of the new storage group takes a compaction lock, which is appended to _compaction_reenablers_for_merging, and released when the merge completion fiber is done with the whole batch. If we accumulate more than 1 merge cycle for the fiber, deadlock occurs. Lock order will be this Initial state: cg0: main cg1: main cg2: main cg3: main After 1st merge: cg0': main [locked], merging_groups=[cg0.main, cg1.main] cg1': main [locked], merging_groups=[cg2.main, cg3.main] After 2nd merge: cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main] merge completion fiber will try to stop cg0'.main, which will be blocked on compaction lock. which is held by the reenabler in _compaction_reenablers_for_merging, hence deadlock. The fix is to wait for background merge to finish before we start the next merge. It's achieved by holding old erm in the background merge, and doing a topology barrier from the merge finalizing transition. Background merge is supposed to be a relatively quick operation, it's stopping compaction groups. So may wait for active requests. It shouldn't prolong the barrier indefinitely. Tablet tests which trigger merge need to be adjusted to call the barrier, otherwise they will be vulnerable to the deadlock. Fixes SCYLLADB-928 Backport to >= 2025.4 because it's the earliest vulnerable due to `f9021777d8`. 
Closes scylladb/scylladb#29007 * github.com:scylladb/scylladb: tablets: Fix deadlock in background storage group merge fiber replica: table: Propagate old erm to storage group merge test: boost: tablets_test: Save tablet metadata when ACKing split resize decision storage_service: Extract local_topology_barrier()	2026-03-20 09:05:52 +02:00
Botond Dénes	34473302b0	Merge 'docs: document existing guardrails' from Andrzej Jackowski This patch series introduces new documentation for existing guardrails. Moreover: - Warning / failure messages of recently added write CL guardrails (SCYLLADB-259) are rephrased, so all guardrails have similar messages. - Some new tests are added, to help verify the correctness of the documentation and avoid situations where the documentation and implementation diverge. Fixes: [SCYLLADB-257](https://scylladb.atlassian.net/browse/SCYLLADB-257) No backport, just new docs and tests. [SCYLLADB-257]: https://scylladb.atlassian.net/browse/SCYLLADB-257?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#29011 * github.com:scylladb/scylladb: test: add new guardrail tests matching documentation scenarios test: add metric assertions to guardrail replication strategy tests test: use regex matching in guardrail replication strategy tests test: extract ks_opts helper in test_guardrail_replication_strategy docs: document CQL guardrails cql: improve write consistency level guardrail messages	2026-03-20 08:56:00 +02:00
artem.penner	9898e5700b	scylla-node-exporter: Add systemd collector to node exporter This PR enables the node_exporter systemd collector and configures the unit whitelist to include scylla-server.service and systemd-coredump services. Motivation: We currently lack visibility into system-level service states, which is critical for diagnosing stability issues. This configuration enables two specific use cases: - Detecting Coredump Loops: We encounter scenarios where ScyllaDB enters a restart loop. To pinpoint SIGSEGV (coredumps) as the root cause, we need to track when the systemd-coredump service becomes active, indicating a dump is being processed. - Identifying Startup Failures: We need to detect when the scylla-server unit enters a failed state. This is essential for catching unrecoverable errors (e.g., corrupted commitlogs or configuration bugs) that prevent the server from starting. example of promql queries: - `node_systemd_unit_state{name=~"systemd-coredump@.*", state="active"} == 1` - `node_systemd_unit_state{name="scylla-server.service", state="failed"} == 1` Closes #28402	2026-03-20 08:39:56 +02:00
Andrzej Jackowski	10c4b9b5b0	test: verify signal() detects resource negative leak in rcs reader_concurrency_semaphore::signal() guards against available resources exceeding the initial limit after a signal, which would indicate a bug such as double-returning resources. It reports the issue via on_internal_error_noexcept and clamps resources back to the initial values. However, before this commit there were no tests that verified this behavior, so bugs like SCYLLADB-1014 went undetected. Add a test that artificially signals resources that were never consumed and verifies that signal() detects the negative leak and clamps available resources back to the initial limit. Refs: SCYLLADB-1014 Fixes: SCYLLADB-1031 Closes scylladb/scylladb#28993	2026-03-20 09:21:20 +03:00
Botond Dénes	f9adbc7548	test/cqlpy/test_tombstone_limit.py: disable tombstone-gc for test table Since `7564a56dc8`, all tables default to repair-mode tombstone-gc, which is identical to immediate-mode for RF=1 tables. Consequently the tombstones written by the tests in this test file are immediately collectible and with some unlucky timing, some of them can be collected before the end of the test, failing the empty-page prefix check because the empty pages prefix will be smaller than expected based on the number of tombstones written. Disable tombstone-gc to remove this source of flakiness. Fixes: SCYLLADB-1062 Closes scylladb/scylladb#29077	2026-03-20 09:14:29 +03:00
Michał Chojnowski	6b18d95dec	test: add a missing reconnect_driver in test_sstable_compression_dictionaries_upgrade.py Need to work around https://github.com/scylladb/python-driver/issues/295, lest a CQL query fail spuriously after the cluster restart. Fixes: SCYLLADB-1114 Closes scylladb/scylladb#29118	2026-03-20 09:05:14 +03:00
Botond Dénes	89388510a0	test/cluster/test_data_resurrection_in_memtable.py: use explicit CL The test has expectations about which write makes it to which nodes: * inserts make it to all nodes * delete makes it to all-1 (QUORUM) nodes However, this was not expressed with CL, and the default CL=ONE allowed some nodes to miss the writes, thus violating the test's expectations about what data is present on which nodes. This resulted in the test being flaky and failing on the data checks. Use explicit CL for the ingestion to prevent this. The improvements to the test introduced in `a8dd13731f` were of great help in investigating this: traces are now available and the check happens after the data was dumped to logs. Fixes: SCYLLADB-870 Fixes: SCYLLADB-812 Fixes: SCYLLADB-1102 Closes scylladb/scylladb#29128	2026-03-20 09:02:57 +03:00
Avi Kivity	6b259babeb	Merge 'logstor: initial log-structured storage for key-value tables' from Michael Litvak Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables. Main flows and components: * The storage is composed of 32MB files, each file divided to segments of size 128k. We write to them sequentially records that contain a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k sized blocks. * The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable. * On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO. * On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record. * We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage. * The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments. * Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group. Currently this mode is experimental and requires an experimental flag to be enabled. Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl. 
to use, add to config: ``` enable_logstor: true experimental_features: - logstor ``` create a table: ``` CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor'; ``` INSERT, SELECT, DELETE work as expected UPDATE not supported yet no backport - new feature Closes scylladb/scylladb#28706 * github.com:scylladb/scylladb: logstor: trigger separator flush for buffers that hold old segments docs/dev: add logstor documentation logstor: recover segments into compaction groups logstor: range read logstor: change index to btree by token per table logstor: move segments to replica::compaction_group db: update dirty mem limits dynamically logstor: track memory usage logstor: logstor stats api logstor: compaction buffer pool logstor: separator: flush buffer when full logstor: hold segment until index updates logstor: truncate table logstor: enable/disable compaction per table logstor: separator buffer pool test: logstor: add separator and compaction tests logstor: segment and separator barrier logstor: separator debt controller logstor: compaction controller logstor: recovery: recover mixed segments using separator logstor: wait for pending reads in compaction logstor: separator logstor: compaction groups logstor: cache files for read logstor: recovery: initial logstor: add segment generation logstor: reserve segments for compaction logstor: index: buckets logstor: add buffer header logstor: add group_id logstor: record generation logstor: generation utility logstor: use RIPEMD-160 for index key test: add test_logstor.py api: add logstor compaction trigger endpoint replica: add logstor to db schema: add logstor cf property logstor: initial commit db: disable tablet balancing with logstor db: add logstor experimental feature flag	2026-03-20 00:18:09 +02:00
Avi Kivity	062751fcec	Merge 'db/config: enable ms sstable format by default' from Łukasz Paszkowski Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes. Make the new format a new default for new clusters by naming ms in the default scylla.yaml. New functionality. No backport needed. This PR is basically Michał's one https://github.com/scylladb/scylladb/pull/26377, Jakub's https://github.com/scylladb/scylladb/pull/27332 fixing `sstables_manager::get_highest_supported_format()` and one test fix. Closes scylladb/scylladb#28960 * github.com:scylladb/scylladb: db/config: announce ms format as highest supported db/config: enable `ms` sstable format by default cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format api/system: add /system/chosen_sstable_version test/cluster/dtest: reduce num_tokens to 16	2026-03-19 18:19:01 +02:00
Pavel Emelyanov	969dddb630	test/refresh: Simplify refresh invocation take_snapshot return values were unused so drop them. do_refresh was a thin wrapper around load_new_sstables that added no logic; inline it directly into the gather expression. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:57 +03:00
Pavel Emelyanov	de21572b31	test/refresh: Remove r_servers alias for servers r_servers = servers was a no-op assignment; use servers directly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:52 +03:00
Pavel Emelyanov	20b1531e6d	test/refresh: Replace check_mutation_replicas with a plain CQL SELECT The goal of test_refresh_deletes_uploaded_sstables is to verify that sstables are removed from the upload directory after refresh. The replica check was just a sanity guard; a simple SELECT of all keys is sufficient and much lighter. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-19 18:42:48 +03:00
Pavel Emelyanov	c591b9ebe2	test/refresh: Inline keyspace/table/data setup in test_refresh_deletes_uploaded_sstables Replace create_dataset() with explicit keyspace creation via new_test_keyspace, inline CREATE TABLE, and direct cql.run_async inserts — matching the pattern used in do_test_streaming_scopes. This removes the last dependency on backup helpers for dataset setup and makes the test self-contained. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:44 +03:00
Pavel Emelyanov	06006a6328	test/refresh: Prepare indentation for new_test_keyspace in test_refresh_deletes_uploaded_sstables Wrap the test body under if True: to pre-indent it, making the subsequent patch that introduces new_test_keyspace a pure content change with no whitespace noise. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:40 +03:00
Pavel Emelyanov	67d8cde42d	test/refresh: Decouple test_refresh_deletes_uploaded_sstables from backup tests Replace create_cluster() from object_store/test_backup.py with a plain manager.servers_add(2) call. The test does not use object storage, so there is no need to pull in the backup helper along with its config and logging knobs. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:36 +03:00
Pavel Emelyanov	04f046d2d8	test/refresh: Remove unused wait_for_cql_and_get_hosts import Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-19 18:42:32 +03:00
Botond Dénes	e8b37d1a89	Merge 'doc: fix the installation section' from Anna Stuchlik This PR fixes the Installation page: - Replaces `http` with `https` in the download command. - Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before). Fixes https://github.com/scylladb/scylladb/issues/29087 This update affects all supported versions and should be backported as a bug fix. Closes scylladb/scylladb#29088 * github.com:scylladb/scylladb: doc: remove the Open Source Example from Installation doc: replace http with https in the installation instructions	2026-03-19 17:13:53 +02:00
Dario Mirovic	d2c44722e1	test: cluster: fix log clear race condition in test_audit.py assert_entries_were_added: - takes a "before" snapshot of the audit log - yields to execute a statement - takes an "after" snapshot of the audit log - computes new rows by diffing "after" minus "before" If an audit entry generated by prepare() arrives between the snapshot and the diff, it inflates the new row count and the test fails with assert 2 <= 1. Fix by: - Adding clear_audit_logs() at the end of prepare(), after all setup - Waiting for the "completed re-reading configuration file" log message after server_update_config - Draining pending syslog lines before clearing the buffer Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
Dario Mirovic	821f8696a7	test: pylib: shut down exclusive cql connections in ManagerClient get_cql_exclusive() creates a Cluster object per call, but never records it. driver_close() cannot shut it down. The cluster's internal scheduler thread then tries to submit work to an already shut down executor. This causes RuntimeError: RuntimeError: cannot schedule new futures after shutdown Fix this by tracking every exclusive Cluster in a list and shutting them all down in driver_close(). Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
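The fix described in the commit above (record every exclusive connection so that a single close call can shut them all down) can be sketched as follows. This is an illustration only: `ExclusiveConnectionTracker`, `FakeCluster`, and the `connect` callable are hypothetical stand-ins, not the actual `ManagerClient` code or the driver's `Cluster` class.

```python
class FakeCluster:
    """Hypothetical stand-in for a driver Cluster object."""

    def __init__(self, host):
        self.host = host
        self.is_shutdown = False

    def shutdown(self):
        self.is_shutdown = True


class ExclusiveConnectionTracker:
    """Records each exclusive cluster it creates so none is leaked."""

    def __init__(self, connect=FakeCluster):
        self._connect = connect
        self._exclusive = []  # every cluster handed out by get_exclusive()

    def get_exclusive(self, host):
        cluster = self._connect(host)
        self._exclusive.append(cluster)  # remember it for close()
        return cluster

    def close(self):
        # Shut down all tracked clusters; the list ends up empty, so
        # calling close() twice is harmless.
        while self._exclusive:
            self._exclusive.pop().shutdown()
```

Without the list, a cluster created per call would outlive the close call, which matches the leak the commit message describes.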
Dario Mirovic	d94999f87b	test: cluster: fix multinode audit entry comparison in test_audit.py assert_entries_were_added computes new audit rows by slicing the "after" list at the length of the "before" list: rows_after[len(rows_before):]. This assumes new rows always appear at the tail of the combined sorted list. In a multinode setup, each node generates its own event_time timestamps. A new row from node A can sort before an old row from node B, breaking the tail assumption. The assertion "new rows are not the last rows in the audit table" then fires. Fix this by splitting the before/after lists per node and computing the new rows tail independently for each node. This guarantees that per node ordering, which is monotonic, is respected, and the combined new rows are sorted afterwards. Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
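To illustrate why the global tail slice breaks and how the per-node split avoids it, here is a small sketch. It assumes each audit row can be mapped to its node and to a sortable timestamp; `new_rows_per_node`, `node_of`, and `sort_key` are hypothetical names, and the rows are plain tuples rather than real audit table rows.

```python
from collections import defaultdict


def new_rows_per_node(rows_before, rows_after, node_of, sort_key):
    """Return the rows added between two snapshots, computed per node.

    Within one node the ordering is monotonic, so the new rows of that
    node are exactly the tail of its sorted "after" list.
    """
    before = defaultdict(list)
    after = defaultdict(list)
    for row in rows_before:
        before[node_of(row)].append(row)
    for row in rows_after:
        after[node_of(row)].append(row)

    new_rows = []
    for node, node_rows in after.items():
        node_rows.sort(key=sort_key)
        new_rows.extend(node_rows[len(before[node]):])
    # Combine the per-node tails and sort them afterwards.
    return sorted(new_rows, key=sort_key)


# Rows are (node, event_time, operation) tuples in this example.
rows_before = [("A", 10, "old"), ("B", 20, "old")]
rows_after = [("A", 10, "old"), ("A", 15, "new"), ("B", 20, "old")]
by_time = lambda row: row[1]

# Slicing the combined sorted list at len(rows_before) picks the wrong row,
# because the new row from node A sorts before the old row from node B.
naive = sorted(rows_after, key=by_time)[len(rows_before):]
per_node = new_rows_per_node(rows_before, rows_after, lambda r: r[0], by_time)
```

Here `naive` contains the old row from node B, while `per_node` contains only the new row from node A.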
Dario Mirovic	249a6cec1b	test: cluster: dtest: remove old audit tests Since audit tests have been migrated to test/cluster/test_audit.py, old tests in test/cluster/dtest/audit_test.py have to be removed. Refs SCYLLADB-573	2026-03-19 16:12:13 +01:00
Dario Mirovic	adc790a8bf	test: cluster: group migrated audit tests for cluster reuse This patch reorganizes the execution flow of the test functions. They are grouped to enable cluster reuse between specific test functions. One of the main contributors to the test execution time is the cluster preparation. This patch significantly reduces the total test execution time by having way less new cluster preparation calls and more cluster reuse. Performance increase on the developer machine is around 38%: - before: 4m 29s - after: 2m 47s Fixes SCYLLADB-573	2026-03-19 16:11:47 +01:00
Dario Mirovic	967b7ff6bf	test: cluster: enable migrated audit tests and make them work Move audit tests from test/cluster/dtest to test/cluster. test/cluster environment has less overhead, and audit tests are heavy, their execution taking lots of time. This patch is part of an effort to improve audit test suite performance. This patch refactors the tests so that they execute correctly, as well as enables them. A follow up patch will remove the audit tests in test/cluster/dtest. All the tests are confirmed to be running after the change. No dead code present. Test test_audit_categories_invalid is not parametrized anymore. It never used the parametrized helper class, so it just ran the same logic three times. This is why there are now 74, and not 76, test executions. Refs SCYLLADB-573	2026-03-19 16:07:28 +01:00
Dario Mirovic	5d51501a0b	pgo: use maintenance socket for CQL setup in PGO training The default 'cassandra' superuser was removed from ScyllaDB, which broke PGO training. exec_cql.py relied on username/password auth ('cassandra'/'cassandra') to execute setup CQL scripts like auth.cql and counters.cql. Switch exec_cql.py to connect via the Unix domain maintenance socket instead. The maintenance socket bypasses authentication, no credentials are needed. Additionally, create the 'cassandra' superuser via the maintenance socket during the populate phase, so that cassandra-stress keeps working. cassandra-stress hardcodes user=cassandra password=cassandra. Changes: - exec_cql.py: replace host/port/username/password arguments with a single --socket argument; add connect_maintenance_socket() with wait ready logic - pgo.py: add maintenance_socket_path() helper; update populate_auth_conns() and populate_counters() to pass the socket path to exec_cql.py Fixes SCYLLADB-1070 Closes scylladb/scylladb#29081	2026-03-19 16:52:36 +02:00
Dario Mirovic	8367509b3b	test: pylib: manager_client: specify AuthProvider in get_cql_exclusive This patch allows ManagerClient.get_cql_exclusive to accept AuthProvider as parameter. This will be used in a follow up patch which migrates audit test suite to test/cluster and requires this functionality for some tests. Refs SCYLLADB-573	2026-03-19 15:35:24 +01:00
Dario Mirovic	0a7a69345c	test: pylib: scylla cluster after_test log fix Before any test, a pool of ScyllaCluster objects is created. At the beginning of a test suite, a ScyllaClusterManager is created, and given a reference to the pool. At the end of a test suite, the ScyllaClusterManager is destroyed. Before each test case: - ManagerClient is constructed and connected to the ScyllaClusterManager of that test suite - A ScyllaCluster object is fetched from the pool - If the pool is empty, a new ScyllaCluster object is created - If the pool is not empty, a cached ScyllaCluster object is returned After each test case: - Return ScyllaCluster object from ManagerClient to the pool - If the cluster is dirty, the pool destroys it - If the cluster is clean, the pool caches it - ManagerClient is destroyed Many actions mark a cluster as dirty. Normal test execution will always make the cluster be destroyed upon returning to the pool. ManagerClient.mark_clean is not used in the tests. When it is used, the flow with cluster reuse happens. The bug is that the log file is closed even if cluster is not dirty. This causes an error when trying to log to a reused cluster server. The solution in this patch is to not close the log file if the cluster is not dirty. Upon cluster reuse the log file will be open and functional. Another approach would be to reopen the log file if closed, but this approach seems more clean. Refs SCYLLADB-573	2026-03-19 15:35:24 +01:00
Dario Mirovic	899ae71349	test: audit: copy audit test from dtest This patch just copies the audit test suite from dtest and disables it in the test config file. Later patches will update the code and enable the test suite. Refs SCYLLADB-573	2026-03-19 15:35:24 +01:00
Andrzej Jackowski	4deeb7ebfc	test: add new guardrail tests matching documentation scenarios Add tests for RF guardrails (min/max warn/fail, RF=0 bypass, threshold=-1 disable, ALTER KEYSPACE) and write consistency level guardrails to cover all scenarios described in guardrails.rst. Test runtime (dev): test_guardrail_replication_strategy - 6s test_guardrail_write_consistency_level - 5s Refs: SCYLLADB-257	2026-03-19 15:07:03 +01:00
Andrzej Jackowski	2a03c634c0	test: add metric assertions to guardrail replication strategy tests Verify that guardrail violations increment the corresponding metrics. Refs: SCYLLADB-257	2026-03-19 15:07:03 +01:00
Andrzej Jackowski	81c4e717e2	test: use regex matching in guardrail replication strategy tests Replace loose substring assertions with regex-based matching against the exact server message formats. Add regex constants for all guardrail messages and rewrite create_ks_and_assert_warnings_and_errors() to verify count and content of warnings and failures. Refs: SCYLLADB-257	2026-03-19 15:07:03 +01:00
Anna Stuchlik	6b1df5202c	doc: remove the instructions to install old versions from Web Installer The Web Installer page includes instructions to install the old pre-2025.1 Enterprise versions, which are no longer supported (since we released 2026.1). This commit removes those redundant and misleading instructions. Fixes https://github.com/scylladb/scylladb/issues/29099 Closes scylladb/scylladb#29103	2026-03-19 15:47:00 +02:00
Piotr Dulikowski	171504c84f	Merge 'auth: migrate some standard role manager APIs to use cache' from Marcin Maliszkiewicz This patchset migrates: query_all_directly_granted, query_all, get_attribute, query_attribute_for_all functions to use cache instead of doing CQL queries. It also includes some preparatory work which fixes cache update order and triggering. Main motivation behind this is to make sure that all calls from service_level_controller::auth_integration are cached, which we achieve here. Alternative implementation could move the whole auth_integration data into auth cache but since auth_integration manages also lifetime and contains service levels specific logic such solution would be too complex for little (if any) gain. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-159 Backport: no, not a bug Closes scylladb/scylladb#28791 * github.com:scylladb/scylladb: auth: switch query_attribute_for_all to use cache auth: switch get_attribute to use cache auth: cache: add heterogeneous map lookups auth: switch query_all to use cache auth: switch query_all_directly_granted to use cache auth: cache: add ability to go over all roles raft: service: reload auth cache before service levels service: raft: move update_service_levels_effective_cache check	2026-03-19 14:37:22 +01:00
Avi Kivity	5e7fb08bf3	Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec This applies to small partition workload where index pages have high partition count, and the index doesn't fit in cache. It was observed that the count can be in the order of hundreds. In such a workload pages undergo constant population, LSA compaction, and LSA eviction, which has severe impact on CPU utilization. Refs https://scylladb.atlassian.net/browse/SCYLLADB-620 This PR reduces the impact by several changes: - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition. - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct. - index entries and key storage are now trivially moveable, and batched inside vector storage so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction. - LSA eviction is now pretty much constant time for the whole page regardless of the number of entries, because elements are trivial and batched inside vectors. Page eviction cost dropped from 50 us to 1 us. 
Performance evaluated with: scylla perf-simple-query -c1 -m200M --partitions=1000000 Before: ``` 7774.96 tps (166.0 allocs/op, 521.7 logallocs/op, 54.0 tasks/op, 802428 insns/op, 430457 cycles/op, 0 errors) 7511.08 tps (166.1 allocs/op, 527.2 logallocs/op, 54.0 tasks/op, 804185 insns/op, 430752 cycles/op, 0 errors) 7740.44 tps (166.3 allocs/op, 526.2 logallocs/op, 54.2 tasks/op, 805347 insns/op, 432117 cycles/op, 0 errors) 7818.72 tps (165.2 allocs/op, 517.6 logallocs/op, 53.7 tasks/op, 794965 insns/op, 427751 cycles/op, 0 errors) 7865.49 tps (165.1 allocs/op, 513.3 logallocs/op, 53.6 tasks/op, 788898 insns/op, 425171 cycles/op, 0 errors) ``` After (+318%): ``` 32492.40 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109236 insns/op, 103203 cycles/op, 0 errors) 32591.99 tps (130.4 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 108947 insns/op, 102889 cycles/op, 0 errors) 32514.52 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109118 insns/op, 103219 cycles/op, 0 errors) 32491.14 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109349 insns/op, 103272 cycles/op, 0 errors) 32582.90 tps (130.5 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109269 insns/op, 102872 cycles/op, 0 errors) 32479.43 tps (130.6 allocs/op, 12.8 logallocs/op, 36.0 tasks/op, 109313 insns/op, 103242 cycles/op, 0 errors) 32418.48 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109201 insns/op, 103301 cycles/op, 0 errors) 31394.14 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109267 insns/op, 103301 cycles/op, 0 errors) 32298.55 tps (130.7 allocs/op, 12.8 logallocs/op, 36.1 tasks/op, 109323 insns/op, 103551 cycles/op, 0 errors) ``` When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost): perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0 Before: ``` 9124.57 tps (146.2 allocs/op, 789.0 logallocs/op, 45.3 tasks/op, 889320 insns/op, 357937 cycles/op, 0 errors) 
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op, 45.3 tasks/op, 889613 insns/op, 357782 cycles/op, 0 errors) 9455.65 tps (146.0 allocs/op, 787.4 logallocs/op, 45.2 tasks/op, 887606 insns/op, 357167 cycles/op, 0 errors) 9451.22 tps (146.0 allocs/op, 787.4 logallocs/op, 45.3 tasks/op, 887627 insns/op, 357357 cycles/op, 0 errors) 9429.50 tps (146.0 allocs/op, 787.4 logallocs/op, 45.3 tasks/op, 887761 insns/op, 358148 cycles/op, 0 errors) 9430.29 tps (146.1 allocs/op, 788.2 logallocs/op, 45.3 tasks/op, 888501 insns/op, 357679 cycles/op, 0 errors) 9454.08 tps (146.0 allocs/op, 787.3 logallocs/op, 45.3 tasks/op, 887545 insns/op, 357132 cycles/op, 0 errors) ``` After (+55%): ``` 14484.84 tps (150.7 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 396164 insns/op, 229490 cycles/op, 0 errors) 14526.21 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 396401 insns/op, 228824 cycles/op, 0 errors) 14567.53 tps (150.7 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 396319 insns/op, 228701 cycles/op, 0 errors) 14545.63 tps (150.6 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395889 insns/op, 228493 cycles/op, 0 errors) 14626.06 tps (150.5 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395254 insns/op, 227891 cycles/op, 0 errors) 14593.74 tps (150.5 allocs/op, 6.5 logallocs/op, 44.7 tasks/op, 395480 insns/op, 227993 cycles/op, 0 errors) 14538.10 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 397035 insns/op, 228831 cycles/op, 0 errors) 14527.18 tps (150.8 allocs/op, 6.5 logallocs/op, 44.8 tasks/op, 396992 insns/op, 228839 cycles/op, 0 errors) ``` Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages): Before: ``` 33906.70 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170553 insns/op, 98104 cycles/op, 0 errors) 32696.16 tps (146.0 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170369 insns/op, 98405 cycles/op, 0 errors) 33889.05 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170551 insns/op, 98135 cycles/op, 0 errors) 33893.24 tps 
(146.1 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170488 insns/op, 98168 cycles/op, 0 errors) 33836.73 tps (146.1 allocs/op, 83.6 logallocs/op, 45.1 tasks/op, 170528 insns/op, 98226 cycles/op, 0 errors) 33897.61 tps (146.0 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170428 insns/op, 98081 cycles/op, 0 errors) 33834.73 tps (146.1 allocs/op, 83.5 logallocs/op, 45.1 tasks/op, 170438 insns/op, 98178 cycles/op, 0 errors) 33776.31 tps (146.3 allocs/op, 83.9 logallocs/op, 45.2 tasks/op, 170958 insns/op, 98418 cycles/op, 0 errors) 33808.08 tps (146.3 allocs/op, 83.9 logallocs/op, 45.2 tasks/op, 170940 insns/op, 98388 cycles/op, 0 errors) ``` After (+18%): ``` 40081.51 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121047 insns/op, 82231 cycles/op, 0 errors) 40005.85 tps (148.6 allocs/op, 4.4 logallocs/op, 45.2 tasks/op, 121327 insns/op, 82545 cycles/op, 0 errors) 39816.75 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121067 insns/op, 82419 cycles/op, 0 errors) 39953.11 tps (148.1 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121027 insns/op, 82258 cycles/op, 0 errors) 40073.96 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121006 insns/op, 82313 cycles/op, 0 errors) 39882.25 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 120925 insns/op, 82320 cycles/op, 0 errors) 39916.08 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121054 insns/op, 82393 cycles/op, 0 errors) 39786.30 tps (148.2 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121027 insns/op, 82465 cycles/op, 0 errors) 38662.45 tps (148.3 allocs/op, 4.4 logallocs/op, 45.0 tasks/op, 121108 insns/op, 82312 cycles/op, 0 errors) 39849.42 tps (148.3 allocs/op, 4.4 logallocs/op, 45.1 tasks/op, 121098 insns/op, 82447 cycles/op, 0 errors) ``` Closes scylladb/scylladb#28603 * github.com:scylladb/scylladb: sstables: mx: index_reader: Optimize parsing for no promoted index case vint: Use std::countl_zero() test: sstable_partition_index_cache_test: Validate scenario of pages with sparse 
promoted index placement sstables: mx: index_reader: Amortize partition key storage managed_bytes: Hoist write_fragmented() to common header utils: managed_vector: Use std::uninitialized_move() to move objects sstables: mx: index_reader: Keep promoted_index info next to index_entry sstables: mx: index_reader: Extract partition_index_page::clear_gently() sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation sstables: mx: index_reader: Keep index_entry directly in the vector dht: Introduce raw_token test: perf_simple_query: Add 'sstable-format' command-line option test: perf_simple_query: Add 'sstable-summary-ratio' command-line option test: perf-simple-query: Add option to disable index cache test: cql_test_env: Respect enable-index-cache config	2026-03-19 14:42:50 +02:00
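As a rough illustration of the "flattening the object graph" idea in the merge above, the sketch below keeps all key bytes of a page in one buffer plus a flat list of end offsets, instead of one object per key. `FlatIndexPage` and its layout are assumptions made for illustration; the actual index_reader structures are not shown in this log.

```python
class FlatIndexPage:
    """Hypothetical page layout: one bytes buffer for all keys plus end offsets."""

    def __init__(self, keys):
        self._keys = b"".join(keys)  # a single buffer holds every key
        self._ends = []              # end offset of each key inside _keys
        end = 0
        for key in keys:
            end += len(key)
            self._ends.append(end)

    def __len__(self):
        return len(self._ends)

    def key(self, i):
        start = self._ends[i - 1] if i else 0
        return self._keys[start:self._ends[i]]

    def clear(self):
        # Eviction drops two flat containers regardless of the entry count.
        self._keys = b""
        self._ends = []


page = FlatIndexPage([b"aa", b"bbb", b"c"])
```

With this shape, moving or evicting a page touches a fixed number of containers rather than one allocation per entry, which is the property the merge description attributes to the new layout.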
Botond Dénes	4981e72607	Merge 'replica: avoid unnecessary computation on token lookup hot path' from Łukasz Paszkowski `storage_group_of()` sits on the replica-side token lookup hot path, yet it called `tablet_map::get_tablet_id_and_range_side()`, which always computes both the tablet id and the post-split range side — even though most callers only need the storage group id. The range-side computation is only relevant when a storage group is in tablet splitting mode, but we were paying for it unconditionally on every lookup. This series fixes that by: 1. Adding `tablet_map::get_tablet_range_side()` so the range side can be computed independently when needed. 2. Adding lazy `select_compaction_group()` overloads that defer the range-side computation until splitting mode is actually active. 3. Switching `storage_group_of()` to use the cheaper `get_tablet_id()` path, only computing the range side on demand. Improvements. No backport is required. Closes scylladb/scylladb#28963 * github.com:scylladb/scylladb: replica/table: avoid computing token range side in storage_group_of() on hot path replica/compaction_group: add lazy select_compaction_group() overloads locator/tablets: add tablet_map::get_tablet_range_side()	2026-03-19 14:27:12 +02:00
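The lazy overload idea from the merge above can be sketched as passing a callable that is evaluated only in splitting mode. The names below (`StorageGroup`, `compute_side`, the string group labels) are hypothetical and do not reflect the real replica code.

```python
from dataclasses import dataclass


@dataclass
class StorageGroup:
    """Hypothetical storage group with one main group and two split targets."""
    splitting: bool = False
    main_group: str = "main"
    split_groups: tuple = ("left", "right")


def select_compaction_group(group, range_side):
    """range_side is a zero-argument callable, evaluated only in splitting mode."""
    if not group.splitting:
        return group.main_group
    return group.split_groups[range_side()]


side_computations = []


def compute_side(token):
    # Placeholder for the range-side computation; records each invocation.
    side_computations.append(token)
    return token % 2


hot = select_compaction_group(StorageGroup(), lambda: compute_side(7))
split = select_compaction_group(StorageGroup(splitting=True), lambda: compute_side(7))
```

Only the second call runs `compute_side`, so the common path never pays for the range-side computation.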
Ernest Zaslavsky	aa9da87e97	encryption: fix deadlock in encrypted_data_source::get() When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS. In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call. Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely. A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.	2026-03-19 13:54:54 +02:00
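A simplified model of the bug and the guard described above, under stated assumptions: `StrictSource` plays the role of the strict test source that fails on a post-EOS `get()`, and `Stream.read_exactly()` deliberately mimics a reader that does not check EOF before pulling from the source. None of these classes are the real Seastar or ScyllaDB types.

```python
class StrictSource:
    """Hypothetical source that raises if get() is called after end-of-stream."""

    def __init__(self, chunks):
        self._chunks = list(chunks)
        self._done = False

    def get(self):
        if self._done:
            raise RuntimeError("get() called after end-of-stream")
        if not self._chunks:
            self._done = True
            return b""
        return self._chunks.pop(0)


class Stream:
    """Toy buffered reader whose read_exactly() does not consult eof first."""

    def __init__(self, source):
        self._source = source
        self._buf = b""
        self.eof = False

    def read_exactly(self, n):
        while len(self._buf) < n:
            chunk = self._source.get()
            if not chunk:
                self.eof = True
                break
            self._buf += chunk
        out, self._buf = self._buf[:n], self._buf[n:]
        return out


def read_block(stream, size):
    # The guard described above: skip the read once the stream reached EOF.
    if stream.eof:
        return b""
    return stream.read_exactly(size)


stream = Stream(StrictSource([b"x" * 8]))
first = read_block(stream, 16)   # drains the source and observes end-of-stream
second = read_block(stream, 16)  # guarded: the source is not touched again
```

In this model the unguarded second read raises; in the production case described by the commit, the equivalent call hung instead.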
Ernest Zaslavsky	f74a54f005	test_lib: mark `limiting_data_source_impl` as not `final`	2026-03-19 13:54:54 +02:00
Ernest Zaslavsky	151e945d9f	Fix formatting after previous patch	2026-03-19 13:54:44 +02:00
Andrzej Jackowski	517bb8655d	test: extract ks_opts helper in test_guardrail_replication_strategy Factor out ks_opts() to build keyspace options with tablets handling and use it across all existing replication strategy guardrail tests. No behavioral changes. This facilitates further modification of the tests later in this patch series. Refs: SCYLLADB-257	2026-03-19 12:49:41 +01:00
Andrzej Jackowski	9b24d9ee7d	docs: document CQL guardrails Add docs/cql/guardrails.rst covering replication factor, replication strategy, write consistency level, and compact storage guardrails. Fixes: SCYLLADB-257	2026-03-19 12:49:41 +01:00
Ernest Zaslavsky	537747cf5d	Fix indentation after previous patch	2026-03-19 13:48:53 +02:00
Ernest Zaslavsky	2535164542	test_lib: make limiting_data_source_impl available to tests Relocate the `limiting_data_source_impl` declaration to the header file so that test code can access it directly.	2026-03-19 13:48:53 +02:00
Botond Dénes	86d7c82993	test/cluster/test_repair.py: use tablets in test_repair_timestamp_difference After repair, the test runs a major compaction to merge all sstables into a single one, so the results can be simply checked by a select from mutation_fragments() query. Sometimes off-strategy compaction runs in parallel with this major compaction, so after the major compaction there are still 2 sstables, resulting in the test failing when checking that the query returns just a single row. To fix, just use tablets for the test table; tablets no longer use off-strategy compaction. Fixes: SCYLLADB-940 Closes scylladb/scylladb#29071	2026-03-19 12:42:18 +03:00
Michael Litvak	399260a6c0	test: mv: fix flaky wait for commitlog sync Previously the test test_interrupt_view_build_shard_registration stopped the node ungracefully and used commitlog periodic mode to persist the view build progress in a not very reliable way. It can happen that due to timing issues, the view build progress is not persisted, or some of it is persisted in a different ordering than expected. To make the test more reliable we change it to stop the node gracefully, so the commitlog is persisted in a graceful and consistent way, without using the periodic mode delay. We need to also change the injection for the shutdown to not get stuck. Fixes SCYLLADB-1005 Closes scylladb/scylladb#29008	2026-03-19 10:41:21 +01:00
Pavel Emelyanov	f27dc12b7c	Merge 'Fix directory lister leak in table::get_snapshot_details: ' from Benny Halevy As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown. For example, see backtrace below: ``` seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57 directory_lister::~directory_lister() at ./utils/lister.cc:77 replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247 (inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129 seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695 (inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201 seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185 (inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353 seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245 seastar::app_template::run_deprecated(int, char, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266 seastar::app_template::run(int, char, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160 scylla_main(int, char*) at ./main.cc:756 ``` Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013) Requires backport to 2026.1 since 
the leak exists since `004c08f525` [SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#29084 * github.com:scylladb/scylladb: test/boost/database_test: add test_snapshot_ctl_details_exception_handling table: get_snapshot_details: fix indentation inside try block table: per-snapshot get_snapshot_details: fix typo in comment table: per-snapshot get_snapshot_details: always close lister using try/catch table: get_snapshot_details: always close lister using deferred_close	2026-03-19 12:40:23 +03:00
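The "always close the lister" rule from the merge above maps onto a try/finally shape. The sketch below is a Python analogue with hypothetical names (`DirectoryLister`, `total_size`); the real fix uses `deferred_close` and try/catch in C++ coroutines.

```python
class DirectoryLister:
    """Hypothetical lister that must be closed before it is dropped."""

    def __init__(self, entries):
        self._entries = iter(entries)
        self.closed = False

    def get(self):
        return next(self._entries, None)

    def close(self):
        self.closed = True


def total_size(lister, size_of):
    """Sum entry sizes, closing the lister on success and on failure alike."""
    try:
        total = 0
        while (entry := lister.get()) is not None:
            total += size_of(entry)
        return total
    finally:
        lister.close()


def failing_size(entry):
    raise OSError("stat failed")


ok_lister = DirectoryLister(["a", "bb"])
ok_total = total_size(ok_lister, len)

bad_lister = DirectoryLister(["a"])
try:
    total_size(bad_lister, failing_size)
except OSError:
    pass
```

Both listers end up closed, including the one whose traversal raised.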
Raphael S. Carvalho	3143134968	test: avoid split/major compaction deadlock in tablet split test Run keyspace compaction asynchronously in `test_tombstone_gc_correctness_during_tablet_split` and only await it after `split_sstable_rewrite` is disabled. The problem is that `keyspace_compaction()` starts with a flush, and that flush can take around five seconds. During that window the split compaction is stopped before major compaction is retried. The stop aborts the in-flight major compaction attempt, then the split proceeds far enough to enter the `split_sstable_rewrite` injection point. At that point the test used to wait synchronously for major compaction to finish, but major compaction cannot finish yet: when it retries, it needs the same semaphore that is still effectively tied up behind the blocked split rewrite. So the test waits for major compaction, while the split waits for the injection to be released, and the code that would release that injection never runs. Starting major compaction as a task breaks that cycle. The test can first disable `split_sstable_rewrite`, let the split get out of the way, and only then wait for major compaction to complete. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-827. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#29066	2026-03-19 11:12:21 +02:00
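The ordering fix described above (start major compaction as a task, release the injection, then await the task) can be modelled with plain asyncio. Everything here is a hypothetical stand-in: the event represents the `split_sstable_rewrite` injection being released, and no real test API is used.

```python
import asyncio


async def run_test():
    split_released = asyncio.Event()  # stands in for the injection point
    order = []

    async def major_compaction():
        # Cannot complete while the split is still blocked.
        await split_released.wait()
        order.append("major compaction done")

    # Start compaction as a task instead of awaiting it inline ...
    compaction = asyncio.create_task(major_compaction())
    await asyncio.sleep(0)            # let the task start and block
    order.append("injection disabled")
    split_released.set()              # ... so this line stays reachable
    await compaction
    return order


events = asyncio.run(run_test())
```

Awaiting `major_compaction()` directly before `split_released.set()` would never return, which is the cycle the commit message describes.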
Botond Dénes	2e47fd9f56	Merge 'tasks: do not fail the wait request if rpc fails' from Aleksandra Martyniuk During decommission, we first mark a topology request as done, then shut down a node and in the following steps we remove node from the topology. Thus, finished request does not imply that a node is removed from the topology. Due to that, in node_ops_virtual_task::wait, while gathering children from the whole cluster, we may hit the connection exception - because a node is still in topology, even though it is down. Modify the get_children method to ignore the exception and warn about the failure instead. Keep token_metadata_ptr in get_children to prevent topology from changing. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-867 Needs backports to all versions Closes scylladb/scylladb#29035 * github.com:scylladb/scylladb: tasks: fix indentation tasks: do not fail the wait request if rpc fails tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children	2026-03-19 10:03:18 +02:00
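A minimal sketch of the "warn instead of fail" behaviour described above, assuming the failure surfaces as a `ConnectionError`. The function names and the `fake_fetch` helper are hypothetical; the real change lives in the C++ task manager.

```python
import asyncio
import logging

log = logging.getLogger("tasks_sketch")


async def get_children(nodes, fetch_children):
    """Collect children from every node, skipping nodes that cannot be reached."""
    results = await asyncio.gather(
        *(fetch_children(node) for node in nodes), return_exceptions=True
    )
    children = []
    for node, result in zip(nodes, results):
        if isinstance(result, ConnectionError):
            # A node may still be in the topology although it is already down.
            log.warning("failed to get children from %s: %s", node, result)
            continue
        if isinstance(result, BaseException):
            raise result  # other failures still propagate
        children.extend(result)
    return children


async def fake_fetch(node):
    if node == "down-node":
        raise ConnectionError("connection refused")
    return [f"{node}/task"]


collected = asyncio.run(get_children(["n1", "down-node", "n2"], fake_fetch))
```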
Piotr Smaron	a2ad57062f	docs/cql: clarify WHERE clause boolean limitations Document that `SELECT ... WHERE` clause currently accepts only conjunctions of relations joined by `AND` (`OR` is not supported), and that parentheses cannot be used to group boolean subexpressions. Add an unsupported query example and point readers to equivalent `IN` rewrites when applicable. This problem has been raised by one of our users in https://forum.scylladb.com/t/error-parsing-query-or-unsupported-statement/5299, and while one could infer answer to user's question by looking at the syntax of the `SELECT ... WHERE`, it's not immediately obvious to non-advanced users, so clarifying these concepts is justified. Fixes: SCYLLADB-1116 Closes scylladb/scylladb#29100	2026-03-19 09:47:22 +02:00
Michael Litvak	31d339e54a	logstor: trigger separator flush for buffers that hold old segments A compaction group has a separator buffer that holds the mixed segments alive until the separator buffer is flushed. A mixed segment can be freed only after all separator buffers that hold writes from the segment are flushed. Typically a separator buffer is flushed when it becomes full. However, it's possible, for example, that one compaction group fills more slowly than the others and holds many segments. To fix this we trigger a separator flush periodically for separator buffers that hold old segments. We track the active segment sequence number and, for each separator buffer, the oldest sequence number it holds.	2026-03-18 19:24:28 +01:00
Michael Litvak	ad87eda835	docs/dev: add logstor documentation	2026-03-18 19:24:28 +01:00
Michael Litvak	a0da07e5b7	logstor: recover segments into compaction groups Fix the logstor recovery to work with compaction groups. When recovering a segment, find its token range and add it to the appropriate compaction group. If it doesn't fit in a single compaction group, then write each record to its compaction group's separator buffer.	2026-03-18 19:24:28 +01:00
Michael Litvak	24379acc76	logstor: range read extend the logstor mutation reader to support range read	2026-03-18 19:24:28 +01:00
Michael Litvak	a9d0211a64	logstor: change index to btree by token per table Change the primary index to be a btree that is ordered by token, similarly to a memtable, and create an index per table instead of a single global index.	2026-03-18 19:24:28 +01:00
Michael Litvak	e7c3942d43	logstor: move segments to replica::compaction_group Add a segment_set member to replica::compaction_group that manages the logstor segments that belong to the compaction group, similarly to how it manages sstables. Add also a separator buffer in each compaction group. When writing a mutation to a compaction group, the mutation is written to the active segment and to the separator buffer of the compaction group, and when the separator buffer is flushed the segment is added to the compaction_group's segment set.	2026-03-18 19:24:28 +01:00
Michael Litvak	d69f7eb0ee	db: update dirty mem limits dynamically when logstor is enabled, update the db dirty memory limits dynamically. previously the threshold was set to 0.5 of the available memory, so 0.5 went to memtables and 0.5 to others (cache). when logstor is enabled, we calculate the available memory excluding logstor, and divide it evenly between memtables and cache.	2026-03-18 19:24:27 +01:00
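The arithmetic in the commit above, as a small sketch. The function name, the byte units, and the clamp to zero are assumptions added for illustration; only the "exclude logstor, then split evenly" rule comes from the commit message.

```python
def dirty_memory_limit(available_memory, logstor_memory=0):
    """Return the memtable share: half of what remains after excluding logstor."""
    usable = max(available_memory - logstor_memory, 0)  # clamp is an assumption
    return usable // 2


without_logstor = dirty_memory_limit(1000)
with_logstor = dirty_memory_limit(1000, logstor_memory=200)
```

With no logstor usage this reduces to the previous fixed 0.5 threshold.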
Michael Litvak	65cd0b5639	logstor: track memory usage add logstor::get_memory_usage() that returns an estimate of the memory usage by logstor. add tracking to how many unique keys are held in the index.	2026-03-18 19:24:27 +01:00
Michael Litvak	b7bdb1010a	logstor: logstor stats api add api to get logstor statistics about segments for a table	2026-03-18 19:24:27 +01:00
Michael Litvak	8bd3bd7e2a	logstor: compaction buffer pool pre-allocate write buffers for compaction	2026-03-18 19:24:27 +01:00
Michael Litvak	caf5aa47c2	logstor: separator: flush buffer when full flush separator buffers when they become full and switched instead of aggregating all the buffers and flushing them when the separator is switched.	2026-03-18 19:24:27 +01:00
Michael Litvak	6ddb7a4d13	logstor: hold segment until index updates add a write gate to write_buffer. when writing a record to the write buffer, the gate is held and passed back to the caller, and the caller holds the gate until the write operation is complete, including follow-up operations such as updating the index after the write. in particular, when writing a mutation in logstor::write, the write buffer is held open until the write is completed and updated in the index. when writing the write buffer to the active segment, we write the buffer and then wait for the write buffer gate to close, i.e. we wait for all index updates to complete before proceeding. the segment is held open until all the write operations and index updates are complete. this property is useful for correctness: when a segment is closed we know that all the writes to it are updated in the index. this is needed in compaction for example, where we take closed segments and check which records in them are alive by looking them up in the index. if the index is not updated yet then it will be wrong.	2026-03-18 19:24:27 +01:00
Michael Litvak	bd66edee5c	logstor: truncate table implement freeing all segments of a table for table truncate. first do barrier to flush all active and mixed segments and put all the table's data in compaction groups, then stop compaction for the table, then free the table's segments and remove the live entries from the index.	2026-03-18 19:24:27 +01:00
Michael Litvak	489efca47c	logstor: enable/disable compaction per table add functions to enable or disable compaction for a specific compaction group or for all compaction groups of a table.	2026-03-18 19:24:27 +01:00
Michael Litvak	21db4f3ed8	logstor: separator buffer pool pre-allocate write buffers for the separator	2026-03-18 19:24:27 +01:00
Michael Litvak	37c485e3d1	test: logstor: add separator and compaction tests	2026-03-18 19:24:27 +01:00
Michael Litvak	31aefdc07d	logstor: segment and separator barrier add barrier operation that forces switch of the active segment and separator, and waits for all existing segments to close and all separators to flush.	2026-03-18 19:24:27 +01:00
Michael Litvak	1231fafb46	logstor: separator debt controller add tracking of the total separator debt - writes that were written to a separator and waiting to be flushed, and add flow control to keep the debt in control by delaying normal writes.	2026-03-18 19:24:27 +01:00
Michael Litvak	17cb173e18	logstor: compaction controller adjust compaction shares by the compaction overhead: how many segments compaction writes to generate a single free segment for new writes.	2026-03-18 19:24:27 +01:00
Michael Litvak	1da1bb9d99	logstor: recovery: recover mixed segments using separator on recovery we may find mixed segments. recover them by adding them to a separator, reading all their records and writing them to the separator, and flush the separator.	2026-03-18 19:24:27 +01:00
Michael Litvak	b78cc787a6	logstor: wait for pending reads in compaction we free a segment from compaction after updating all live records in the segment to point to new locations in the index. we need to ensure there are no running operations that use the old locations before we free the segment.	2026-03-18 19:24:27 +01:00
Michael Litvak	600ec82bec	logstor: separator initial implementation of the separator. it replaces "mixed" segments, which have records from different groups, with per-group segments. every write is written to the active segment and to a buffer in the active separator. the active separator has in-memory buffers by group. at some threshold number of segments we switch the active segment and separator atomically, and start flushing the separator. the separator is flushed by writing the buffers into new non-mixed segments, adding them to a compaction group, and freeing the mixed segments.	2026-03-18 19:24:27 +01:00
Michael Litvak	009fc3757a	logstor: compaction groups divide the segments in the compaction manager into compaction groups. compaction will compact only segments from a single compaction group at a time.	2026-03-18 19:24:27 +01:00
Michael Litvak	b3293f8579	logstor: cache files for read keep all files for all segments open for read to improve reads.	2026-03-18 19:24:26 +01:00
Michael Litvak	5a16980845	logstor: recovery: initial initial and basic recovery implementation. * find all files, read their segments and populate the index with the newest record for each key. * find which segments are used and build the usage histogram	2026-03-18 19:24:26 +01:00
Michael Litvak	bc9fc96579	logstor: add segment generation add segment generation number that is incremented when the segment is reused, and it's written to every buffer that is written to the segment. this is useful for recovery.	2026-03-18 19:24:26 +01:00
Michael Litvak	719f7cca57	logstor: reserve segments for compaction reserve segments for compaction so it always has enough segments to run and doesn't get stuck. do the compaction writes into full new segments instead of the active segment.	2026-03-18 19:24:26 +01:00
Michael Litvak	521fca5c92	logstor: index: buckets divide the primary index into buckets, each bucket containing a btree. the bucket is determined by using bits from the key hash.	2026-03-18 19:24:26 +01:00
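One way to picture bucket selection by hash bits is sketched below. The bucket count, the choice of the top bits of the first hash byte, the use of SHA-1 as a stand-in 20-byte hash, and the dict used in place of a btree are all assumptions for illustration.

```python
import hashlib

BUCKET_BITS = 4  # assumed: 16 buckets


def bucket_of(key_hash: bytes) -> int:
    """Pick a bucket from the top BUCKET_BITS bits of the first hash byte."""
    return key_hash[0] >> (8 - BUCKET_BITS)


class BucketedIndex:
    """Hypothetical index: one map per bucket (a dict stands in for the btree)."""

    def __init__(self):
        self._buckets = [dict() for _ in range(1 << BUCKET_BITS)]

    def put(self, key_hash, location):
        self._buckets[bucket_of(key_hash)][key_hash] = location

    def get(self, key_hash):
        return self._buckets[bucket_of(key_hash)].get(key_hash)


index = BucketedIndex()
key_hash = hashlib.sha1(b"some-key").digest()  # stand-in 20-byte hash
index.put(key_hash, ("segment-3", 4096))
```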
Michael Litvak	99c3b1998a	logstor: add buffer header add a buffer header in each write buffer we write that contains some information that can be useful for recovery and reading.	2026-03-18 19:24:26 +01:00
Michael Litvak	ddd72a16b0	logstor: add group_id add group_id value to each log record that is passed with the mutation when writing it. the group_id will be used to group log records in segments, such that a segment will contain records only from a single group. this will be useful for tablet migration. we want for each tablet to have their own segments with all their records, so we can migrate them efficiently by copying these segments. the group_id value is set to a value equivalent to the tablet id.	2026-03-18 19:24:26 +01:00
Michael Litvak	08bea860ef	logstor: record generation add a record generation number for each record so we can compare records and find which one is newer.	2026-03-18 19:24:26 +01:00
Michael Litvak	28f820eb1c	logstor: generation utility basic utility for generation numbers that will be useful next. a generation number is an unsigned integer that can be incremented and compared even if it wraps around, assuming the values we compare were written around the same time.	2026-03-18 19:24:26 +01:00
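A common way to compare counters that wrap around is serial-number style arithmetic, sketched below. The 16-bit width and the function names are assumptions; the log does not state how the logstor utility is implemented.

```python
BITS = 16                 # assumed counter width
MASK = (1 << BITS) - 1
HALF = 1 << (BITS - 1)


def next_generation(gen):
    """Increment with wraparound."""
    return (gen + 1) & MASK


def is_newer(a, b):
    """True if a was written after b, assuming they are less than HALF apart."""
    return a != b and ((a - b) & MASK) < HALF


wrapped = next_generation(MASK)
```

The comparison is only meaningful when the two values were produced close together, which matches the "written around the same time" assumption in the commit message.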
Michael Litvak	5f649dd39f	logstor: use RIPEMD-160 for index key use a 20-byte hash function for the index key to make hash collisions very unlikely. we assume there are no hash collisions.	2026-03-18 19:24:26 +01:00
Michael Litvak	a521bcbcee	test: add test_logstor.py add basic tests for key-value tables with logstor storage	2026-03-18 19:24:26 +01:00
Michael Litvak	1ae1f37ec1	api: add logstor compaction trigger endpoint add a new api endpoint that triggers logstor compaction.	2026-03-18 19:24:26 +01:00
Michael Litvak	2128b1b15c	replica: add logstor to db Add a single logstor instance in the database that is used for writing and reading to tables with kv storage	2026-03-18 19:24:26 +01:00
Michael Litvak	9172cc172e	schema: add logstor cf property add a schema property for tables with logstor storage	2026-03-18 19:24:26 +01:00
Michael Litvak	0b1343747f	logstor: initial commit initial implementation of the logstor storage engine for key-value tables that supports writes, reads and basic compaction. main components: * logstor: this is the main interface to users that supports writing and reading back mutations, and manages the internal components. * index: the primary index in-memory that maps a key to a location on disk. * write buffer: writes go initially to a write buffer. it accumulates multiple records in a buffer and writes them to the segment manager in 4k sized blocks. * segment manager: manages the storage - files, segments, compaction. it manages file and segment allocation, and writes 4k aligned buffers to the active segment sequentially. it tracks the used space in each segment. the compaction finds segments with low space usage, writes their live records to new segments, and frees the old segments.	2026-03-18 19:24:26 +01:00
Michael Litvak	27fd0c119f	db: disable tablet balancing with logstor initially logstor tables will not support tablet migrations, so disable tablet balancing if the experimental feature flag is set.	2026-03-18 19:24:26 +01:00
Michael Litvak	ed852a2af2	db: add logstor experimental feature flag add a new experimental feature flag for key-value tables with the new logstor storage engine.	2026-03-18 19:24:26 +01:00
Anna Stuchlik	88b98fac3a	doc: update the warning about shared dictionary training This commit updates the inadequate warning on the Advanced Internode (RPC) Compression page. The warning is replaced with a note about how training data is encrypted. Fixes https://github.com/scylladb/scylladb/issues/29109 Closes scylladb/scylladb#29111	2026-03-18 19:35:18 +02:00
Avi Kivity	46a6f8e1d3	Merge 'auth: add maintenance_socket_authorizer' from Dario Mirovic GRANT/REVOKE fails on the maintenance socket connections, because maintenance_auth_service uses allow_all_authorizer. allow_all_authorizer allows all operations, but not GRANT/REVOKE, because they make no sense in its context. This has been observed during PGO run failure in operations from ./pgo/conf/auth.cql file. This patch introduces maintenance_socket_authorizer that supports the capabilities of default_authorizer ('CassandraAuthorizer') without needing authorization. Refs SCYLLADB-1070 This is an improvement, no need for backport. Closes scylladb/scylladb#29080 * github.com:scylladb/scylladb: test: use NetworkTopologyStrategy in maintenance socket tests test: use cleanup fixture in maintenance socket auth tests auth: add maintenance_socket_authorizer	2026-03-18 19:29:57 +02:00
Pavel Emelyanov	d6c01be09b	s3/client: Don't reconstruct regex on every parse_content_range call Make the pattern static const so it is compiled once at first call rather than on every Content-Range header parse. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29054	2026-03-18 17:56:33 +02:00
Tomasz Grabiec	4410e9c61a	sstables: mx: index_reader: Optimize parsing for no promoted index case It's a common case with small partition workloads.	2026-03-18 16:25:21 +01:00
Tomasz Grabiec	32f8609b89	vint: Use std::countl_zero() It handles 0, and could generate better code for that. On Broadwell architecture, it translates to a single instruction (LZCNT). We're still on Westmere, so it translates to BSR with a conditional move. Also, drop unnecessary casts and bit arithmetic, which saves a few instructions. Move to header so that it's inlined in parsers.	2026-03-18 16:25:21 +01:00
Tomasz Grabiec	6017688445	test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement	2026-03-18 16:25:21 +01:00
Tomasz Grabiec	f55bb154ec	sstables: mx: index_reader: Amortize partition key storage This change reduces the cost of partition index page construction and LSA migration. This is achieved by several things working together: - index entries don't store keys as separate small objects (managed_bytes). They are written into one managed_bytes fragmented storage, entries hold offset into it. Before, we paid 16 bytes for managed_bytes plus LSA descriptor for the storage (1 byte) plus back-reference in the storage (8 bytes), so 25 bytes. Now we only pay 4 bytes for the size offset. If keys are 16 bytes, that's a reduction from 31 bytes to 20 bytes per key. - index entries and key storage are now trivially moveable, so LSA migration can use memcpy() which amortizes the cost per key. LSA eviction is now trivial and constant time for the whole page regardless of the number of entries. Page eviction dropped from 14 us to 1 us. This improves throughput in a CPU-bound miss-heavy read workload where the partition index doesn't fit in memory.
scylla perf-simple-query -c1 -m200M --partitions=1000000 Before: 15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors) 15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors) 15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors) 15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors) 15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors) 15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors) 15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors) After: 21620.18 tps (148.4 allocs/op, 13.4 logallocs/op, 43.7 tasks/op, 176817 insns/op, 153183 cycles/op, 0 errors) 20644.03 tps (149.8 allocs/op, 13.5 logallocs/op, 44.3 tasks/op, 187941 insns/op, 160409 cycles/op, 0 errors) 20588.06 tps (150.1 allocs/op, 13.5 logallocs/op, 44.5 tasks/op, 188090 insns/op, 160818 cycles/op, 0 errors) 20789.29 tps (149.5 allocs/op, 13.5 logallocs/op, 44.2 tasks/op, 186495 insns/op, 159382 cycles/op, 0 errors) 20977.89 tps (149.5 allocs/op, 13.4 logallocs/op, 44.2 tasks/op, 183969 insns/op, 158140 cycles/op, 0 errors) 21125.34 tps (149.1 allocs/op, 13.4 logallocs/op, 44.1 tasks/op, 183204 insns/op, 156925 cycles/op, 0 errors) 21244.42 tps (148.6 allocs/op, 13.4 logallocs/op, 43.8 tasks/op, 181276 insns/op, 155973 cycles/op, 0 errors) Mostly because the index now fits in memory. When it doesn't, the benefits are still visible due to lower LSA overhead.	2026-03-18 16:25:21 +01:00
Tomasz Grabiec	1452e92567	managed_bytes: Hoist write_fragmented() to common header	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	75e6412b1c	utils: managed_vector: Use std::uninitialized_move() to move objects It's shorter, and is supposed to be optimized for trivially-moveable types. Important for managed_vector<index_entry>, which can have lots of elements.	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	50dc7c6dd8	sstables: mx: index_reader: Keep promoted_index info next to index_entry Densely populated pages have no promoted index (small partitions), so we can save space in such workloads by keeping promoted index in a separate vector. For workloads which do have a promoted index, pages have only one partition. There aren't many such pages and they are long-lived, so the extra allocation of the vector is amortized. promoted_index class is removed, and replaced with equivalent parsed_promoted_index_entry for simplicity. Because it's removed, make_cursor() is moved into the index_reader class. Reducing the size of index_entry is important for performance if pages are densely populated. It helps to reduce LSA allocator pressure and the cost of compaction/eviction. This change, combined with the earlier change "Shave-off 16 bytes from index_entry by using raw_token", gives significant improvement in throughput in perf_simple_query run where the index doesn't fit in memory: scylla perf-simple-query -c1 -m200M --partitions=1000000 Before: 9714.78 tps (170.9 allocs/op, 16.9 logallocs/op, 55.3 tasks/op, 494788 insns/op, 343920 cycles/op, 0 errors) 9603.13 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 502358 insns/op, 348344 cycles/op, 0 errors) 9621.43 tps (171.9 allocs/op, 17.0 logallocs/op, 55.8 tasks/op, 500612 insns/op, 347508 cycles/op, 0 errors) 9597.75 tps (171.6 allocs/op, 17.0 logallocs/op, 55.6 tasks/op, 501428 insns/op, 348604 cycles/op, 0 errors) 9615.54 tps (171.6 allocs/op, 16.9 logallocs/op, 55.6 tasks/op, 501313 insns/op, 347935 cycles/op, 0 errors) 9577.03 tps (171.8 allocs/op, 17.0 logallocs/op, 55.7 tasks/op, 503283 insns/op, 349251 cycles/op, 0 errors) After: 15328.25 tps (150.0 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 286769 insns/op, 218134 cycles/op, 0 errors) 15279.01 tps (149.9 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 287696 insns/op, 218637 cycles/op, 0 errors) 15347.78 tps (149.7 allocs/op, 14.1 logallocs/op, 45.3
tasks/op, 285851 insns/op, 217795 cycles/op, 0 errors) 15403.68 tps (149.6 allocs/op, 14.1 logallocs/op, 45.2 tasks/op, 285111 insns/op, 216984 cycles/op, 0 errors) 15189.47 tps (150.0 allocs/op, 14.1 logallocs/op, 45.5 tasks/op, 289509 insns/op, 219602 cycles/op, 0 errors) 15295.04 tps (149.8 allocs/op, 14.1 logallocs/op, 45.3 tasks/op, 288021 insns/op, 218545 cycles/op, 0 errors) 15162.01 tps (149.8 allocs/op, 14.1 logallocs/op, 45.4 tasks/op, 291265 insns/op, 220451 cycles/op, 0 errors)	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	5e228a8387	sstables: mx: index_reader: Extract partition_index_page::clear_gently() There will be more elements to clear. And partition_index_page should know how to clear itself.	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	2d77e4fc28	sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token The std::optional<> adds 8 bytes. And dht::token adds 8 bytes due to _kind, which in this case is always kind::key. The size changed from 56 to 48 bytes.	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	e9c98274b5	sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation If the page has many entries, we continuously enter and leave the allocating section for every key. This can be avoided by batching LSA operations for the whole page, after collecting all the entries. Later optimizations will also build on this, where we will allocate fragmented storage for keys in LSA using a single managed_bytes constructor. This alone brings only a minor improvement, but it does reduce LSA allocations, probably due to less frequent memory reclamation: scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000 Before: 9560.42 tps (172.2 allocs/op, 19.6 logallocs/op, 57.7 tasks/op, 567741 insns/op, 345158 cycles/op, 0 errors) 9445.95 tps (173.1 allocs/op, 19.7 logallocs/op, 58.1 tasks/op, 579075 insns/op, 352173 cycles/op, 0 errors) 9576.75 tps (172.2 allocs/op, 19.6 logallocs/op, 57.6 tasks/op, 572004 insns/op, 347373 cycles/op, 0 errors) 9597.16 tps (172.2 allocs/op, 19.6 logallocs/op, 57.6 tasks/op, 569615 insns/op, 346618 cycles/op, 0 errors) 9454.07 tps (173.5 allocs/op, 19.8 logallocs/op, 58.3 tasks/op, 579213 insns/op, 351569 cycles/op, 0 errors) After: 9562.21 tps (172.0 allocs/op, 17.0 logallocs/op, 55.8 tasks/op, 499225 insns/op, 347832 cycles/op, 0 errors) 9480.20 tps (172.3 allocs/op, 17.0 logallocs/op, 55.9 tasks/op, 507271 insns/op, 350640 cycles/op, 0 errors) 9512.42 tps (172.1 allocs/op, 17.0 logallocs/op, 55.9 tasks/op, 504247 insns/op, 350392 cycles/op, 0 errors) 9498.45 tps (172.4 allocs/op, 17.1 logallocs/op, 55.9 tasks/op, 505765 insns/op, 350320 cycles/op, 0 errors) 9076.30 tps (173.5 allocs/op, 17.1 logallocs/op, 56.5 tasks/op, 512791 insns/op, 354792 cycles/op, 0 errors) 9542.62 tps (171.9 allocs/op, 17.0 logallocs/op, 55.8 tasks/op, 502532 insns/op, 348922 cycles/op, 0 errors)	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	0e0f9f41b3	sstables: mx: index_reader: Keep index_entry directly in the vector Partition index entries are relatively small, and if the workload has small partitions, index pages have a lot of elements. Currently, index entries are indirected via managed_ref, which causes increased cost of LSA eviction and compaction. This patch amortizes this cost by storing them directly in the managed_chunked_vector. This gives about 23% improvement in throughput in perf-simple-query for a workload where the index doesn't fit in memory: scylla perf-simple-query -c1 -m200M --duration 6000 --partitions=1000000 Before: 7774.96 tps (166.0 allocs/op, 521.7 logallocs/op, 54.0 tasks/op, 802428 insns/op, 430457 cycles/op, 0 errors) 7511.08 tps (166.1 allocs/op, 527.2 logallocs/op, 54.0 tasks/op, 804185 insns/op, 430752 cycles/op, 0 errors) 7740.44 tps (166.3 allocs/op, 526.2 logallocs/op, 54.2 tasks/op, 805347 insns/op, 432117 cycles/op, 0 errors) 7818.72 tps (165.2 allocs/op, 517.6 logallocs/op, 53.7 tasks/op, 794965 insns/op, 427751 cycles/op, 0 errors) 7865.49 tps (165.1 allocs/op, 513.3 logallocs/op, 53.6 tasks/op, 788898 insns/op, 425171 cycles/op, 0 errors) After: 9560.42 tps (172.2 allocs/op, 19.6 logallocs/op, 57.7 tasks/op, 567741 insns/op, 345158 cycles/op, 0 errors) 9445.95 tps (173.1 allocs/op, 19.7 logallocs/op, 58.1 tasks/op, 579075 insns/op, 352173 cycles/op, 0 errors) 9576.75 tps (172.2 allocs/op, 19.6 logallocs/op, 57.6 tasks/op, 572004 insns/op, 347373 cycles/op, 0 errors) 9597.16 tps (172.2 allocs/op, 19.6 logallocs/op, 57.6 tasks/op, 569615 insns/op, 346618 cycles/op, 0 errors) 9454.07 tps (173.5 allocs/op, 19.8 logallocs/op, 58.3 tasks/op, 579213 insns/op, 351569 cycles/op, 0 errors) Disabling the partition index doesn't improve the throughput beyond that.	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	b6bfdeb111	dht: Introduce raw_token Most tokens stored in data structures are for key-scoped tokens, and we don't need to pay for token::kind storage.	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	3775593e53	test: perf_simple_query: Add 'sstable-format' command-line option	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	6ee9bc63eb	test: perf_simple_query: Add 'sstable-summary-ratio' command-line option	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	38d130d9d0	test: perf-simple-query: Add option to disable index cache	2026-03-18 16:25:20 +01:00
Tomasz Grabiec	5ee61f067d	test: cql_test_env: Respect enable-index-cache config Mirrors the code in main.cc	2026-03-18 16:25:20 +01:00
Aleksandra Martyniuk	2d16083ba6	tasks: fix indentation	2026-03-18 15:37:24 +01:00
Aleksandra Martyniuk	1fbf3a4ba1	tasks: do not fail the wait request if rpc fails During decommission, we first mark a topology request as done, then shut down a node and in the following steps we remove the node from the topology. Thus, a finished request does not imply that a node is removed from the topology. Due to that, in node_ops_virtual_task::wait, while gathering children from the whole cluster, we may hit the connection exception - because a node is still in topology, even though it is down. Modify the get_children method to ignore the exception and warn about the failure instead.	2026-03-18 15:37:24 +01:00
Aleksandra Martyniuk	d4fdeb4839	tasks: pass token_metadata_ptr to task_manager::virtual_task::impl::get_children In get_children we get the vector of alive nodes with get_nodes. Yet, between this and sending rpc to those nodes there might be a preemption. Currently, the liveness of a node is checked once again before the rpcs (only with gossiper not in topology - unlike get_nodes). Modify get_children, so that it keeps a token_metadata_ptr, preventing topology from changing between get_nodes and rpcs. Remove test_get_children as it checked if the get_children method won't fail if a node is down after get_nodes - which cannot happen currently.	2026-03-18 15:37:24 +01:00
Calle Wilund	0013f22374	memtable_test::memtable_flush_period: Change sleep to use injection signal instead Fixes: SCYLLADB-942 Adds an injection signal _from_ table::seal_active_memtable to allow us to reliably wait for flushing. And does so. Closes scylladb/scylladb#29070	2026-03-18 16:23:13 +02:00
Botond Dénes	ae17596c2a	Merge 'Demote log level on split failure during shutdown' from Raphael Raph Carvalho Since commit `509f2af8db`, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951. Fixes https://github.com/scylladb/scylladb/issues/24850. Only 2026.1 is affected. Closes scylladb/scylladb#29032 * github.com:scylladb/scylladb: replica: Demote log level on split failure during shutdown service: Demote log level on split failure during shutdown	2026-03-18 16:21:05 +02:00
Pavel Emelyanov	8b1ca6dcd6	database: Rate limit all tokens from a range The limiter scans ranges to decide whether or not to rate-limit the query. However, when considering each range only the front one's token is accounted. This looks like a misprint. The limiter was introduced in `cc9a2ad41f` Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29050	2026-03-18 13:50:48 +01:00
Pavel Emelyanov	d68c92ec04	test: Replace a bunch of ternary operators with an if-else block A followup of the merge of two test cases that happened in the previous patch. Both used `foo = N if domain == bar else M` to evaluate the parameters for topology. Using if-else block makes it immediately obvious which topology and scope apply for each domain value without having to evaluate multiple inline conditionals. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-18 13:08:36 +03:00
Pavel Emelyanov	b1d4fc5e6e	test: Squash test_restore_primary_replica_same\|different_domain tests The two tests differ only in the way they set up the topology for the cluster and the post-restore checks against the resulting streams. The merge happens with the help of a "scope_is_same" boolean parameter and corresponding updates in the topology setup and post-checks. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-18 13:08:36 +03:00
Pavel Emelyanov	21c603a79e	test: Use the same regexp in test_restore_primary_replica_different\|same_domain-s The one in "different domain" test is simpler because the test performs fewer checks. Next patch will merge both tests and making regexp-s look identical makes the merge even smoother. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-18 13:07:09 +03:00
Emil Maskovsky	34f3916e7d	.github: update test instructions for unified pytest runner Update test running instructions to reflect unified pytest-based runner. The test.py now requires full test paths with file extensions for both C++ and Python tests. No backport: The change is only relevant for recent test.py changes in master. Closes scylladb/scylladb#29062	2026-03-18 09:28:28 +01:00
Marcin Maliszkiewicz	04bf631d7f	auth: switch query_attribute_for_all to use cache	2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz	cf578fd81a	auth: switch get_attribute to use cache	2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz	06d16b6ea2	auth: cache: add heterogeneous map lookups Some callers have only string_view role name, they shouldn't need to allocate sstring to do the lookup.	2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz	7fdb1118f5	auth: switch query_all to use cache	2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz	fca11c5a21	auth: switch query_all_directly_granted to use cache	2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz	6f682f7eb1	auth: cache: add ability to go over all roles This is needed to implement auth service api where we list all roles.	2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz	61952cd985	raft: service: reload auth cache before service levels Since service levels depend on auth data, and not other way around, we need to ensure a proper loading order.	2026-03-18 09:06:20 +01:00
Marcin Maliszkiewicz	c4cfb278bc	service: raft: move update_service_levels_effective_cache check The auth::cache::includes_table function also covers role_members and role_attributes. The existing check was removed because it blocked these tables from triggering necessary cache updates. While previously non-critical (due to unused attributes and table coupling), maintaining a correct cache is essential for upcoming changes.	2026-03-18 09:06:20 +01:00
Benny Halevy	c2a6d1e930	test/boost/database_test: add test_snapshot_ctl_details_exception_handling Verify that the directory listers opened by get_snapshot_details are properly closed when handling an (injected) exception. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-18 09:37:44 +02:00
Benny Halevy	6dc4ea766b	table: get_snapshot_details: fix indentation inside try block Whitespace-only change: indent the loop body one level inside the try block added in the previous commit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-18 09:28:50 +02:00
Benny Halevy	b09d45b89a	table: per-snapshot get_snapshot_details: fix typo in comment The comment says the snapshot directory may contain a `schema.sql` file, but the code treats `schema.cql` as the special-case schema file. Reported-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-18 09:27:40 +02:00
Benny Halevy	580cc309d2	table: per-snapshot get_snapshot_details: always close lister using try/catch Since this is a coroutine, we cannot just use deferred_close, but rather we need to catch an error, close the lister, and then return the error, if applicable. Fixes: SCYLLADB-1013 Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-18 09:27:23 +02:00
Benny Halevy	78c817f71e	table: get_snapshot_details: always close lister using deferred_close Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-18 09:26:26 +02:00
Dario Mirovic	71e6918f28	test: use NetworkTopologyStrategy in maintenance socket tests NetworkTopologyStrategy is the preferred choice. We should not use SimpleStrategy anymore. This patch changes the topology strategy for all the maintenance socket tests. Refs SCYLLADB-1070	2026-03-17 20:20:47 +01:00
Dario Mirovic	278535e4e3	test: use cleanup fixture in maintenance socket auth tests Add a cql_clusters pytest fixture that tracks CQL driver Cluster objects and shuts them down automatically after test completion. This replaces manual shutdown() calls at the end of each test. Also consolidate shutdown() calls in retry helpers into finally blocks for consistent cleanup. Refs SCYLLADB-1070	2026-03-17 20:15:30 +01:00
Dario Mirovic	2e4b72c6b9	auth: add maintenance_socket_authorizer GRANT/REVOKE fails on the maintenance socket connections, because maintenance_auth_service uses allow_all_authorizer. allow_all_authorizer allows all operations, but not GRANT/REVOKE, because they make no sense in its context. This has been observed during PGO run failure in operations from ./pgo/conf/auth.cql file. This patch introduces maintenance_socket_authorizer that supports the capabilities of default_authorizer ('CassandraAuthorizer') without needing authorization. Refs SCYLLADB-1070	2026-03-17 19:19:41 +01:00
Botond Dénes	172c786079	Merge 'perf-alternator: wait for alternator port before running workload' from Marcin Maliszkiewicz This patch is mostly for the purpose of running pgo CI job. We may receive connection error if asyncio.sleep(5) in pgo.py is not sufficient waiting time. In pgo.py we do wait for port but only for cql, anyway it's better to have high level check than trying to wait for alternator port there. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1071 Backport: 2026.1 - it failed on CI for that build Closes scylladb/scylladb#29063 * github.com:scylladb/scylladb: perf: add abort_source support to wait-for-port loops perf-alternator: wait for alternator port before running workload	2026-03-17 18:38:11 +02:00
Botond Dénes	5d868dcc55	Merge 's3_client: fix s3::range max value for object size' from Ernest Zaslavsky - fix s3::range max value for object size which is 50TiB and not 5. - refactor constants to make it accessible for all interested parties, also reuse these constants in tests No need to backport, doubt we will encounter an object larger than 5TiB Closes scylladb/scylladb#28601 * github.com:scylladb/scylladb: s3_client: reorganize tests in part_size_calculation_test s3_client: switch using s3 limits constants in tests s3_client: fix the s3::range max object size s3_client: remove "aws" prefix from object limits constants s3_client: make s3 object limits accessible	2026-03-17 16:34:42 +02:00
Anna Stuchlik	f4a6bb1885	doc: remove the Open Source Example from Installation This commit replaces the Open Source example from the Installation section for CentOS. We updated the example for Ubuntu, but not for CentOS. We don't want to have any Open Source information in the docs. Fixes https://github.com/scylladb/scylladb/issues/29087	2026-03-17 14:54:32 +01:00
Anna Stuchlik	95bc8911dd	doc: replace http with https in the installation instructions Fixes https://github.com/scylladb/scylladb/issues/17227	2026-03-17 14:46:16 +01:00
Dawid Mędrek	a8dd13731f	Merge 'Improve debuggability of test/cluster/test_data_resurrection_in_memtable.py' from Botond Dénes This test was observed to fail in CI recently but there is not enough information in the logs to figure out what went wrong. This PR makes a few improvements to make the next investigation easier, should it be needed: * storage-service: add table name to mutation write failure error messages. * database: the `database_apply` error injection used to cause trouble, catching writes to bystander tables, making tests flaky. To eliminate this, it gained a filter to apply only to non-system keyspaces. Unfortunately, this still allows it to catch writes to the trace tables. While this should not fail the test, it reduces observability, as some traces disappear. Improve this error injection to only apply to selected table. Also merge it with the `database_apply_wait` error injection, to streamline the code a bit. * test/test_data_resurrection_in_memtable.py: dump data from the table, before the checks for expected data, so if checks fail, the data in the table is known. Refs: SCYLLADB-812 Refs: SCYLLADB-870 Fixes: SCYLLADB-1050 (by restricting `database_apply` error injection, so it doesn't affect writes to system traces) Backport: test related improvement, no backport Closes scylladb/scylladb#28899 * github.com:scylladb/scylladb: test/cluster/test_data_resurrection_in_memtable.py: dump rows before check replica/database: consolidate the two database_apply error injections service/storage_proxy: add name of table to error message for write errors	2026-03-17 13:35:19 +01:00
Botond Dénes	318aa07158	Merge ' test/alternator: use module-scope fixtures in test_streams.py ' from Nadav Har'El Previously, all stream-table fixtures in test_streams.py used scope="function", forcing a fresh table to be created for every test, slowing down the test a bit (though not much), and discouraging writing small new tests. This was a workaround for a DynamoDB quirk (that Alternator doesn't have): LATEST shard iterators have a time slack and may point slightly before the true stream head, causing leftover events from a previous test to appear in the next test's reads. The first two patches in this series fix small problems that turn up once we start sharing test tables in test_streams.py. The final patch fixes the "LATEST" problem and enables sharing the test table by using "module" scope fixtures instead of "function". After this series, test_streams.py run time went down a bit, from 20.2 seconds to 17.7 seconds. Closes scylladb/scylladb#28972 * github.com:scylladb/scylladb: test/alternator: speed up test_streams.py by using module-scope fixtures test/alternator: test_streams.py don't use fixtures in 4 tests test/alternator: fix do_test() in test_streams.py	2026-03-17 13:56:16 +02:00
Ernest Zaslavsky	7f597aca67	cmake: fix broken build Add raft_util.idl.hh to cmake to generate the code properly Closes scylladb/scylladb#29055	2026-03-17 10:35:34 +01:00
Botond Dénes	dbe70cddca	test/boost/querier_cache_test: make test_time_based_cache_eviction less sensitive to timing This test relies on the cache entry being evicted after 200ms past the TTL. This may not happen on a busy CI machine. Make the test less reliant on timing by using eventually_true(). Simplify the test by dropping the second entry, it doesn't add anything to the test. Fixes: SCYLLADB-811 Closes scylladb/scylladb#28958	2026-03-17 10:32:23 +01:00
Botond Dénes	0fd51c4adb	test/nodetool: rest_api_mock_server: add retry for status code 404 This fixture starts the mock server and immediately connects to it to set up the expected requests. The connection attempt might be too early, so there is a retry loop with a timeout. The loop currently checks for requests.exceptions.ConnectionError. We've seen a case where the connection is successful but the request fails with 404. The mock started the server but didn't set up the routes yet. Add a retry for HTTP 404 to handle this. Fixes: SCYLLADB-966 Closes scylladb/scylladb#29003	2026-03-17 10:30:23 +01:00
Asias He	6cb263bab0	repair: Prevent CPU stall during cross-shard row copy and destruction When handling `repair_stream_cmd::end_of_current_rows`, passing the foreign list directly to `put_row_diff_handler` triggered a massive synchronous deep copy on the destination shard. Additionally, destroying the list triggered a synchronous deallocation on the source shard. This blocked the reactor and triggered the CPU stall detector. This commit fixes the issue by introducing `clone_gently()` to copy the list elements one by one, and leveraging the existing `utils::clear_gently()` to destroy them. Both utilize `seastar::coroutine::maybe_yield()` to allow the reactor to breathe during large cross-shard transfers and cleanups. Fixes SCYLLADB-403 Closes scylladb/scylladb#28979	2026-03-17 11:05:15 +02:00
Pavel Emelyanov	9fe19ec9d9	sstables: Fix object storage lister not resetting position in batch vector The lister loop in get() pre-fetches records in batches and keeps them in a _info vector, iterating over it with the help of _pos cursor. When the vector is re-read, the cursor must be reset too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-17 10:32:42 +03:00
Pavel Emelyanov	1a6a7647c6	sstables: Fix object storage lister skipping entries when filter is active The lister loop in get() method looks weird. It uses do-while(false) loop and calls continue; inside when filter asks to skip an entry. Skipping, thus, aborts the whole thing and EOF-s, which is not what's supposed to happen. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-17 10:32:40 +03:00
Botond Dénes	035aa90d4b	Merge 'Alternator: add per-table batch latency metrics and test coverage' from Amnon Heiman This series fixes a metrics visibility gap in Alternator and adds regression coverage. Until now, BatchGetItem and BatchWriteItem updated global latency histograms but did not consistently update per-table latency histograms. As a result, table-level latency dashboards could miss batch traffic. It updates the batch read/write paths to compute request duration once and record it in both global and per-table latency metrics. Add the missing tests, including a metric-agnostic helper and a dedicated per-table latency test that verifies latency counters increase for item and batch operations. This change is metrics-only (no API/behavior change for requests) and improves observability consistency between global and per-table views. Fixes #28721 We assume the alternator per-table metrics exist, but the batch ones are not updated Closes scylladb/scylladb#28732 * github.com:scylladb/scylladb: test(alternator): add per-table latency coverage for item and batch ops alternator: track per-table latency for batch get/write operations	2026-03-16 17:18:00 +02:00
Michał Hudobski	40d180a7ef	docs: update vector search filtering to reflect primary key support only Remove outdated references to filtering on columns provided in the index definition, and remove the note about equal relations (= and IN) being the only supported operations. Vector search filtering currently supports WHERE clauses on primary key columns only. Closes scylladb/scylladb#28949	2026-03-16 17:16:16 +02:00
Botond Dénes	9de8d6798e	Merge 'reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory' from Łukasz Paszkowski Permits in the `waiting_for_memory` state represent already-executing reads that are blocked on memory allocation. Preemptively aborting them is wasteful -- these reads have already consumed resources and made progress, so they should be allowed to complete. Restrict the preemptive abort check in maybe_admit_waiters() to only apply to permits in the `waiting_for_admission` state, and tighten the state validation in `on_preemptive_aborted()` accordingly. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016 Backport not needed. The commit introducing replica load shedding is not part of 2026.1 Closes scylladb/scylladb#29025 * github.com:scylladb/scylladb: reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit	2026-03-16 17:14:25 +02:00
Marcin Maliszkiewicz	9318c80203	perf: add abort_source support to wait-for-port loops Check abort_source on each retry iteration in wait_for_alternator and wait_for_cql so the wait can be interrupted on shutdown. Didn't use sleep_abortable as the sleep is very short anyway.	2026-03-16 16:14:10 +01:00
Calle Wilund	a5df2e79a7	storage_service: Wait for snapshot/backup before decommission Fixes: SCYLLADB-244 Disables snapshot control such that any active ops finish/fail before proceeding with decommission. Note: snapshot control provided as argument, not member ref due to storage_service being used from both main and cql_test_env. (The latter has no snapshot_ctl to provide). Could do the snapshot lockout on API level, but want to do pre-checks before this. Note: this just disables backup/snapshot fully. Could re-enable after decommission, but this seems somewhat pointless. v2: * Add log message to snapshot shutdown * Make test use log waiting instead of timeouts Closes scylladb/scylladb#28980	2026-03-16 17:12:57 +02:00
Marcin Maliszkiewicz	edf0148bee	perf-alternator: wait for alternator port before running workload This patch is mostly for the purpose of running pgo CI job. We may receive connection error if asyncio.sleep(5) in pgo.py is not sufficient waiting time. In pgo.py we do wait for port but only for cql, anyway it's better to have high level check than trying to wait for alternator port there.	2026-03-16 16:07:52 +01:00
bitpathfinder	85d5073234	test: Fix non-awaited coroutine in test_gossiper_empty_self_id_on_shadow_round The line with the error was not actually needed and has therefore been removed. Fixes: SCYLLADB-906 Closes scylladb/scylladb#28884	2026-03-16 17:07:36 +02:00
Botond Dénes	3e4e0c57b8	Merge 'Relax rf-rack-valid-keyspace option in backup/restore tests' from Pavel Emelyanov Some tests, when creating a cluster, configure nodes with the rf-rack-valid option, because sometimes they want to have it OFF. For that the option is explicitly carried around, but the cluster creating helper can guess this option itself -- out of the provided topology and replication factor. Removing this option simplifies the code and (which is a nicer outcome) the test "signature" that's used e.g. in command-line to run a specific test. Improving tests, not backporting Closes scylladb/scylladb#28860 * github.com:scylladb/scylladb: test: Relax topology_rf_validity parameter for some tests test: Auto detect rf-rack-valid option in create_cluster()	2026-03-16 17:06:46 +02:00
Raphael S. Carvalho	ee87b66033	replica: Demote log level on split failure during shutdown Dtest failed with: table - Failed to load SSTable .../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db of origin memtable due to std::runtime_error (Cannot split .../me-3gyn_0qwi_313gw2n2y90v2j4fcv-big-Data.db because manager has compaction disabled, reason might be out of space prevention), it will be unlinked... The reason is that the error above is being triggered when the cause is shutdown, not out of space prevention. Let's distinguish between the two cases and log the error with warning level on shutdown. Fixes https://github.com/scylladb/scylladb/issues/24850. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2026-03-16 12:03:17 -03:00
Patryk Jędrzejczak	526e5986fe	test: test_raft_no_quorum: decrease group0_raft_op_timeout_in_ms after quorum loss `test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev mode. The bootstrap of the first node can fail due to `add_entry()` timing out (with the 1s timeout set by the test case). Other test cases in this test file could fail in the same way as well, so we need a general fix. We don't want to increase the timeout in dev mode, as it would slow down the test. The solution is to keep the timeout unchanged, but set it only after quorum is lost. This prevents unexpected timeouts of group0 operations with almost no impact on the test running time. A note about the new `update_group0_raft_op_timeout` function: waiting for the log seems to be necessary only for `test_quorum_lost_during_node_join_response_handler`, but let's do it for all test cases just in case (including `test_can_restart` that shouldn't be flaky currently). Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913 Closes scylladb/scylladb#28998	2026-03-16 16:58:15 +02:00
Raphael S. Carvalho	b508f3dd38	service: Demote log level on split failure during shutdown Since commit `509f2af8db`, gate_closed_exception can be triggered for ongoing split during shutdown. The commit is correct, but it causes split failure on shutdown to log an error, which causes CI instability. Previously, aborted_exception would be triggered instead which is logged as warning. Let's do the same. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-951. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2026-03-16 11:52:00 -03:00
Dani Tweig	bc0952781a	Update Jira sync calling workflow to consolidated view Replaced multiple per-action workflow jobs with a single consolidated call to main_pr_events_jira_sync.yml. Added 'edited' event trigger. This makes CI actions in PRs more readable and workflow execution faster. Fixes:PM-253 Closes scylladb/scylladb#29042	2026-03-16 08:25:32 +02:00
Artsiom Mishuta	755d528135	test.py: fix warnings changes in this commit: 1) rename class from 'TestContext' to 'Context' so pytest will not consider this class as a test 2) extend pytest filterwarnings list to ignore warnings from external libs 3) use datetime.datetime.now(datetime.UTC) instead of datetime.datetime.utcnow() 4) use ResultSet.one() instead of ResultSet[0] Fixes SCYLLADB-904 Fixes SCYLLADB-908 Related SCYLLADB-902 Closes scylladb/scylladb#28956	2026-03-15 12:00:10 +02:00
Karol Nowacki	7659a5b878	vector_search: test: fix flaky test The test assumes that the sleep duration will be at least the value of the sleep parameter. However, the actual sleep time can be slightly less than requested (e.g., a 100ms sleep request might result in a 99ms sleep). This commit adjusts the test's time comparison to be more lenient, preventing test flakiness.	2026-03-13 16:28:22 +01:00
Karol Nowacki	5474cc6cc2	vector_search: fix race condition on connection timeout When a `with_connect` operation timed out, the underlying connection attempt continued to run in the reactor. This could lead to a crash if the connection was established/rejected after the client object had already been destroyed. This issue was observed during the teardown phase of an upcoming high-availability test case. This commit fixes the race condition by ensuring the connection attempt is properly canceled on timeout. Additionally, the explicit TLS handshake previously forced during the connection is now deferred to the first I/O operation, which is the default and preferred behavior. Fixes: SCYLLADB-832	2026-03-13 16:28:22 +01:00
Piotr Dulikowski	d8b283e1fb	Merge 'Add CQL forwarding for strongly consistent tables' from Wojciech Mitros In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica. The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's CQL request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`). For sending the RPC, the CQL server needs to obtain the information about which node it should forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. 
We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader. This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself. Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71) [SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#27517 * github.com:scylladb/scylladb: test: add tests for CQL forwarding transport: enable CQL forwarding for strong consistency statements transport: add remote statement preparation for CQL forwarding transport: handle redirect responses in CQL forwarding transport: add exception handling for forwarded CQL requests transport: add basic CQL request forwarding idl: add a representation of client_state for forwarding cql_server: handle query, execute, batch in one case transport: inline process_on_shard in cql_server::process transport: extract process() to cql_server transport: add messaging_service to cql_server transport: add response reconstruction helpers for forwarding transport: generalize the bounce result message for bouncing to other nodes strong consistency: redirect requests to live replicas from the same rack transport: pass foreign_ptr into 
sleep_until_timeout_passes and move it to cql_server transport: extract the error handling from process_request_one transport: move error response helpers from connection to cql_server	2026-03-13 15:03:10 +01:00
Andrzej Jackowski	60aaea8547	cql: improve write consistency level guardrail messages Update warn and fail messages for the write_consistency_levels_warned and write_consistency_levels_disallowed guardrails to include the configuration option name and actionable guidance. The main motivation is to make the messages follow the conventions of other guardrails. Refs: SCYLLADB-257	2026-03-13 14:40:45 +01:00
Tomasz Grabiec	518470e89e	Merge 'load_stats: improve tablet filtering for load stats' from Ferenc Szili When computing table sizes via load_stats to determine if a split/merge is needed, we are filtering tablets which are being migrated, in order to avoid counting them twice (both on leaving and pending replica) in the total table size. The tablets are filtered so that they are counted on the leaving replica until the streaming stage, and on the pending replica after the streaming stage. Currently, the procedure for collecting tablet sizes for load balancing also uses this same filter. This should be changed, because the load balancer needs to have as much information about tablet sizes as possible, and could ignore a node due to missing tablet sizes for tablets in the `write_both_read_new` and `use_new` stages. For tablet size collection, we should include all the tablets which are currently taking up disk space. This means: - on leaving replica, include all tablets until the `cleanup` stage - on pending replica, include all tablets starting with the `write_both_read_new` and later stages While this is an improvement, it causes problems with some of the tests, and therefore needs to be backported to 2026.1 Fixes: SCYLLADB-829 Closes scylladb/scylladb#28587 * github.com:scylladb/scylladb: load_stats: add filtering for tablet sizes load_stats: move tablet filtering for table size computation load_stats: bring the comment and code in sync	2026-03-13 13:08:11 +01:00
Pavel Emelyanov	d544d8602d	test: Relax topology_rf_validity parameter for some tests Tests that call create_cluster() helper no longer need to carry the rf-validity parameter. This simplifies the code and test signature. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-13 14:30:32 +03:00
Pavel Emelyanov	313985fed7	test: Auto detect rf-rack-valid option in create_cluster() The helper accepts it as a boolean argument, but it can easily estimate one from the provided topology. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-13 14:30:32 +03:00
Gleb Natapov	fae5282c82	service level: fix crash during migration to driver server level Before `b59b3d4` the migration code checked that service level controller is on v2 version before migration and the check also implicitly checked that _sl_data_accessor field is already initialized, but now that the check is gone the migration can start before service level controller is fully initialized. Re-add the check, but in a different place. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1049 Closes scylladb/scylladb#29021	2026-03-13 11:24:26 +01:00
Łukasz Paszkowski	4c4d043a3b	reader_concurrency_semaphore: skip preemptive abort for permits waiting for memory Permits in the `waiting_for_memory` state represent already-executing reads that are blocked on memory allocation. Preemptively aborting them is wasteful -- these reads have already consumed resources and made progress, so they should be allowed to complete. Restrict the preemptive abort check in maybe_admit_waiters() to only apply to permits in the `waiting_for_admission` state, and tighten the state validation in `on_preemptive_aborted()` accordingly. Adjust the following tests: + test_reader_concurrency_semaphore_abort_preemptively_aborted_permit no longer relies on requesting memory + test_reader_concurrency_semaphore_preemptive_abort_requested_memory_leak adjusted to the fix Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1016	2026-03-13 09:50:05 +01:00
Dani Tweig	aa46a0f4e0	Add VECTOR to the list of synced milestones in scylladb.git - Added VECTOR to the comma-separated list of Jira project keys in `call_sync_milestone_to_jira.yml`. - The `jira_project_keys` value changed from `SCYLLADB,CUSTOMER,SMI,RELENG` to `SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR`. - The VECTOR project needs to sync with scylladb.git milestones, so that when a GitHub milestone is created or closed in scylladb/scylladb, the corresponding Jira release is also created or released in the VECTOR project. - Previously only SCYLLADB, CUSTOMER, SMI, and RELENG projects were synced. Fixes:PM-220 Closes scylladb/scylladb#29014	2026-03-13 09:58:41 +02:00
Botond Dénes	fc8cebd671	Merge 'Verify components digests during component load and scrub in validate mode' from Taras Veretilnyk This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions component digests are also validated during scrub in validate mode. Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception. Depends on https://github.com/scylladb/scylladb/pull/28338 Backport is not required, this is a new feature Fixes https://github.com/scylladb/scylladb/issues/20103 Closes scylladb/scylladb#28761 * github.com:scylladb/scylladb: test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade sstables: propagate ignore_component_digest_mismatch config to all load sites sstables: add option to ignore component digest mismatches sstable_compaction_test: Add scrub validate test for corrupted index sstables: add tests for component digest validation on corrupted SSTables sstables: validate index components digests during SSTable scrub in validate mode sstables: verify component digests on SSTable load sstables: add digest_file_random_access_reader for CRC32 digest computation	2026-03-13 09:55:55 +02:00
Avi Kivity	ae8a418744	Merge 'Await async calls in test tablets migration' from Benny Halevy Fix several test cases that did not await async tasks: - test_restart_leaving_replica_during_cleanup - test_restart_in_cleanup_stage_after_cleanup - test_tablet_back_and_forth_migration - test_staging_backlog_is_preserved_with_file_based_streaming Fixes SCYLLADB-910 * Minor fixes, no backport needed Closes scylladb/scylladb#28908 * github.com:scylladb/scylladb: test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task test_tablets_migration: drop unused imports from cassandra.query	2026-03-13 00:20:29 +02:00
Avi Kivity	b228eb26e6	Merge 'dbuild: Use slirp4netns network in dbuild nested containers' from Calle Wilund Fixes #25084 Add slirp4netns and use for nested containers. This will allow nested container port aliasing, helping CI stability. Note: this contains an updated Dockerfile for dbuild image, but since chicken and eggs, right now will force install slirp4netns before anything in dbuild script. Updates the mock server handling to use ephemeral ports and query from container, ensuring we don't get port collisions. (boost as well as pytest). Includes a timeout up, and a tweak to our scylla_cluster handling, ensuring we don't deadlock when pipe size is less than required for our sys notify messages. Closes scylladb/scylladb#28727 * github.com:scylladb/scylladb: gcs_fixture: Change to use docker helper aws_kms_fixture: Modify to use docker helper test/lib/proc_util: Add docker helper pytest: use ephemeral port publish for docker mock servers dbuild: Use container network in dbuild nested containers scylla_cluster: Read notify sock in background to prevent deadlock	2026-03-12 23:49:25 +02:00
Nadav Har'El	ad832c263e	test/cluster: mark test_alternator_concurrent_rmw_same_partition_different_server not strictly xfail A few days ago, in commit `7b30a39` we added to pytest.ini the option xfail_strict. This option causes every time a test XPASSes, i.e., an xfail test actually passes - to be considered an error and fail the test. But some tests demonstrate a timing-related bug and do not reproduce the bug every single time. An example we noticed in one CI run is: test/cluster/test_alternator.py::test_alternator_concurrent_rmw_same_partition_different_server This test reproduces a timing-related bug (if you do an LWT write to one partition on two different coordinators "at the same time", you can get a failure), but only most of the time, not 100% of the time. The solution is to add "strict=False" for the xfail marker on this specific test. This undoes the xfail_strict for this specific test, accepting that this specific test can either pass or fail. Note that this does NOT make this test worthless - we still see this test failing most of the time, and when a developer finally fixes this issue, the test will begin to pass all the time. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-941 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#29016	2026-03-12 23:46:23 +02:00
Tomasz Grabiec	1256a9faa7	tablets: Fix deadlock in background storage group merge fiber When it deadlocks, groups stop merging and compaction group merge backlog will run away. Also, graceful shutdown will be blocked on it. Found by flaky unit test test_merge_chooses_best_replica_with_odd_count, which timed out in 1 in 100 runs. Reason for deadlock: When storage groups are merged, the main compaction group of the new storage group takes a compaction lock, which is appended to _compaction_reenablers_for_merging, and released when the merge completion fiber is done with the whole batch. If we accumulate more than 1 merge cycle for the fiber, deadlock occurs. Lock order will be this Initial state: cg0: main cg1: main cg2: main cg3: main After 1st merge: cg0': main [locked], merging_groups=[cg0.main, cg1.main] cg1': main [locked], merging_groups=[cg2.main, cg3.main] After 2nd merge: cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main] merge completion fiber will try to stop cg0'.main, which will be blocked on compaction lock, which is held by the reenabler in _compaction_reenablers_for_merging, hence deadlock. The fix is to wait for background merge to finish before we start the next merge. It's achieved by holding old erm in the background merge, and doing a topology barrier from the merge finalizing transition. Background merge is supposed to be a relatively quick operation, it's stopping compaction groups. So may wait for active requests. It shouldn't prolong the barrier indefinitely. Tablet boost unit tests which trigger merge need to be adjusted to call the barrier, otherwise they will be vulnerable to the deadlock. Two cluster tests were removed because they assumed that merge happens in the background. 
Now that it happens as part of merge finalization, and blocks topology state machine, those tests deadlock because they are unable to make topology changes (node bootstrap) while background merge is blocked. The test "test_tablets_merge_waits_for_lwt" needed to be adjusted. It assumed that merge finalization doesn't wait for the erm held by the LWT operation, and triggered tablet movement afterwards, and assumed that this migration will issue a barrier which will block on the LWT operation. After this commit, it's the barrier in merge finalization which is blocked. The test was adjusted to use an earlier log mark when waiting for "Got raft_topology_cmd::barrier_and_drain", which will catch the barrier in merge finalization. Fixes SCYLLADB-928	2026-03-12 22:45:01 +01:00
Tomasz Grabiec	7706c9e8c4	replica: table: Propagate old erm to storage group merge	2026-03-12 22:45:01 +01:00
Tomasz Grabiec	582a4abeb6	test: boost: tablets_test: Save tablet metadata when ACKing split resize decision Needs to be ordered before split finalization, because storage_group must be in split mode already at finalization time. There must be split-ready compaction groups, otherwise finalization fails with this error: Found 0 split ready compaction groups, but expected 2 instead. Exposed by increased split activity in tests.	2026-03-12 22:45:01 +01:00
Tomasz Grabiec	279fcdd5ff	storage_service: Extract local_topology_barrier() Will be called in tests. It does the local part of the global topology barrier. The comment: // We capture the topology version right after the checks // above, before any yields. This is crucial since _topology_state_machine._topology // might be altered concurrently while this method is running, // which can cause the fence command to apply an invalid fence version. was dropped, because it's no longer true after `fad6c41cee`, and it doesn't make sense in the context of local_topology_barrier(). We'd have to propagate the version to local_topology_barrier(), but it's pointless. The fence version is decided before calling the local barrier, and it will be valid even if local version moves ahead.	2026-03-12 22:44:56 +01:00
Avi Kivity	03186ce60d	Merge 'Cleanup after auth v1 and default superuser code removal' from Marcin Maliszkiewicz This is a short cleanup after the recent removal of the code creating the default cassandra superuser and of the auth-v1 code. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1036 Backport: no, just code cleanup Closes scylladb/scylladb#29004 * github.com:scylladb/scylladb: auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD auth: use configurable default_superuser in describe_roles auth: move default_superuser to common, remove _superuser member auth: use LOCAL_ONE for all auth queries auth: remove get_auth_ks_name indirection	2026-03-12 23:44:32 +02:00
Avi Kivity	e2eeef3e01	Merge 'service level: remove remnants of version 1 service level' from Gleb Natapov can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well. Also drop async versions of find_effective_service_level and get_user_scheduling_group since they are unused. No need to backport, code removal. Closes scylladb/scylladb#29002 * github.com:scylladb/scylladb: service level: make maybe_update_per_service_level_params synchronous service level: remove unused get_user_scheduling_group function service level: drop async find_effective_service_level service level: remove remnants of version 1 service level	2026-03-12 23:39:41 +02:00
Botond Dénes	eed3a6d407	sstables/mx/writer: move post-cell write yield to collection write loop Introduced by `54bddeb3b5`, the yield was added to write_cell(), to also help the general case where there is no collection. Arguably this was unnecessary and this patch moves the yield to write_collection(), to the cell write loop instead, so regular cells don't have to poll the preempt flag. Closes scylladb/scylladb#29013	2026-03-12 21:26:35 +02:00
Avi Kivity	e8a6706d6e	Merge 'shorten some sleeps to speed up bootstrap in tests' from Patryk Jędrzejczak This PR shortens two sleeps from 1s to 100ms to speed up bootstrap in tests. The changed sleeps are: - the pause duration in group0 discovery, - the retry period in `wait_for_cql`. Refs: https://scylladb.atlassian.net/browse/SCYLLADB-918 No backport: performance improvements mostly relevant to tests. Closes scylladb/scylladb#29020 * github.com:scylladb/scylladb: test: pylib: util: wait for CQL being ready with a shorter period group0: discovery: shorten the pause duration	2026-03-12 21:17:05 +02:00
Wojciech Mitros	32974770b0	test: add tests for CQL forwarding Add basic cluster tests for CQL forwarding. The test cases include: - basic reads and writes - prepared statements with binds - forwarding from a non-replica - exception passthrough during forwarding (using an injection) - re-preparing a statement on the target node, even if the user query is also an EXECUTE request on a prepared statement - verification of metric updates The existing test_basic_write_read was modified so that a few extra cases could be validated on the same cluster.	2026-03-12 19:43:35 +01:00
Wojciech Mitros	916a9995c1	transport: enable CQL forwarding for strong consistency statements We enable CQL forwarding by starting to return the bounce_to_node result message in redirect_statement() instead of throwing. The forwarding code introduced in the preceding patches reacts to these messages, allowing the requests to be forwarded. With the update, some tests assuming that requests can't be forwarded need to be adjusted, so we do that as well.	2026-03-12 19:43:35 +01:00
Wojciech Mitros	21a7b036a5	transport: add remote statement preparation for CQL forwarding During forwarding of CQL EXECUTE requests, the target node may not have the prepared statement in its cache. If we do have this statement as a coordinator, instead of returning PREPARED NOT FOUND to the client, we want to prepare the statement ourselves on the target node. For that, we add a new FORWARD_CQL_PREPARE RPC. We use the new RPC after getting the prepared_not_found status during forwarding. When we try to forward a request, we always have the query string (we decide whether to forward based on this query), so we can always use the new RPC when getting the prepared_not_found status. After receiving the response, we try forwarding the EXECUTE request again.	2026-03-12 19:43:35 +01:00
Wojciech Mitros	96a5e1c7ce	transport: handle redirect responses in CQL forwarding During CQL forwarding, when the target node can't handle the request, it will find another node which can execute the request or which knows where the request can be executed. We return this information in responses to CQL forwarding, and in this patch, we add handling of this kind of a response. After getting a redirect response, we retry forwarding to the returned host/shard until success or timeout. This can happen many times during a single request, when we first forward to a replica and later to the coordinator, or when a replica/coordinator migrated while we were performing the forwarding	2026-03-12 19:43:31 +01:00
Wojciech Mitros	8816d3038c	transport: add exception handling for forwarded CQL requests When a forwarded request fails on the remote node, we can't use the exception handling that happens in process_request_one because we don't go through this code path. Instead, we use the previously extracted cql_server::handle_exception handler, which performs all accounting on the forwarded-to node, and which prepares the response. For the read_failure_exception_with_timeout exception, we need to perform the sleep on the source node, so we return the timeout in the forwarding response and use it on the source node to know how long to sleep without any extra calculations. The handle_forward_execute() method is extracted from the inline handler lambda to make the error catching wrapper cleaner.	2026-03-12 19:41:37 +01:00
Wojciech Mitros	23bff5dfef	transport: add basic CQL request forwarding Add the infrastructure for forwarding CQL requests to other nodes. When a process() call results in a node bounce (as opposed to a shard bounce), the coordinator serializes the request and sends it via the FORWARD_CQL_EXECUTE RPC verb to the target node. In this patch we omit several features that allow handling more scenarios that can happen when trying to forward a CQL request, but the RPC request and response are already prepared for them. They will be handled in the following commits.	2026-03-12 19:41:35 +01:00
Avi Kivity	76b6784c1a	Merge 'cql3: track CQL parsing memory cost and use it for admission control' from Marcin Maliszkiewicz Use rolling_max_tracker to record gross bytes allocated during each CQL parse. The rolling maximum is then added to the memory estimate for incoming QUERY and PREPARE requests so that the admission control in the CQL transport layer accounts for parsing overhead. The measured memory footprint serves as an upper bound rather than an exact number, but its purpose is to prevent OOMs under unprepared statements heavy load. In a benchmark, a 1G memory node shows a decrease of non-LSA memory usage from peak 320MB (our coordinator budget is 10% of 1G) to 96MB, while tps drops from 1.2 kops to 0.8 kops. Drop in tps is expected as memory admission kicks in trying to prevent OOM. This is phase 1 of OOM prevention, potential next steps: - add second admission in query_processor::get_statement trying to prevent potential thundering herd problem - decrease cql_server memory pool size - count reads in the memory pool - add per service level memory pool and a shared one Related https://scylladb.atlassian.net/browse/SCYLLADB-740 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-938 Backport: no, new feature, but we may reconsider if some customer needs it Closes scylladb/scylladb#28919 * github.com:scylladb/scylladb: cql3: track CQL parsing memory cost and use it for admission control utils: add rolling max tracker	2026-03-12 19:59:52 +02:00
Wojciech Mitros	170b82ddca	idl: add a representation of client_state for forwarding In the following patches, when we start allowing to forward CQL requests to other nodes, we'll need to use the same client state for executing the request on the destination node as we had on the source. client_state contains many fields and we need to create a new instance of it when we start handling the forwarded request, so to prepare for the forwarding RPC, we add a serializable format of the client_state as an IDL struct. The new class is missing some fields that are not used while executing requests, and some whose value is determined by the fact that the client state is used for a forwarded request. These include: - driver name, driver version, client options - not used for executing requests. Instead, we use these as data sources for the virtual "clients" system table. - auth_state - must be READY - we reached a bounce message, so we were able to try executing the request locally - _control_connection - used for altering a cql_server::connection, which we don't have on the target node - _default_timeout_config - used when updating service levels, also only per-connection - workload_type - used for deciding whether to allow shedding at the start of processing the request, and for getting per-connection service level params (for an API)	2026-03-12 17:48:58 +01:00
Wojciech Mitros	b4a7fefe20	cql_server: handle query, execute, batch in one case Currently we perform the same steps when handling query, execute and batch CQL requests. So instead of creating multiple functions performing these steps, we can handle them all in one fallthrough case in cql_server::connection::process_request_one.	2026-03-12 17:48:58 +01:00
Wojciech Mitros	dadb87047c	transport: inline process_on_shard in cql_server::process The process_on_shard method is relatively short, it's only used in the process() method and the Process concept that it uses is as long as the function itself. This area will be made more complex by the following patches for cql forwarding, so we simplify it by inlining process_on_shard in cql_server::process.	2026-03-12 17:48:58 +01:00
Wojciech Mitros	24cdc3a10d	transport: extract process() to cql_server Move process() and process_on_shard() from cql_server::connection to cql_server. The process() method is no longer a template - instead, it takes an opcode parameter and uses get_process_fn_for_opcode() to select the appropriate internal processing function. The process_query, process_execute, and process_batch wrappers on connection now delegate to _server.process() with the appropriate opcode. This refactoring is preparation for CQL request forwarding, where process() will need to be called from a context other than connection (the forwarding RPC handler).	2026-03-12 17:48:57 +01:00
Wojciech Mitros	0e3469e89c	transport: add messaging_service to cql_server The messaging service will be used by cql_server to register RPC handlers for forwarding CQL requests between nodes. We pass it through the controller to cql_server.	2026-03-12 17:48:57 +01:00
Wojciech Mitros	1376caf980	transport: add response reconstruction helpers for forwarding Expose response::flags() and response::extract_body(), and a new constructor. It will be needed for creating a cql_transport::response from the response body returned during CQL forwarding.	2026-03-12 17:48:57 +01:00
Wojciech Mitros	e44820ba1f	transport: generalize the bounce result message for bouncing to other nodes In the following patches, we'll start allowing forwarding requests to strongly consistent tables so that they'll get executed on the suitable tablet Raft group members. For that we'll reuse the approach that we already have for bouncing requests to other shards - we'll try to execute a request locally, and the result of that will be a bounce message with another replica as the target. In this patch we generalize the former bounce_to_shard result message so that it will be able to specify the target of the bounce as another shard or specific replica. We also rename it to result_message::bounce so that it stops implying that only another shard may be its target. Aside from the host_id and the shard, the new message also includes the timeout, because in the service handling the forwarding we won't have access to it, and it's needed for specifying how long we should wait for the forwarded requests. It also includes information on whether this is a write request, so that the correct timeout response can be returned if the deadline is exceeded. We will return other hosts in the new bounce message when executing requests to strongly consistent tables when we can't handle the request because we aren't a suitable replica. We can't handle this message yet, so we don't return it anywhere and we still assume that every bounce message is a bounce to the same host.	2026-03-12 17:48:57 +01:00
Wojciech Mitros	b4d66fda2e	strong consistency: redirect requests to live replicas from the same rack Forwarding CQL requests is not implemented yet, but we're already prepared to return the target to forward to when trying to execute strongly consistent requests. Currently, if we're not a replica of the affected tablet, we redirect the request to the first replica in the list. This is not optimal, because this replica may be down or it may be in another rack, making us perform cross-rack requests during forwarding. Instead, we should forward the request to the replica from the same rack and handle the case where the replica is down. In this patch we change the replica selection for forwarding strongly consistent requests, so that when the coordinator isn't a replica, it redirects the request to the replica from the same rack. If the replica from the same rack is down, or there is no replica in our rack, we choose the next closest replica (preferring same-DC replicas over other DCs). If no replica is alive, the query fails - the driver should retry when some replica comes back up.	2026-03-12 17:48:54 +01:00
Andrzej Jackowski	3b9cd52a95	reader_concurrency_semaphore_test: detect memory leak on preemptive abort of waiting_for_memory permit A permit in `waiting_for_memory` state can be preemptively aborted by maybe_admit_waiters(). This is wrong: such permits have already been admitted and are actively processing a read — they are merely blocked waiting for memory under serialize-limit pressure. When `on_preemptive_aborted()` fires on a `waiting_for_memory` permit, it does not clear `_requested_memory`. A subsequent `request_memory()` call accumulates on top of the stale value, causing `on_granted_memory()` to consume more than resource_units tracks. This commit adds a test that confirms that scenario by counting internal_errors.	2026-03-12 17:09:34 +01:00
Alex	7fd39ba586	test/cluster: strengthen raft voters multi-DC test and tune debug runtime The test_raft_voters_multidc_kill_dc scenario had become weaker after group0 voter count was made always odd. In particular, the old num_nodes == 1 case (dc1=2, dc2=1, dc3=1) could pass even without the intended balancing logic, because with 3 voters total we naturally get one voter per DC. This change restores coverage of the original intent: - Replace num_nodes parametrization with explicit DC triples. - Use (3, 1, 1) to force a meaningful asymmetric topology where voter placement logic is required. - Keep a larger topology case (6, 3, 3) for broader coverage. - Mark (6, 3, 3) as skip_mode(debug) with reason: larger topology case is too slow in debug on minipcs. Also updated comments/docstring to match the new setup. Fixes: SCYLLADB-794 backport: None, it is done to deflake minipcs that will start working only on master Closes scylladb/scylladb#29000	2026-03-12 17:07:45 +01:00
Wojciech Mitros	309abc44d9	transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server Change sleep_until_timeout_passes() to accept a foreign_ptr<std::unique_ptr<response>>. We can easily create the foreign_ptr for the responses created in the CQL server, but we'll need this when we get responses when forwarding CQL statements - the responses may come from other shards. We also move it from cql_server::connection to cql_server, because for forwarded CQL requests, we'll need to handle it at the cql_server level. The method also loses its const qualifier - the abort_source that we pass into sleep_abortable needs to be non-const. Apparently, we could still use it in a const method of cql_server::connection because we passed it as _server._abort_source which caused the const qualifier to be lost.	2026-03-12 16:03:14 +01:00
Marcin Maliszkiewicz	975cd60e05	ldap: fix use-after-move crash in ldap_reuser::reap() After stop() moved _reaper, in-flight with_connection() callbacks could still call reap(), which accessed the moved-from future causing a SIGSEGV in future_base::detach_promise(). Add a seastar::gate so stop() waits for all in-flight operations before moving _reaper. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1043 Closes scylladb/scylladb#29015	2026-03-12 16:48:45 +02:00
Patryk Jędrzejczak	c50cf32793	test: pylib: util: wait for CQL being ready with a shorter period `wait_for_cql` is used in hundreds, if not thousands, of places in tests. We shouldn't waste up to 1s for every call. Also, the 1s period is clearly too long compared to the bootstrap time, which is usually 0-3s in dev mode. The following test speeds up from 50s to 42s with the change: ``` for _ in range(10): servers = await manager.servers_add(3) await manager.get_ready_cql(servers) ```	2026-03-12 15:40:19 +01:00
Patryk Jędrzejczak	f85628a9a0	group0: discovery: shorten the pause duration Nodes currently pause group0 discovery for 1s. This case is always hit while adding multiple nodes in parallel to an empty cluster by all nodes except the one that becomes the group0 leader. This is fine in production, but in tests, the slowdown is quite significant. Every `manager.servers_add(n)` call for n > 1 becomes 1s slower when the cluster is empty. Many cluster tests are affected. In this commit, we decrease the sleep duration from 1s to 100ms to speed up tests. The consequence of this change is that nodes might perform more steps in group0 discovery, but the increase in CPU usage and network traffic should be negligible.	2026-03-12 15:40:18 +01:00
Gleb Natapov	c67f876893	service level: make maybe_update_per_service_level_params synchronous It does not call async functions any more.	2026-03-12 15:53:08 +02:00
Benny Halevy	b3fec20960	test_tablets_migration: test_staging_backlog_is_preserved_with_file_based_streaming: convert for loop to asyncio.gather Currently the test iterates on all servers and calls manager.api.disable_injection but it doesn't await those calls. Use asyncio.gather to await all calls in parallel. Co-authored-by: Copilot CLI Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-12 15:26:40 +02:00
Benny Halevy	61d5a2df02	test_tablets_migration: test_tablet_back_and_forth_migration: await move_tablet Co-authored-by: Copilot CLI Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-12 15:26:40 +02:00
Benny Halevy	b8655748a2	test_tablets_migration: test_restart_in_cleanup_stage_after_cleanup: await move_task Co-authored-by: Copilot CLI Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-12 15:26:40 +02:00
Benny Halevy	10dccc2c4e	test_tablets_migration: test_restart_leaving_replica_during_cleanup: await move_task Co-authored-by: Copilot CLI Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-12 15:26:40 +02:00
Benny Halevy	c9d653fb1e	test_tablets_migration: drop unused imports from cassandra.query Co-authored-by: Copilot CLI Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-12 15:26:40 +02:00
Gleb Natapov	c30907b8f2	service level: remove unused get_user_scheduling_group function	2026-03-12 14:28:26 +02:00
Gleb Natapov	a934d8391d	service level: drop async find_effective_service_level find_cached_effective_service_level does exactly same thing now and it is synchronous.	2026-03-12 14:28:26 +02:00
Botond Dénes	15cfa5beeb	mutation/collection_mutation: don't copy the serialized collection serialize_collection_mutation() copies the serialized collection into the returned collection_mutation object. Change to move to avoid the copy. Fixes: SCYLLADB-1041 Closes scylladb/scylladb#29010	2026-03-12 13:57:40 +02:00
Gleb Natapov	f888f2dced	service level: remove remnants of version 1 service level can_use_effective_service_level_cache() always returns true now, so the function can be dropped entirely and all the code that assumes it may return false can be dropped as well.	2026-03-12 12:27:52 +02:00
Nadav Har'El	27f0510280	test/alternator: test_gzip_request_oversized now passes on AWS The Alternator test test_compressed_request.py::test_gzip_request_oversized checks that a very large request that compresses to a small size is still rejected. This test passed on Alternator, but used to fail on DynamoDB because DynamoDB didn't reject this case. This was a bug in DynamoDB (a "decompression bomb" vulnerability), and after I reported it, it was fixed. So now this test does pass on DynamoDB (after a small modification to allow for different error codes). So remove its scylla_only marker, and make the comment true to the current state. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28820	2026-03-12 10:41:56 +01:00
Marcin Maliszkiewicz	b277d9d9aa	cql3: track CQL parsing memory cost and use it for admission control Use rolling_max_tracker to record gross bytes allocated during each CQL parse. The rolling maximum is then added to the memory estimate for incoming QUERY and PREPARE requests so that the admission control in the CQL transport layer accounts for parsing overhead. The measured memory footprint serves as an upper bound rather than an exact number, but its purpose is to prevent OOMs under a heavy load of unprepared statements. In a benchmark, a node with 1G of memory shows a decrease in non-LSA memory usage from a peak of 320MB (our coordinator budget is 10% of 1G) to 96MB, while tps drops from 1.2 kops to 0.8 kops. The drop in tps is expected as memory admission kicks in trying to prevent OOM.	2026-03-12 10:16:10 +01:00
Botond Dénes	0b19a6de85	tombstone_gc: tombstone_gc_state::for_tests(): remove unused param Closes scylladb/scylladb#28923	2026-03-12 10:01:42 +01:00
Marcin Maliszkiewicz	2d22eea2f9	Merge 'cql3: Replace SCYLLA_ASSERT and abort by throwing_assert' from Nadav Har'El In this patch we replace every single use of SCYLLA_ASSERT(), abort() and assert() in the cql3/ directory by throwing_assert(). The problem with SCYLLA_ASSERT()/abort()/assert() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again. All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure. I reviewed all the changes that I did in these patches to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely. throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT()/abort() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, and abort() often prints no message at all. But throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message. 
Fixes #13970 (Exorcise assertions from CQL code paths) Refs #7871 (Exorcise assertions from Scylla) Closes scylladb/scylladb#28847 * github.com:scylladb/scylladb: cql3: remove unnecessary assert() cql3: replace abort() by throwing_assert() cql3: Replace SCYLLA_ASSERT by throwing_assert	2026-03-12 09:09:24 +01:00
Szymon Malewski	3116db6c2d	test: fix `testJsonOrdering` The `test/cqlpy/cassandra_tests/validation/entities/json_test.py::testJsonOrdering` was failing because of differences between Cassandra and Scylla in printing JSON floating point values - e.g. Cassandra prints 30.0, where Scylla prints 30. Both are valid, so in this patch, instead of comparing strings, we compare parsed JSON using `EquivalentJson`. Fixes #28467 Closes scylladb/scylladb#28924	2026-03-12 09:07:08 +01:00
Marcin Maliszkiewicz	5b2a07b408	utils: add rolling max tracker We will use it later to track parser memory usage via per query samples. Tests runtime in dev: 1.6s	2026-03-12 08:56:41 +01:00
Marcin Maliszkiewicz	54ef8fca57	auth: remove DEFAULT_SUPERUSER_NAME constant and dead DEFAULT_USER_PASSWORD DEFAULT_SUPERUSER_NAME is no longer referenced after removing the role_part special-casing in describe_roles. DEFAULT_USER_PASSWORD was dead code too.	2026-03-12 08:46:00 +01:00
Marcin Maliszkiewicz	029410e159	auth: use configurable default_superuser in describe_roles Replace the hardcoded meta::DEFAULT_SUPERUSER_NAME comparison with default_superuser(_qp) which reads from the auth_superuser_name config option. This makes the IF NOT EXISTS clause in DESCRIBE output correct for clusters with a non-default superuser name.	2026-03-12 08:45:47 +01:00
Nadav Har'El	09a399ae3c	Merge 'Replace estimated_histogram with approx_exponential_histogram - alternator' from Amnon Heiman _"A journey of a thousand miles begins with a single step" Lao Tzu_ ScyllaDB uses estimated_histogram in many places. We already have a more efficient alternative: approx_exponential_histogram. It is both CPU and memory-efficient and can be exported as Prometheus native histograms. Its main limitation (which has its benefits) is that the bucket layout is fixed at compile time, so histograms with different configurations cannot be mixed. The end goal is to replace all uses of estimated_histogram in the codebase. That migration needs a few small API adjustments, so I am splitting the work into steps for easier review. This series is the first step. It introduces a base template for fixed-size estimated histograms, and switches the Alternator's estimated_histogram with the template. This change is self-contained and valuable on its own, while keeping the scope limited. Minor adjustments were made to the code and tests so that the tests would pass. Follow-up PRs will apply the same pattern to the rest of the code. New feature no need to backport Closes scylladb/scylladb#28987 * github.com:scylladb/scylladb: alternator: migrate to operation_size_kb histograms test/alternator/test_metrics.py: Update the bucket in the histogram search alternator: Use batch_histogram for batch size histograms estimated_histogram.hh: adds estimated_histogram_with_max	2026-03-12 00:06:16 +02:00
Wojciech Mitros	b1bd206147	transport: extract the error handling from process_request_one When we forward CQL statements, we'll need to handle the errors on the destination node. Only for read_failure_exception_with_timeout exception, we'll still need to wait until timeout passes on the source node. For that we extract the exception handling to a separate method. Additionally, we separate the waiting and all other handling, so that all handling aside from waiting will be reusable after forwarding, and we'll also be able to sleep on the source node if necessary.	2026-03-11 19:40:47 +01:00
Wojciech Mitros	6184b1d5ea	transport: move error response helpers from connection to cql_server These methods are used only in the error handler in the cql server, and outside of 3 cases, they don't need any information from the cql_server::connection. We move them from cql_server::connection to cql_server, so that they can be used in the following patches for methods for CQL request forwarding where we'll have no instance of cql_server::connection on the node forwarded to. After the change the methods require no access to the server's or connection's fields, so we also make them static methods.	2026-03-11 19:40:47 +01:00
Amnon Heiman	1339a44163	alternator: migrate to operation_size_kb histograms Switch Alternator operation-size metrics from the legacy estimated histogram implementation to estimated_histogram_with_max<512> and export them through the native approx-exponential histogram path. Add a dedicated operation-size histogram type alias based on estimated_histogram_with_max<512>. Replace all per-operation size histograms (GetItem/PutItem/DeleteItem/ UpdateItem/BatchGetItem/BatchWriteItem) with the new type. Remove the custom legacy histogram-to-metrics adapter and use to_metrics_histogram() for operation size metrics, aligning export behavior with other approx-exponential histograms. Update Alternator metrics tests to compute expected le bucket boundaries using approx-exponential bucket math (including deduplication of equal bounds), so assertions match the new exported histogram schema. Update bucket helper signatures to use (max, precision) parameters and keep +Inf handling unchanged. Replace byte-to-KB ceiling conversion with plain integer division (bytes / 1024): histogram export already reports each bucket by its upper bound (le), so rounding input values up before bucketing is unnecessary and would over-shift borderline samples into higher buckets.	2026-03-11 17:29:14 +02:00
Marcin Maliszkiewicz	adc840919b	auth: move default_superuser to common, remove _superuser member Move default_superuser() to auth::meta in common.{hh,cc} and remove the cached _superuser member from both standard_role_manager and password_authenticator. The superuser name comes from config which is immutable at runtime, so caching it is unnecessary.	2026-03-11 16:28:38 +01:00
Marcin Maliszkiewicz	993e06c1ae	auth: use LOCAL_ONE for all auth queries Removes auth-v1 hack for cassandra superuser as auth-v1 code no longer exists. Also CL is not really used when querying raft replicated tables (like auth ones), but LOCAL_ONE is the least confusing one.	2026-03-11 16:27:15 +01:00
Marcin Maliszkiewicz	6d1153687a	auth: remove get_auth_ks_name indirection Replace get_auth_ks_name(qp) with db::system_keyspace::NAME directly. The function always returned the constant "system" and its qp parameter was unused.	2026-03-11 16:26:47 +01:00
David	79f9967eaa	docs: update theme 1.9 Motivation Upgrades Sphinx to 9.x, MyST Parser to 5.x, Python to 3.11+–3.14, Node.js to 22, and replaces Poetry with uv for dependency management. Changelog: https://github.com/scylladb/sphinx-scylladb-theme/blob/master/docs/source/upgrade/CHANGELOG.md#190---26-february-2026 How to test * Make sure you are using Python 3.11-3.14: * python --version * Install uv: * make setupenv * Build the docs: * make preview * Docs should render without errors at http://127.0.0.1:5500 Closes scylladb/scylladb#28971	2026-03-11 16:56:51 +02:00
Aleksandra Martyniuk	2e68f48068	nodetool: cluster repair: do not fail if a table was dropped nodetool cluster repair without additional params repairs all tablet keyspaces in a cluster. Currently, if a table is dropped while the command is running, all tables are repaired but the command finishes with a failure. Modify nodetool cluster repair. If a table wasn't specified (i.e. all tables are repaired), the command finishes successfully even if a table was dropped. If a table was specified and it does not exist (e.g. because it was dropped before the repair was requested), then the behavior remains unchanged. Fixes: SCYLLADB-568. Closes scylladb/scylladb#28739	2026-03-11 16:35:04 +02:00
Dani Tweig	45d7d9a96c	.github/workflow: also call call_sync_milestone_to_jira.yml for close milestone event What changed * Added closed to milestone event types in call_sync_milestone_to_jira.yml (types: [created] -> types: [created, closed]) * Added VECTOR to the list of Jira project keys being synced (jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG -> jira_project_keys: SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR) Why (Requirements Summary) * The call_sync_milestone_to_jira.yml workflow only triggered on milestone creation. When a GitHub milestone is closed, the corresponding Jira versions (in SCYLLADB, CUSTOMER, SMI, RELENG projects) should be marked as released. Adding the closed trigger enables the called workflow (main_sync_milestone_to_jira_release.yml in github-automation) to handle both creating and releasing Jira versions from GitHub milestone events. * Added the VECTOR project so its Jira versions are also created/released when milestones are created or closed in scylladb.git. * This is consistent with the same change already applied to the staging and scylla-machine-image repos. Fixes:PM-216 Update call_sync_milestone_to_jira.yml in scylladb.git - add close trigger and VECTOR project sync Closes scylladb/scylladb#28981	2026-03-11 15:56:55 +02:00
Amnon Heiman	69fbcd32bd	test/alternator/test_metrics.py: Update the bucket in the histogram search	2026-03-11 15:24:05 +02:00
Amnon Heiman	50af1f3671	alternator: Use batch_histogram for batch size histograms Switch batch-related histograms to estimated_histogram_with_max. This results in lower memory consumption and improved efficiency.	2026-03-11 15:21:25 +02:00
Amnon Heiman	b22162c719	estimated_histogram.hh: adds estimated_histogram_with_max This patch adds the estimated_histogram_with_max template that will be a base for specific estimated_histograms, eventually replacing the current struct implementation. Introduce estimated_histogram_with_max<Max> as a reusable wrapper around approx_exponential_histogram<1, Max, 4>, providing merge support and the same add helpers used by the existing estimated_histogram type. Add estimated_histogram_with_max_merge() Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-03-11 15:02:37 +02:00
Radosław Cybulski	fe8117feee	alternator: fix shard's parent calculation for vnodes Fix an invalid condition when searching for a parent shard when the table is based on vnodes. Each shard has an associated `last token`: the token that marks the end (inclusive) of the range of tokens it consumes. Additional assumptions are that the whole token space is used and that (for vnodes) the token space wraps around. Previously the code looked like this: auto pid = std::upper_bound(..., [](const dht::token& t, const cdc::stream_id& id) { return t < id.token(); }); if (pid != pids.begin()) { pid = std::prev(pid); } An `upper_bound` call with `t < id.token()` looks for the first iterator for which `t < id.token()` becomes true, which effectively means the first position whose token is bigger than the searched value. Then we move the iterator backward once if possible. Assuming token space <-2, 2> and parents [0, 2], when we search for: - -1 -> we will get 0, it's first, so we can't move backward, so 0 (ok) - 0 -> we will get 2, it's not first, so we go back and we return 0 (ok) - 1 -> we will get 2, it's not first, so we go back and we return 0 (not ok - should be 2) The fix is to replace it with `std::lower_bound` and remove the conditional backward motion. Since we have a guarantee that the whole token space is used, if `std::lower_bound` returns `end()`, we have a wrap-around case and need to pick `begin()` as the result. Fixes #28354 Fixes: SCYLLADB-537 Closes scylladb/scylladb#28382	2026-03-11 14:51:42 +02:00
Calle Wilund	bc544eb08e	gcs_fixture: Change to use docker helper	2026-03-11 12:32:02 +01:00
Calle Wilund	eb2dfe04e1	aws_kms_fixture: Modify to use docker helper	2026-03-11 12:32:02 +01:00
Calle Wilund	4a8afd9649	test/lib/proc_util: Add docker helper Adds a boost test equivalent of dockerized_service to handle launching a dockerized mock service using an ephemeral port, querying the port, and returning the process.	2026-03-11 12:32:02 +01:00
Calle Wilund	3e8a9a0beb	pytest: use ephemeral port publish for docker mock servers Changes dockerized_service to use ephemeral port publish, and query the published port from podman/docker. Modifies client code to use slightly changed usage syntax.	2026-03-11 12:32:01 +01:00
Piotr Dulikowski	d9a277453e	Merge 'cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race' from Alex Dathskovsky query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically. This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops. Test coverage was extended in test/cluster/test_prepare_race.py: - reproduces the invalidation-during-prepare window with injection, - verifies prepare completes successfully, - then invalidates again and executes the same stale client prepared object, - confirms the driver transparently re-requests/re-prepares and execution succeeds. This change introduces: - no behavior change for normal prepare flow besides stronger lifetime guarantees, - no new protocol semantics, - preserves existing cache invalidation logic, - adds explicit cluster-level regression coverage for both the race and driver reprepare path. - pushes the re-prepare operation towards the driver: the server will return an unprepared error the first time and the driver will have to re-prepare during the execution stage Fixes: https://github.com/scylladb/scylladb/issues/27657 Backport to active branches recommended: No node crash, but user-visible PREPARE failures under rare schema-invalidation race; low-risk timeout-bounded retry improves robustness. 
Closes scylladb/scylladb#28952 * github.com:scylladb/scylladb: transport/messages: hold pinned prepared entry in PREPARE result cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race	2026-03-11 12:09:23 +01:00
Calle Wilund	e3e940bc47	dbuild: Use container network in dbuild nested containers Remove the host network setting, ensuring we use private networks (slirp4netns). This will allow nested container port aliasing, helping CI stability (can use ephemeral ports and container introspection). This also makes the nested podman setup non-conditional, since we only run podman containers inside dbuild, and need the setup regardless if host container is docker or not.	2026-03-11 12:05:51 +01:00
Calle Wilund	8a56eafd39	scylla_cluster: Read notify sock in background to prevent deadlock Starts a thread to process scylla notify messages (NOTIFY_SOCKET) instead of just processing inline, non-blocking. This is because it is possible for the pipe created to be too small to hold enough messages for us to reach the point where we otherwise even read from said pipe, allowing the other end (scylla) to proceed.	2026-03-11 11:59:00 +01:00
Patryk Jędrzejczak	37aeba9c8c	Merge 'raft: add global read barrier to group0_batch::commit and switch auth and service levels' from Marcin Maliszkiewicz This series adds a global read barrier to raft_group0_client, ensuring that Raft group0 mutations are applied on all live nodes before returning to the caller. Currently, after a group0_batch::commit, the mutations are only guaranteed to be applied on the leader. Other nodes may still be catching up, leading to stale reads. This patch introduces a broadcast read barrier mechanism. Calling send_group0_read_barrier_to_live_members after committing will cause the coordinator to send a read barrier RPC to all live nodes (discovered via gossiper) and wait for them to complete. This is a best-effort attempt to get cluster-wide visibility of the committed state before the response is returned to the user. Auth and service levels write paths are switched to use this new mechanism. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-650 Backport: no, new feature Closes scylladb/scylladb#28731 * https://github.com/scylladb/scylladb: test: add tests for global group0_batch barrier feature qos: switch service levels write paths to use global group0_batch barrier auth: switch write paths to use global group0_batch barrier raft: add function to broadcast read barrier request raft: add gossiper dependency to raft_group0_client raft: add read barrier RPC	2026-03-11 10:37:19 +01:00
Botond Dénes	54bddeb3b5	sstables/mx/writer: yield after writing a cell With the goal of avoiding stalls on writing large collections, like below: ++[0#1/1 100%] addr=0x5422d1e total=32 count=1 avg=32: \| seastar::backtrace<seastar::backtrace_buffer::append_backtrace_oneline()::{lambda(seastar::frame)#1}> at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:85 ++ - addr=0x541b6d4: \| seastar::backtrace_buffer::append_backtrace_oneline at ./build/release/seastar/./seastar/src/core/reactor.cc:811 \| (inlined by) seastar::print_with_backtrace at ./build/release/seastar/./seastar/src/core/reactor.cc:838 ++ - addr=0x541afb7: \| seastar::internal::cpu_stall_detector::generate_trace at ./build/release/seastar/./seastar/src/core/reactor.cc:1479 ++ - addr=0x541b86c: \| seastar::internal::cpu_stall_detector::maybe_report at ./build/release/seastar/./seastar/src/core/reactor.cc:1214 \| (inlined by) seastar::internal::cpu_stall_detector::on_signal at ./build/release/seastar/./seastar/src/core/reactor.cc:1234 \| (inlined by) seastar::reactor::block_notifier at ./build/release/seastar/./seastar/src/core/reactor.cc:1548 /opt/scylladb/libreloc/libc.so.6: ELF 64-bit LSB shared object, x86-64, version 1 (GNU/Linux), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, BuildID[sha1]=f83d43b9b4b0ed5c2bd0a1613bf33e08ee054c93, for GNU/Linux 3.2.0, not stripped ++ - addr=/opt/scylladb/libreloc/libc.so.6+0x1a28f: \| sigpending at ??:0 ++ - addr=0x1760bf6: \| std::basic_string_view<signed char, std::char_traits<signed char> >::remove_prefix at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/string_view:302 \| (inlined by) managed_bytes_basic_view<(mutable_view)0>::remove_prefix at ././utils/managed_bytes.hh:421 \| (inlined by) _Z11read_simpleIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEET_RT0_ at ././utils/fragment_range.hh:365 \| (inlined by) 
_ZL9get_fieldIlTk14FragmentedView24managed_bytes_basic_viewIL12mutable_view0EEQsr3stdE12is_trivial_vIT_EES3_T0_j at ././mutation/atomic_cell.hh:62 \| (inlined by) atomic_cell_type::timestamp at ././mutation/atomic_cell.hh:103 \| (inlined by) basic_atomic_cell_view<(mutable_view)0>::timestamp at ././mutation/atomic_cell.hh:232 \| (inlined by) sstables::mc::writer::write_cell at ./sstables/mx/writer.cc:1101 \| (inlined by) sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0::operator() at ./sstables/mx/writer.cc:1233 \| (inlined by) collection_mutation_view::with_deserialized<sstables::mc::writer::write_collection(bytes_ostream&, clustering_key_prefix const, column_definition const&, collection_mutation_view, sstables::mc::writer::row_time_properties const&, bool)::$_0> at ././mutation/collection_mutation.hh:97 \| (inlined by) sstables::mc::writer::write_collection at ./sstables/mx/writer.cc:1221 ++ - addr=0x1677af3: \| sstables::mc::writer::write_cells at ./sstables/mx/writer.cc:1261 \| (inlined by) sstables::mc::writer::write_row_body at ./sstables/mx/writer.cc:1287 \| (inlined by) sstables::mc::writer::write_clustered at ./sstables/mx/writer.cc:1377 \| (inlined by) _ZN8sstables2mc6writer15write_clusteredI14clustering_rowQ9ClusteredIT_EEEvRKS4_9tombstone at ./sstables/mx/writer.cc:766 \| (inlined by) sstables::mc::writer::consume at ./sstables/mx/writer.cc:1425 Putting the yield in write_cell() instead of in write_collection() means that writing any row benefits from the added yield point in the middle. Refs: SCYLLADB-964 Closes scylladb/scylladb#28948	2026-03-11 10:34:55 +01:00
Botond Dénes	475220b9c9	Merge 'Remove the rest of pre raft topology code' from Gleb Natapov Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is still not upgraded to raft topology. Neither of those is supported any more. No need to backport since we remove functionality here. Closes scylladb/scylladb#28841 * github.com:scylladb/scylladb: service level: remove version 1 service level code features: move GROUP0_SCHEMA_VERSIONING to deprecated features list migration_manager: remove unused forward definitions test: remove unused code auth: drop auth_migration_listener since it does nothing now schema: drop schema_registry_entry::maybe_sync() function schema: drop make_table_deleting_mutations since it should not be needed with raft schema: remove calculate_schema_digest function schema: drop recalculate_schema_version function and its uses migration_manager: drop check for group0_schema_versioning feature cdc: drop usage of cdc_local table and v1 generation definition storage_service: no need to add yourself to the topology during reboot since raft state loading already did it storage_service: remove unused functions group0: drop with_raft() function from group0_guard since it always returns true now gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more gossiper: drop tokens from loaded_endpoint_state gossiper: remove unused functions storage_service: do not pass loaded_peer_features to join_topology() storage_service: remove unused fields from replacement_info gossiper: drop is_safe_for_restart() function and its use storage_service: remove unused variables from join_topology gossiper: remove the code that was only used in gossiper topology storage_service: drop the check for raft mode from recovery code cdc: remove legacy code test: remove unused injection points auth: remove legacy auth mode and upgrade code treewide: remove schema pull code since we never pull schema any more raft topology: drop upgrade_state and 
its type from the topology state machine since it is not used any longer group0: hoist the checks for an illegal upgrade into main.cc api: drop get_topology_upgrade_state and always report upgrade status as done service_level_controller: drop service level upgrade code test: drop run_with_raft_recovery parameter to cql_test_env group0: get rid of group0_upgrade_state storage_service: drop topology_change_kind as it is no longer needed storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more service_storage: remove unused functions storage_service: remove non raft rebuild code storage_service: set topology change kind only once group0: drop in_recovery function and its uses group0: rename use_raft to maintenance_mode and make it sync	2026-03-11 10:24:20 +02:00
Piotr Dulikowski	38a2829f69	Merge 'Return HTTP error description in Vector Store client' from Szymon Wasik The `service_error` struct: `6dc2c42f8b/service/vector_store_client.hh (L64)` currently stores just the error status code. For this reason whenever the HTTP error occurs, only the error code can be forwarded to the client. For example see here: `6dc2c42f8b/service/vector_store_client.cc (L580)` For this reason in the output of the drivers full description of the error is missing which forces user to take a look into Scylla server logs. The objective of this PR is to extend the support for HTTP errors in Vector Store client to handle messages as well. Moreover, it removes the quadratic reallocation in response_content_to_sstring() helper function that is used for getting the response in case of error. Fixes: VECTOR-189 Closes scylladb/scylladb#26139 * github.com:scylladb/scylladb: vector_search: Avoid quadratic reallocation in response_content_to_sstring vector_store_client: Return HTTP error description, not just code	2026-03-11 09:19:27 +01:00
Calle Wilund	6d8ac23731	test_encryption: Use maximum replication in _smoke_test Refs: SCYLLADB-557 We should use full replication in KS/CF creation and population, for at least two reasons: 1.) Ensure we wait fully for and write to all nodes 2.) Make test more "real", behaving like a proper cluster Closes scylladb/scylladb#28959	2026-03-11 09:54:57 +02:00
Nadav Har'El	00a819bcd8	cql3: remove unnecessary assert() In cql3/, there was one call to assert() (not SCYLLA_ASSERT or throwing_assert), and it was: const auto shard_num = smp::count; assert(shard_num > 0) Rather than converting this assert() to throwing_assert() as I did in previous patches, I decided to outright remove it: Seastar guarantees that smp::count is not zero. Many other places in the code use smp::count assuming that it is correct, no other place bothers to assert it isn't zero. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-03-11 09:43:24 +02:00
Nadav Har'El	34eec020b3	cql3: replace abort() by throwing_assert() After the previous patch replaced all SCYLLA_ASSERT() calls by throwing_assert(), this patch also replaces all calls to abort(). All these abort() calls are supposedly cases that can never happen, but if they ever do happen because of a bug, in none of these places do we absolutely need to crash; an exception that aborts the current operation should be enough. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-03-11 09:43:11 +02:00
Nadav Har'El	c87d6407ed	cql3: Replace SCYLLA_ASSERT by throwing_assert In this patch we replace every single use of SCYLLA_ASSERT() in the cql3/ directory by throwing_assert(). The problem with SCYLLA_ASSERT() is that when it fails, it crashes Scylla. This is almost always a bad idea (see #7871 discussing why), but it's even riskier in front-end code like cql3/: In front-end code, there is a risk that due to a bug in our code, a specific user request can cause Scylla to crash. A malicious user can send this query to all nodes and crash the entire cluster. When the user is not malicious, it causes a small problem (a failing request) to become a much worse crash - and worse, the user has no idea which request is causing this crash and the crash will repeat if the same request is tried again. All of this is solved by using the new throwing_assert(), which is the same as SCYLLA_ASSERT() but throws an exception (using on_internal_error()) instead of crashing. The exception will prevent the code path with the invalid assumption from continuing, but will result in only the current user request being aborted, with a clear error message reporting the internal server error due to an assertion failure. I reviewed all the changes that I did in this patch to check that (to the best of my understanding) none of the assertions in cql3/ involve the sort of serious corruption that might require crashing the Scylla node entirely. throwing_assert() also improves logging of assertion failures compared to the original SCYLLA_ASSERT() - SCYLLA_ASSERT() printed a message to stderr which in many installations is lost, whereas throwing_assert() uses Scylla's standard logger, and also includes a backtrace in the log message. Fixes #13970 (Exorcise assertions from CQL code paths) Refs #7871 (Exorcise assertions from Scylla) Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-03-11 09:41:20 +02:00
Botond Dénes	99fa912f1b	Merge 'Generalize streaming scopes tests' from Pavel Emelyanov To restore how streaming scopes work there are two tests that greatly duplicate each other -- test_restore_with_streaming_scopes from cluster/object_store suite and test_refresh_with_streaming_scopes from cluster suite. This patch generalizes both into a do_test_streaming_scopes() non-test function Closes scylladb/scylladb#28874 * github.com:scylladb/scylladb: test: Re-sort comments around do_test_streaming_scopes() test: Split do_load_sstables() test: Drop load_fn argument from do_load_sstables() test: Re-use do_test_streaming_scopes() in refresh test test: Introduce SSTablesOnLocalStorage test: Introduce SSTablesOnObjectStorage test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes()	2026-03-11 09:35:21 +02:00
Dmitriy Kruglov	cee44716db	docs: add cluster platform migration procedure Document how to migrate a ScyllaDB cluster to different instance types using the add-and-replace node cycling approach. Closes: QAINFRA-42 Closes scylladb/scylladb#28458	2026-03-11 09:31:35 +02:00
Nadav Har'El	401dc1894c	test/alternator,cqlpy: avoid xfail_strict against DynamoDB/Cassandra Recently, in commit `7b30a39`, we added to pytest.ini the option xfail_strict. This option causes every time a test XPASSes, i.e., an XFAIL test actually passes, to be considered an error and fail the test. While this has some benefits, it's a big problem when running tests against a reference implementation like DynamoDB or Cassandra: We typically mark a test "xfail" if the test shows a known bug - i.e., if the test fails on Scylla but passes on the reference system (DynamoDB or Cassandra). This means that when running "test/cqlpy/run-cassandra" or "test/alternator/run --aws", we expect to see many tests XPASS, and now this will cause these runs to "fail". So in this patch we add the xfail_strict=false to cqlpy/run-cassandra and alternator/run --aws. This option is not added to cqlpy/run or to alternator/run without --aws, and also doesn't affect test.py or Jenkins. P.S. This is another nail in the coffin of doing "cd test/alternator; pytest --aws". You should get used to running Alternator tests through test/alternator/run, even if you don't need to run Scylla (the "--aws" option doesn't run Scylla). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28973	2026-03-11 09:29:30 +02:00
Robert Bindar	29619e48d7	replica/table: calculate manifest tablet_count from tablet map During tests I noticed that if the number of tablets is very small, say 2, and the number of nodes is 3 (2 shards per node), using the number of storage groups on each shard, a shard may end up holding 0 groups, whilst the other holds 1 group. And in some nodes even both shards have 0 groups. Taking the minimum among shards here was showing in manifests a tablet count of 0 for all 3 nodes, which is incorrect. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#28978	2026-03-11 09:27:04 +02:00
Botond Dénes	3fed6f9eff	Merge 'service: tasks: scan all tablets in tablet_virtual_task::wait' from Aleksandra Martyniuk Currently, for repair tasks tablet_virtual_task::wait gathers the ids of tablets that are to be repaired. The gathered set is later used to check if the repair is still ongoing. However, if the tablets are resized (split or merged), the gathered set becomes irrelevant. Thus, we may end up with an invalid tablet id error being thrown. Wait until repair is done for all tablets in the table. Fixes: https://github.com/scylladb/scylladb/issues/28202 Backport to 2026.1 needed as it contains the change introducing the issue `d51b1fea94` Closes scylladb/scylladb#28323 * github.com:scylladb/scylladb: service: fix indentation test: add test_tablet_repair_wait service: remove status_helper::tablets service: tasks: scan all tablets in tablet_virtual_task::wait	2026-03-11 09:24:07 +02:00
Raphael S. Carvalho	cc5b1acadf	Improve log when sstable load fails due to missing tablet replica A bug or some bad operator intervention can lead to an sstable existing in a node after the tablet replica was moved to a different node. This will result in sstable loading failing during boot, requiring operator intervention. The log today just dumps the name of the "orphaned" sstable, but one investigating it might want to know which process (repair, memtable, whatever) generated that sstable, if the sstable was created locally or remotely, and the current replica set of the underlying tablet. From the original identifier, we can know the exact time the sstable was created on its original node. From the current id, we know the time it was created on the current node. All this info can help the investigator to correlate with events in other nodes (including actions from the coordinator) to get closer to the root cause. The new log will look like this: "Unable to load SSTable .../me-3gyg_1fsw_2u0u826b00b71vc46o-big-Data.db (originated from compaction with id 913f41c0-18c2-11f1-8f08-cb8521b3f330 on host e483238c-2287-4022-8bc4-b4f1c4cb2b0d) of tablet 6 (replica set: [e483238c-2287-4022-8bc4-b4f1c4cb2b0d:0])" Refs https://scylladb.atlassian.net/browse/SCYLLADB-788. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#28921	2026-03-11 06:20:34 +02:00
Avi Kivity	b17e1259e3	Merge 'types: optimize vector deserialization for high-dimensional vectors' from Szymon Wasik Vector deserialization is an operation which performance is critical for vector similarity search feature because it is frequently executed during rescoring operation. Some of the identified performance bottlenecks for it include: 1. Per-element virtual dispatch in deserialize(): each of the N elements went through visit() which switches on ~28 type variants. For a 1024-dimension float vector, that's 1024 redundant type switches when the element type is the same for all of them. 2. Redundant work in split_fragmented(): value_length_if_fixed() was called inside the loop (N virtual calls), and no reserve() was done on the output vector causing repeated reallocations. This series fixes both: - Introduce deserialize_vector_visitor that dispatches on the element type once for the entire vector, then loops inside the resolved handler. Simple numeric types (float, int, etc.) call deserialize_value() directly with no virtual dispatch per element. String types (ascii, utf8) get a dedicated handler that skips make_empty() (sstring has no empty_t constructor). Complex types (list, map, tuple, etc.) fall back to per-element dispatch. - In split_fragmented(), reserve the output vector to _dimension and cache value_length_if_fixed() before the loop. Benchmark results (1024-dim float vector, release build, -O3 -flto): deserialize: 15.73 us -> 11.70 us (1.34x, 26% faster) split_fragmented: 10.34 us -> 7.45 us (1.39x, 28% faster) References: SCYLLADB-471 Backport: none, unless we observe some critical performance improvement for quantization. Closes scylladb/scylladb#28618 * github.com:scylladb/scylladb: types: optimize reading vector fragments types: optimize vector deserialization for high-dimensional vectors	2026-03-11 00:39:46 +02:00
Dawid Mędrek	167feabe1a	cql3: Reject user-provided timestamps for strongly consistent tables Similarly to LWTs, we reject queries with user-provided timestamps when they target strongly consistent tables. Such statements could force us to rewrite history, and that contradicts the philosophy of linearizability we aim for. Fixes SCYLLADB-879 Closes scylladb/scylladb#28867	2026-03-10 22:11:39 +02:00
Marcin Maliszkiewicz	8ae80a32c0	Update seastar submodule * seastar d2953d2a...4d268e0e (32): > Merge 'prometheus: support multiple __name__ filters and prefixed names' from Travis Downs doc: update prometheus.md with __name__ filter enhancements prometheus: support prefixed names in __name__ filter prometheus: add benchmarks for name filter performance prometheus: support multiple __name__ query parameters prometheus: move write_body_args to header > fair_queue: Subtract from _queued_capacity on pop_front() > memory: expose cumulative allocated bytes statistic > Merge 'Add ability to configure IO bandwidth limit for supergroup' from Pavel Emelyanov test: IO bandiwdth throttler unit tests code: Add ability to configure IO bandwidth limit for supergroup io_queue: Have more than one throttler par class io_queue: Introduce bandwidth_throttler helper class io_queue: Nest io_group::priotiy_class_data-s io_queue: Update class bandwidth on group's class data io_queue: Make io_group::priority_class_data::tokens() static fair_queue: Introduce group (un)plugging > Fix _shard_to_numa_node_mapping double population > Use exception parameter in log_timer_callback_exception() > Fix wakeup_granularity() fallback debug-fs reading > test_fixture: Fix SEASTAR_FIXTURE_THREAD_TEST_CASE thread not propagated > build: support tuning -ffile-prefix-map > test: Remove unused C::dup() method of testing class > src/core/reactor: introduce reactor::get_backend_name() > util/process: add pid() accessor > Merge 'Add source location to task and tasktrace object' from Radosław Cybulski coroutine.hh: disable source_location for GCC to avoid ICE reactor: improve do_dump_task_queue reporting Use source_location in `do_dump_task_queue` Update backtrace with source locations of resume points Add calls to update resume_point Add a std::source_location (resume_point) to task object. 
> Merge 'Refine posix file .dup() implementation' from Pavel Emelyanov file: Templatize posix_file_handle_impl file: Don't dup() non-read-only files file: Split ..._impl::dup() implementations test: Add a simple test for dup() > Merge 'Deprecate reactor::make_pollable_fd(socket_address, int)' from Pavel Emelyanov reactor: Deprecate make_pollable_fd() net/posix: Create file_desc for sockets in-place reactor,net: Keep sock_need_nonblock boolean on posix_network_stack net/posix: Re-format constructor initializer lists > Merge 'test: add fuzz testing infrastructure and sstring fuzzer' from Travis Downs test: add fuzz tests to CI workflow test: add sstring differential fuzzer test: add fuzz testing infrastructure > Introduce "integrated queue length" metrics and use it for IO classes (#3210) > reactor: Remove get_sg_data(unsigned) overload > memcached: Stop using scattered_message > reactor: Mark uptime() method const > alien: Remove deprecated run_on and submit_to calls > file: make open_flags and access_flags constexpr > scheduling: Unfriend some methods from scheduling_group > reactor: Move _dying bit to epoll backend > file: coroutinize the with_file templates > configure: validate --cook ingredient names > fix trailing whitespace > Merge 'Estimate timing overhead, allow failing if it is too high' from Travis Downs perf_tests: document overhead column and threshold options perf_tests: add measurement overhead tracking and warnings perf_tests: remove inline/hot attributes from time_measurement methods perf_tests: move time_measurement class to implementation file perf_tests: move perf counters into time_measurement singleton > rpc: log handler type > Merge 'Add pre-commit with trailing whitespace hook' from Travis Downs Add GitHub Actions workflow for pre-commit enforcement Add pre-commit setup documentation to HACKING.md Add pre-commit configuration with trailing-whitespace hook Remove trailing whitespace from source files > posix-stack: Make 
internal::posix_connect() resolve exceptions into futures > sstring: fix npos to be size_t for consistency with std::string Closes scylladb/scylladb#28954	2026-03-10 22:06:58 +02:00
Szymon Wasik	7fae78d2b0	types: optimize reading vector fragments There was redundant work in split_fragmented(): value_length_if_fixed() was called inside the loop (N virtual calls), and no reserve() was done on the output vector, causing repeated reallocations. This patch reserves the output vector to _dimension and caches value_length_if_fixed() before the loop. Additionally, split read_vector_element() into two specialized functions: read_vector_element_fixed() and read_vector_element_variable(), and hoist the branch on fixed_len outside the loop in split_fragmented() and deserialize_loop(). This avoids a conditional branch per element in the hot path. Benchmark results (1024-dim float vector, release build, -O3 -flto): 10.34 us -> 7.45 us (1.39x, 28% faster)	2026-03-10 20:17:31 +01:00
Taras Veretilnyk	579269b3c5	test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade Verify that scylla sstable upgrade fails when an sstable has a corrupted Statistics component digest, and succeeds when the --ignore-component-digest-mismatch flag is provided.	2026-03-10 19:24:05 +01:00
Taras Veretilnyk	fc4c82b962	docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade	2026-03-10 19:24:05 +01:00
Taras Veretilnyk	7214f5a0b6	sstables: propagate ignore_component_digest_mismatch config to all load sites Add ignore_component_digest_mismatch option to db::config (default false). When set, sstable loading logs a warning instead of throwing on component digest mismatches, allowing a node to start up despite corrupted non-vital components or bugs in digest calculation. Propagate the config to all production sstable load paths: - distributed_loader (node startup, upload dir processing) - storage_service (tablet storage cloning) - sstables_loader (load-and-stream, download tasks, attach) - stream_blob (tablet streaming)	2026-03-10 19:24:05 +01:00
Taras Veretilnyk	c123f637ea	sstables: add option to ignore component digest mismatches Add `ignore_component_digest_mismatch` option to `sstable_open_config` that logs a warning instead of throwing `malformed_sstable_exception` on component digest mismatch. This is useful for recovering sstables with corrupted non-vital components or working around bugs in digest calculation. Expose the option in scylla-sstable via the `--ignore-component-digest-mismatch` flag for the upgrade operation.	2026-03-10 19:24:05 +01:00
Taras Veretilnyk	95420014ea	sstable_compaction_test: Add scrub validate test for corrupted index Generalize corrupt_sstable() and scrub_validate_corrupted_file() to accept a component_type parameter, defaulting to Data, so they can be reused for corrupting other components.	2026-03-10 19:24:05 +01:00
Taras Veretilnyk	a3912cf7f1	sstables: add tests for component digest validation on corrupted SSTables Add tests that verify SSTable component digest validation detects corruption on load. Each test writes an SSTable, corrupts a specific component file by flipping a bit, then asserts that reloading the SSTable throws malformed_sstable_exception with the expected digest mismatch message.	2026-03-10 19:24:05 +01:00
Taras Veretilnyk	e78a3d2c44	sstables: validate index components digests during SSTable scrub in validate mode	2026-03-10 19:24:05 +01:00
Taras Veretilnyk	9decbdeab0	sstables: verify component digests on SSTable load Add integrity verification for SSTable component files by validating their CRC32 digests against the expected values stored in Scylla metadata during SSTable loading. The following components are validated on load: TOC, Scylla metadata, CompressionInfo, Statistics, Summary, and Filter.	2026-03-10 19:24:05 +01:00
Taras Veretilnyk	478c1eaec5	sstables: add digest_file_random_access_reader for CRC32 digest computation Add a new reader that wraps file_random_access_reader and computes a running CRC32 digest over the data as it is read. The digest accumulates across sequential read_exactly() calls and is reset on seek(), since a non-sequential read invalidates the running checksum.	2026-03-10 19:24:05 +01:00
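The running-digest behaviour described in the commit above can be illustrated with a minimal Python sketch. It is not the C++ reader from the patch: the class name `DigestReader` and its method names are hypothetical, and `zlib.crc32` stands in for whatever CRC32 routine the real code uses. The sketch only shows the two properties the message states: the digest accumulates across sequential reads and is reset on seek.

```python
import io
import zlib


class DigestReader:
    """Hypothetical stand-in for a reader that tracks a running CRC32."""

    def __init__(self, stream):
        self._stream = stream
        self._crc = 0  # zlib.crc32 uses 0 as its starting value

    def read_exactly(self, n):
        data = self._stream.read(n)
        if len(data) != n:
            raise EOFError(f"wanted {n} bytes, got {len(data)}")
        # Feed the previous digest back in so it accumulates across reads.
        self._crc = zlib.crc32(data, self._crc)
        return data

    def seek(self, pos):
        # A non-sequential read invalidates the running checksum.
        self._stream.seek(pos)
        self._crc = 0

    def digest(self):
        return self._crc


payload = b"component bytes"
reader = DigestReader(io.BytesIO(payload))
reader.read_exactly(9)
reader.read_exactly(6)
# Two sequential reads yield the same digest as one pass over the whole buffer.
sequential_matches = reader.digest() == zlib.crc32(payload)
reader.seek(10)
reset_on_seek = reader.digest() == 0
```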
Szymon Wasik	6c0ef8eb92	types: optimize vector deserialization for high-dimensional vectors One of the performance bottlenecks while deserializing vectors was per-element virtual dispatch in deserialize(): each of the N elements went through visit() which switches on ~28 type variants. For a 1024-dimension float vector, that's 1024 redundant type switches when the element type is the same for all of them. This patch introduces deserialize_vector_visitor that dispatches on the element type once for the entire vector, then loops inside the resolved handler. Simple numeric types (float, int, etc.) call deserialize_value() directly with no virtual dispatch per element. String types (ascii, utf8) get a dedicated handler that skips make_empty() (sstring has no empty_t constructor). Complex types (list, map, tuple, etc.) fall back to per-element dispatch. Benchmark results (1024-dim float vector, release build, -O3 -flto): 15.73 us -> 11.70 us (1.34x, 26% faster)	2026-03-10 18:21:34 +01:00
Andrzej Jackowski	9247dff8c2	reader_concurrency_semaphore: fix leak workaround `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` introduces a protection against resources that are "made up" of thin air to `reader_concurrency_semaphore`. If there are more `_resources` than the `_initial_resources`, it means there is a negative leak, and `on_internal_error_noexcept` is called. In addition to it, `_resources` is set to `std::max(_resources, _initial_resources)`. However, the commit message of `e4da0afb8d5491bf995cbd1d7a7efb966c79ac34` states the opposite: "The detection also clamps the _resources to _initial_resources, to prevent any damage". Before this commit, the protection mechanism doesn't clamp `_resources` to `_initial_resources` but instead keeps `_resources` high, possibly even indefinitely growing. This commit changes `std::max` to `std::min` to make the code behave as intended. Refs: SCYLLADB-163 Closes scylladb/scylladb#28982	2026-03-10 18:57:31 +02:00
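The difference between the two clamping choices described in the commit above can be shown with a small Python sketch. It is an illustration only: the function names are hypothetical, and a single integer stands in for the semaphore's resources, which is a simplifying assumption about the real type.

```python
def clamp_before_fix(resources, initial_resources):
    # Behaviour before the fix: max() keeps the inflated value.
    return max(resources, initial_resources)


def clamp_after_fix(resources, initial_resources):
    # Behaviour after the fix: min() caps resources at the initial amount.
    return min(resources, initial_resources)


initial = 100
leaked = 130  # more resources than were ever handed out: a "negative leak"

still_inflated = clamp_before_fix(leaked, initial)
clamped = clamp_after_fix(leaked, initial)
```

With these inputs `still_inflated` stays at 130 while `clamped` returns to 100, which is the behaviour the original commit message intended.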
Szymon Wasik	74d86d3fe9	vector_search: Avoid quadratic reallocation in response_content_to_sstring Pre-compute the total size and allocate a single uninitialized sstring before copying the buffers, following the pattern from Seastar's read_entire_stream_contiguous(). This avoids iterative reallocation which is O(n^2) for large responses.	2026-03-10 17:45:55 +01:00
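The single-allocation pattern from the commit above can be sketched in Python as follows. This is an illustration of the idea rather than the Seastar or ScyllaDB code: `concat_buffers` is a hypothetical name, and a `bytearray` plays the role of the uninitialized sstring.

```python
def concat_buffers(buffers):
    """Join byte buffers using one up-front allocation (hypothetical helper)."""
    # Pre-compute the total size so the output is allocated exactly once.
    total = sum(len(b) for b in buffers)
    out = bytearray(total)
    offset = 0
    for b in buffers:
        # Copy each buffer into its final position; no intermediate growth.
        out[offset:offset + len(b)] = b
        offset += len(b)
    return bytes(out)


chunks = [b"HTTP ", b"error ", b"description"]
joined = concat_buffers(chunks)
```

Appending each buffer to a growing string can instead recopy everything accumulated so far on every step, which is where the O(n^2) cost mentioned in the message comes from.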
Szymon Wasik	d27610f138	vector_store_client: Return HTTP error description, not just code This simple patch adds support for storing the HTTP error description that Vector Store client receives from vector store. Until now it was just printed to the log but it was not returned. For this reason it was not forwarded to the drivers which forced users to access ScyllaDB server logs to understand what is wrong with Vector Store. This patch also updates formatter to print the message next to the error code. Fixes: VECTOR-189	2026-03-10 17:22:30 +01:00
Nadav Har'El	92ee959e9b	test/alternator: speed up test_streams.py by using module-scope fixtures Previously, all stream-table fixtures in this test file used scope="function", forcing a fresh table to be created for every test, slowing down the test a bit (though not much), and discouraging writing small new tests. This was a workaround for a DynamoDB quirk (that Alternator doesn't have): LATEST shard iterators have a time slack and may point slightly before the true stream head, causing leftover events from a previous test to appear in the next test's reads. We fix this by draining the stream inside latest_iterators() and shards_and_latest_iterators() after obtaining the LATEST iterators: fetch records in a loop until two consecutive polling rounds both return empty, guaranteeing the iterators are positioned past all pre-existing events before the caller writes anything. With this guarantee in place, all stream-table fixtures can safely use scope="module". After this patch, test_streams.py continues to pass on DynamoDB. On Alternator, the test file's run time went down a bit, from 20.2 seconds to 17.7 seconds. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-03-10 17:14:04 +02:00
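The draining rule described in the commit above (poll until two consecutive rounds return nothing) can be sketched in Python. This is a simplified illustration: `drain_until_quiet` and `fetch_records` are hypothetical names, and the real test helpers work with DynamoDB Streams shard iterators rather than a plain callable.

```python
def drain_until_quiet(fetch_records, required_empty_rounds=2):
    """Poll until `required_empty_rounds` consecutive polls return no records."""
    drained = []
    empty_rounds = 0
    while empty_rounds < required_empty_rounds:
        records = fetch_records()
        if records:
            drained.extend(records)
            empty_rounds = 0  # any record restarts the count
        else:
            empty_rounds += 1
    return drained


# Simulated polling rounds: a leftover event, a gap, two more events, then quiet.
rounds = iter([[1], [], [2, 3], [], []])
leftovers = drain_until_quiet(lambda: next(rounds, []))
```

Requiring two empty rounds rather than one guards against a single empty poll that happens to land between late-arriving events, as in the second simulated round.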
Nadav Har'El	6ac1f1333f	test/alternator: test_streams.py don't use fixtures in 4 tests In the next patch, we plan to make the fixtures in test_streams.py shared between tests. Most tests work well with shared tables, but two (test_streams_trim_horizon and test_streams_starting_sequence_number) were written to expect a new table with an empty history, and two others (test_streams_closed_read and test_streams_disabled_stream) want to disable streaming and would break a shared table. So in this patch we modify these four tests to create their own new table instead of using a fixture. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-03-10 17:12:33 +02:00
Botond Dénes	81e214237f	Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk This pull request adds support for calculation and storing CRC32 digests for all SSTable components. This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream. Several test cases where introduced to verify expected behaviour. Additionally, this PR adds new rewrite component mechanism for safe sstable component rewriting. Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup. However, with component digests stored in scylla_metadata (#20100), replacing a component like Statistics requires atomically updating both the component and scylla_metadata with the new digest - impossible with POSIX rename. The new mechanism creates a clone sstable with a fresh generation: - Hard-links all components from the source except the component being rewritten and scylla_metadata - Copies original sstable components pointer and recognized components from the source - Invokes a modifier callback to adjust the new sstable before rewriting - Writes the modified component along with updated scylla_metadata containing the new digest - Seals the new sstable with a temporary TOC - Replaces the old sstable atomically, the same way as it is done in compaction This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair). In case of any failure durning the whole process, sstable will be automatically deleted on the node startup due to temporary toc persistence. 
Backport is not required, it is a new feature Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453 Closes scylladb/scylladb#28338 * github.com:scylladb/scylladb: docs: document components_digests subcomponent and trailing digest in Scylla.db sstable_compaction_test: Add tests for perform_component_rewrite sstable_test: add verification testcases of SSTable components digests persistance sstables: store digest of all sstable components in scylla metadata sstables: replace rewrite_statistics with new rewrite component mechanism sstables: add new rewrite component mechanism for safe sstable component rewriting compaction: add compaction_group_view method to specify sstable version sstables: add null_data_sink and serialized_checksum for checksum-only calculation sstables: extract default write open flags into a constant sstables: Add write_simple_with_digest for component checksumming sstables: Extract file writer closing logic into separate methods sstables: Implement CRC32 digest-only writer	2026-03-10 16:02:53 +02:00
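The digest-while-writing idea described in the merge above can be illustrated with a minimal Python sketch. It is not ScyllaDB's crc32_digest_file_writer; the class name `Crc32DigestWriter`, the `component_digests` dictionary, and the sample chunks are hypothetical, and the only assumption taken from the commit message is that a CRC32 is accumulated over everything written to a component.

```python
import io
import zlib


class Crc32DigestWriter:
    """Hypothetical wrapper: forwards writes to a sink and keeps a running CRC32."""

    def __init__(self, sink):
        self._sink = sink
        self._crc = 0  # CRC32 of the empty byte string

    def write(self, data: bytes) -> int:
        # zlib.crc32 accepts the previous value, so the digest is updated
        # incrementally without buffering the whole component in memory.
        self._crc = zlib.crc32(data, self._crc)
        return self._sink.write(data)

    @property
    def digest(self) -> int:
        return self._crc


sink = io.BytesIO()
writer = Crc32DigestWriter(sink)
for chunk in (b"partition-", b"index-", b"data"):
    writer.write(chunk)

# Hypothetical stand-in for the per-component digests kept in the metadata.
component_digests = {"Index": writer.digest}
```

The running digest equals the CRC32 of the concatenated bytes, which is why the writer never needs a second pass over the file.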
Aleksandra Martyniuk	e02b0d763c	service: fix indentation	2026-03-10 14:44:52 +01:00
Aleksandra Martyniuk	02257d1429	test: add test_tablet_repair_wait Add a test that checks if tablet_virtual_task::wait won't fail if tablets are merged.	2026-03-10 14:42:27 +01:00
Aleksandra Martyniuk	0e0070f118	service: remove status_helper::tablets Currently, status_helper::tablets, which keeps a vector of processed tablet ids, is used only in tablet_virtual_task::get_status_helper, so there is no point in returning it. Also, in get_status_helper, it is used only to determine if any tablets are processed. Remove status_helper::tablets. Use a flag instead of the vector in get_status_helper.	2026-03-10 14:42:27 +01:00
Aleksandra Martyniuk	e5928497ce	service: tasks: scan all tablets in tablet_virtual_task::wait Currently, for repair tasks tablet_virtual_task::wait gathers the ids of tablets that are to be repaired. The gathered set is later used to check if the repair is still ongoing. However, if the tablets are resized (split or merged), the gathered set becomes irrelevant. Thus, we may end up with an invalid tablet id error being thrown. Wait until repair is done for all tablets in the table.	2026-03-10 14:42:21 +01:00
Andrei Chekun	c36df5ecf4	test.py: eliminite drivers exception There is a race condition in driver that raises the RuntimeException. This pollutes the output, so this PR is just silencing this exception. Fixes: SCYLLADB-900 Closes scylladb/scylladb#28957	2026-03-10 14:31:36 +02:00
Alex	3ac4e258e8	transport/messages: hold pinned prepared entry in PREPARE result result_message::prepared now owns a strong pinned prepared-cache entry instead of relying only on a weak pointer view. This closes the remaining lifetime gap after query_processor::prepare() returns, so users of the returned PREPARE message cannot observe an invalidated weak handle during subsequent processing. - update result_message::prepared::cql constructor to accept pinned entry - construct weak view from owned pinned entry inside the message - pass pinned cache entry from query_processor::prepare() into the message constructor	2026-03-10 14:17:57 +02:00
Alex	27051d9a7c	cql3: pin prepared cache entry in prepare() to avoid invalid weak handle race query_processor::prepare() could race with prepared statement invalidation: after loading from the prepared cache, we converted the cached object to a checked weak pointer and then continued asynchronous work (including error-injection waitpoints). If invalidation happened in that window, the weak handle could no longer be promoted and the prepare path could fail nondeterministically. This change keeps a strong cache entry reference alive across the whole critical section in prepare() by using a pinned cache accessor (get_pinned()), and only deriving the weak handle while the entry is pinned. This removes the lifetime gap without adding retry loops. Test coverage was extended in test/cluster/test_prepare_race.py: - reproduces the invalidation-during-prepare window with injection, - verifies prepare completes successfully, - then invalidates again and executes the same stale client prepared object, - confirms the driver transparently re-requests/re-prepares and execution succeeds. This change introduces: - no behavior change for normal prepare flow besides stronger lifetime guarantees, - no new protocol semantics, - preserves existing cache invalidation logic, - adds explicit cluster-level regression coverage for both the race and driver reprepare path. - pushes the re-prepare operation towards the driver: the server will return an unprepared error the first time and the driver will have to re-prepare during the execution stage	2026-03-10 14:17:57 +02:00
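The lifetime gap described in the commit above can be sketched in Python with `weakref`. This is only an analogy, not the C++ prepared-cache code: `PreparedEntry`, `prepare`, and `run` are hypothetical names, and the immediate reclamation of the unpinned entry relies on CPython reference counting.

```python
import weakref


class PreparedEntry:
    """Hypothetical stand-in for a cached prepared statement."""

    def __init__(self, query):
        self.query = query


def prepare(cache, key, invalidate, pin):
    entry = cache[key]
    pinned = entry if pin else None  # strong reference held for the whole call
    weak = weakref.ref(entry)        # weak handle derived from the entry
    del entry                        # only the cache (and the pin) keep it alive
    invalidate()                     # simulates invalidation racing with prepare()
    return weak() is not None        # can the weak handle still be promoted?


def run(pin):
    cache = {"stmt": PreparedEntry("SELECT ...")}
    return prepare(cache, "stmt", cache.clear, pin)


unpinned_ok = run(pin=False)  # the entry is gone once the cache is cleared
pinned_ok = run(pin=True)     # the pin keeps the entry alive across invalidation
```

Holding the pin costs nothing on the normal path and avoids a retry loop, which matches the trade-off the commit message describes.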
Piotr Dulikowski	37f8cdf485	Merge 'test.py: fix unawaited ScyllaLogFile.grep() coroutines' from Andrei Chekun Fixed several places where ScyllaLogFile.grep() was called without await, resulting in checking coroutine objects for truthiness instead of actual log matches. Fixes: SCYLLADB-903 No backport, framework fix and one test fix. Closes scylladb/scylladb#28909 * github.com:scylladb/scylladb: test.py: fix unawaited ScyllaLogFile.grep() coroutines tests: fix test_group0_recovers_after_partial_command_application	2026-03-10 12:29:23 +01:00
Dario Mirovic	f72081194c	db: use prefix tombstones in DROP TABLE schema mutations When dropping a table, make_drop_table_or_view_mutations() creates a point tombstone in system_schema.columns for every column in the table. The clustering key of system_schema.columns is (table_name, column_name). A clustering key with only the table_name component acts as a prefix tombstone. That tombstone covers all columns belonging to that table. This approach is already used by make_table_deleting_mutations() during CREATE TABLE. Apply the same prefix tombstone approach to DROP TABLE for the columns, view_virtual_columns, computed_columns, and dropped_columns schema tables. This reduces tombstone accumulation in schema table sstables. In test_max_cells test case, which repeatedly creates and drops a table with 32768 columns, overall test time improved from ~180s to ~157s, which is ~12.7% improvement. Refs SCYLLADB-815 Closes scylladb/scylladb#28976	2026-03-10 11:59:00 +01:00
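The difference between point tombstones and a prefix tombstone can be modelled with a small Python sketch. It only models the covering rule for a clustering key of (table_name, column_name) stated in the commit above; the helper names `covers` and `live_rows` and the sample rows are hypothetical, and nothing here reflects the actual mutation code.

```python
def covers(tombstone, clustering_key):
    """A tombstone covers every key that starts with the tombstone's components."""
    return clustering_key[:len(tombstone)] == tombstone


def live_rows(rows, tombstones):
    return [r for r in rows if not any(covers(t, r) for t in tombstones)]


columns = [f"c{i}" for i in range(5)]
# Rows of a system_schema.columns-like table, keyed by (table_name, column_name).
rows = [("tbl", c) for c in columns] + [("other", "c0")]

# One point tombstone per column of the dropped table ...
point_tombstones = [("tbl", c) for c in columns]
# ... versus a single prefix tombstone holding only the table_name component.
prefix_tombstones = [("tbl",)]

after_point = live_rows(rows, point_tombstones)
after_prefix = live_rows(rows, prefix_tombstones)
```

Both variants delete the same rows, but the prefix variant writes one tombstone regardless of the column count, which is where the saving for a 32768-column table comes from.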
Gleb Natapov	b59b3d4f8a	service level: remove version 1 service level code	2026-03-10 10:46:48 +02:00
Gleb Natapov	b633ec1779	features: move GROUP0_SCHEMA_VERSIONING to deprecated features list	2026-03-10 10:46:48 +02:00
Gleb Natapov	40ec0d4942	migration_manager: remove unused forward definitions	2026-03-10 10:46:48 +02:00
Gleb Natapov	aa9eb0ef8c	test: remove unused code	2026-03-10 10:46:48 +02:00
Gleb Natapov	4660f908f9	auth: drop auth_migration_listener since it does nothing now	2026-03-10 10:46:48 +02:00
Gleb Natapov	74b5a8d43d	schema: drop schema_registry_entry::maybe_sync() function Schema is synced through group0 now. Drop all the tests of the function as well.	2026-03-10 10:46:47 +02:00
Gleb Natapov	b9f3281af6	schema: drop make_table_deleting_mutations since it should not be needed with raft Also remove the test since it is no longer relevant	2026-03-10 10:46:47 +02:00
Gleb Natapov	f76199e5c2	schema: remove calculate_schema_digest function It is used by the test only, so remove the test and its data as well.	2026-03-10 10:46:47 +02:00
Gleb Natapov	08e33ad7f7	schema: drop recalculate_schema_version function and its uses There is no need to recalculate schema version any more since it is set by group0.	2026-03-10 10:46:39 +02:00
Gleb Natapov	7bb334a5dd	migration_manager: drop check for group0_schema_versioning feature We do not allow upgrading from a version that does not have it any longer.	2026-03-10 10:39:59 +02:00
Gleb Natapov	4402b030ae	cdc: drop usage of cdc_local table and v1 generation definition	2026-03-10 10:39:59 +02:00
Gleb Natapov	6769615ff1	storage_service: no need to add yourself to the topology during reboot since raft state loading already did it	2026-03-10 10:39:59 +02:00
Gleb Natapov	33fbda9f3b	storage_service: remove unused functions	2026-03-10 10:39:58 +02:00
Gleb Natapov	0e3e7be335	group0: drop with_raft() function from group0_guard since it always returns true now Also drop the code that assumed that the function can return false.	2026-03-10 10:39:58 +02:00
Gleb Natapov	4e56ca3c76	gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more They were used by legacy topology and cdc code only.	2026-03-10 10:39:58 +02:00
Gleb Natapov	77f8f952b2	gossiper: drop tokens from loaded_endpoint_state	2026-03-10 10:39:58 +02:00
Gleb Natapov	706754dc24	gossiper: remove unused functions	2026-03-10 10:39:58 +02:00
Gleb Natapov	8ee4cdd4b7	storage_service: do not pass loaded_peer_features to join_topology() They are not used there any longer.	2026-03-10 10:39:58 +02:00
Gleb Natapov	24c01f2289	storage_service: remove unused fields from replacement_info	2026-03-10 10:39:58 +02:00
Gleb Natapov	2d8722d204	gossiper: drop is_safe_for_restart() function and its use The function checks that the node's state is not left or removed in gossiper during restart, but with raft topology a removed node will not be able to contact the cluster to get this information since it will be banned.	2026-03-10 10:39:58 +02:00
Gleb Natapov	6f739a8ee4	storage_service: remove unused variables from join_topology	2026-03-10 10:39:58 +02:00
Gleb Natapov	d35b83bec8	gossiper: remove the code that was only used in gossiper topology The topology state machine is always present now and can be passed to the gossiper during creation.	2026-03-10 10:39:58 +02:00
Gleb Natapov	390eb46c1a	storage_service: drop the check for raft mode from recovery code In non raft mode the node will not boot at all, so the check is redundant now.	2026-03-10 10:39:58 +02:00
Gleb Natapov	6a7e850161	cdc: remove legacy code The patch removes test/boost/cdc_generation_test.cc since it unit tests cdc::limit_number_of_streams_if_needed function which is removed here.	2026-03-10 10:38:57 +02:00
Gleb Natapov	0b508c5f96	test: remove unused injection points Also remove test_auth_raft_command_split test which is irrelevant since `5ba7d1b116` because it does not use the function that injects max sized command after the commit.	2026-03-10 10:09:39 +02:00
Gleb Natapov	1d188f0394	auth: remove legacy auth mode and upgrade code A system needs to be upgraded to use v2 auth before moving to this ScyllaDB version otherwise the boot will fail.	2026-03-10 10:09:39 +02:00
Gleb Natapov	02fc4ad0a9	treewide: remove schema pull code since we never pull schema any more Schema pull was used by legacy schema code which is not supported for a long time now and during legacy recovery which is no longer supported as well. It can be dropped now.	2026-03-10 10:09:39 +02:00
Gleb Natapov	0cf726c81f	raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer	2026-03-10 10:09:39 +02:00
Gleb Natapov	60a861c518	group0: hoist the checks for an illegal upgrade into main.cc The checks are spread around now, but having them in one place and done as early as possible simplifies the logic.	2026-03-10 10:09:39 +02:00
Gleb Natapov	1ff98c89e3	api: drop get_topology_upgrade_state and always report upgrade status as done Non upgraded version will not boot any longer.	2026-03-10 10:09:38 +02:00
Gleb Natapov	be153a4eb7	service_level_controller: drop service level upgrade code We do not allow upgrade from a version that is not updated yet, so the code is not used any longer.	2026-03-10 10:09:38 +02:00
Gleb Natapov	61cc091364	test: drop run_with_raft_recovery parameter to cql_test_env It is unused.	2026-03-10 10:09:38 +02:00
Gleb Natapov	00083b42a7	group0: get rid of group0_upgrade_state Simplify code by getting rid of group0_upgrade_state since upgrade is no longer supported, so no need to track its state. A non-upgraded node will simply not boot, and to detect that the patch checks the state directly from the system table.	2026-03-10 10:09:38 +02:00
Gleb Natapov	d4b55de214	storage_service: drop topology_change_kind as it is no longer needed The mode is always raft, so no need to keep a variable that tracks that.	2026-03-10 10:09:38 +02:00
Gleb Natapov	68ea6aa0a6	storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more	2026-03-10 10:09:38 +02:00
Gleb Natapov	06652948f3	service_storage: remove unused functions raft_topology_change_enabled and upgrade_state_to_topology_op_kind are not used any more. Remove the code.	2026-03-10 10:09:38 +02:00
Gleb Natapov	e8c72b7ba0	storage_service: remove non raft rebuild code Only raft is supported now.	2026-03-10 10:09:38 +02:00
Gleb Natapov	49ebab971d	storage_service: set topology change kind only once The only supported mode is topology_change_kind::raft, so always set it in storage_service::join_cluster during join or regular boot. Drop the check for legacy mode from raft_group0::setup_group0_if_exist since the mode will not be set at this point any longer. The wrong upgrade will still be detected in storage_service::join_cluster where topology.upgrade_state is checked directly.	2026-03-10 10:09:38 +02:00
Gleb Natapov	4e072977d4	group0: drop in_recovery function and its uses Legacy recovery procedure is no longer supported and the code can be dropped.	2026-03-10 10:09:38 +02:00
Gleb Natapov	770762edd8	group0: rename use_raft to maintenance_mode and make it sync group0_upgrade_state::recovery is now used only in maintenance mode so rename the function to indicate it. Also there is no preemption point in the function any more and it can be a regular function, not a co-routine.	2026-03-10 10:09:33 +02:00
Pavel Emelyanov	61af7c8300	test: Re-sort comments around do_test_streaming_scopes() The test description of refreshing test is very elaborated and it's worth having it as the description of the streaming scopes test itself. Callers of the helper can go with smaller descriptions. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-10 10:00:09 +03:00
Pavel Emelyanov	5ce3597c25	test: Split do_load_sstables() This helper does two things -- sorts sstables per server according to scope in use and calls sstables_storage.restore(). The code looks better if the sorting of sstables stays in a helper and the call for .restore() is moved to the caller. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-10 10:00:09 +03:00
Pavel Emelyanov	8c1fb2b39a	test: Drop load_fn argument from do_load_sstables() Now all callers provide the sstables_storage argument and the load_fn is effectively unused. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-10 09:59:08 +03:00
Pavel Emelyanov	59051ccc28	test: Re-use do_test_streaming_scopes() in refresh test Now it's possible to replace the whole body of the test_refresh_with_streaming_scopes() test by calling the corresponding helper function from backup/restore test module. This helper does exactly the same, and the SSTablesOnLocalStorage class provides the necessary save/restore implementations. One more thing to mention -- the refreshing test for some reason only wants to run with restored min-tablet-count equal to the original one. The do_test_streaming_scopes() needs to account for that, as it runs the tests for more options. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-10 09:59:07 +03:00
Pavel Emelyanov	f6f1cb0391	test: Introduce SSTablesOnLocalStorage This class implements some of the sstables manipulations performed by test_refresh_with_streaming_scopes(). It's here to facilitate next patch that will use it to call do_test_streaming_scopes() helper. This patch moves two blocks of code out of the test into this new class. The shutil.rmtree(tmpbackup) is seemingly lost, but it really isn't -- the tmpbackup variable holds a name of a _subdir_ inside servers' workdirs. This path doesn't really exist on disk on its own, so removing it is a no-op. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-10 09:58:40 +03:00
Pavel Emelyanov	dae4da1810	test: Introduce SSTablesOnObjectStorage The class in question performs two operations for do_test_streaming_scopes(): saves sstables and restores them. Current caller of the helper is the test_restore_with_streaming_scopes() test that needs to back up sstables on object storage and restore them from there with the restoration API. The SSTablesOnObjectStorage class does exactly that. The change in do_load_sstables() that checks for sstables_storage to be non None is needed to keep test_refresh_with_streaming_scopes() working -- that test doesn't provide sstables_storage (yet) and the function in question will call the load_fn callback. Next patch will eliminate it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-10 09:58:39 +03:00
Pavel Emelyanov	5a033dea47	test: Move test_restore_with_streaming_scopes() into do_test_streaming_scopes() The body of this test is duplicated by test_refresh_with_streaming_scopes() test from other module. Keeping it in a non-test top-level function will help generalizing these two tests. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-10 09:57:53 +03:00
Nadav Har'El	d78ea3d498	test/cqlpy: mark test_unbuilt_index_not_used not strictly xfail A few days ago, in commit `7b30a3981b` we added to pytest.ini the option xfail_strict. This option causes every time a test XPASSes, i.e., an xfail test actually passes - to be considered an error and fail the test. But some tests demonstrate a timing-related bug and do not reproduce the bug every single time. An example we noticed in one CI run is: test/cqlpy/test_secondary_index.py::test_unbuilt_index_not_used This test reproduces a timing-related bug (if you read from a secondary index "too quickly" you can get wrong results), but only about 90% of the time, not 100% of the time. The solution is to add "strict=False" for the xfail marker on this specific test. This undoes the xfail_strict for this specific test, accepting that this specific test can either pass or fail. Note that this does NOT make this test worthless - we still see this test failing most of the time, and when a developer finally fixes this issue, the test will begin to pass all the time. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-956 (we'll probably need to follow up this fix with the same fix for other xfail tests that can sometimes pass). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28942	2026-03-09 22:48:20 +02:00
Avi Kivity	01ddc17ab9	Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros When we generate view updates, we check whether we can skip the entire view update if all columns selected by the view are unmodified. However, for collection columns, we only check if they were unset before and after the update. In this patch we add a check for the actual collection contents. We perform this check for both virtual and non-virtual selections. When the column is only a virtual column in the view, it would be enough to check the liveness of each collection cell, however for that we'd need to deserialize the entire collection anyway, which should be effectively as expensive as comparing all of its bytes. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808 Closes scylladb/scylladb#28839 * github.com:scylladb/scylladb: mv: allow skipping view updates when a collection is unmodified mv: allow skipping view updates if an empty collection remains unset	2026-03-09 22:46:01 +02:00
Botond Dénes	13ff9c4394	db,compaction: use utils::chunked_vector for cache invalidation ranges Instead of dht::partition_ranges_vector, which is an std::vector<> and have been seen to cause large allocations when calculating ranges to be invalidated after compaction: seastar_memory - oversized allocation: 147456 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:840 seastar::memory::cpu_pages::check_large_allocation(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:903 (inlined by) seastar::memory::cpu_pages::allocate_large(unsigned int, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:910 (inlined by) seastar::memory::allocate_large(unsigned long, bool) at ./build/release/seastar/./seastar/src/core/memory.cc:1533 (inlined by) seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./seastar/src/core/memory.cc:1679 seastar::memory::allocate(unsigned long) at ././seastar/src/core/memory.cc:1698 (inlined by) operator new(unsigned long) at ././seastar/src/core/memory.cc:2440 (inlined by) std::__new_allocator<interval<dht::ring_position>>::allocate(unsigned long, void const) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/new_allocator.h:151 (inlined by) std::allocator<interval<dht::ring_position>>::allocate(unsigned long) at 
/usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/allocator.h:203 (inlined by) std::allocator_traits<std::allocator<interval<dht::ring_position>>>::allocate(std::allocator<interval<dht::ring_position>>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/alloc_traits.h:614 (inlined by) std::_Vector_base<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/stl_vector.h:387 (inlined by) std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>>::reserve(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/vector.tcc:79 dht::to_partition_ranges(utils::chunked_vector<interval<dht::token>, 131072ul> const&, seastar::bool_class<utils::can_yield_tag>) at ./dht/i_partitioner.cc:347 compaction::compaction::get_ranges_for_invalidation(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>> const&) at ./compaction/compaction.cc:619 (inlined by) compaction::compaction::get_compaction_completion_desc(std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>, std::vector<seastar::lw_shared_ptr<sstables::sstable>, std::allocator<seastar::lw_shared_ptr<sstables::sstable>>>) at ./compaction/compaction.cc:719 (inlined by) compaction::regular_compaction::replace_remaining_exhausted_sstables() at ./compaction/compaction.cc:1362 compaction::compaction::finish(std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>, std::chrono::time_point<db_clock, std::chrono::duration<long, std::ratio<1l, 1000l>>>) at ./compaction/compaction.cc:1021 compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0::operator()() at ./compaction/compaction.cc:1960 (inlined by) 
compaction::compaction_result std::__invoke_impl<compaction::compaction_result, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(std::__invoke_other, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:63 (inlined by) std::__invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type std::__invoke<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/bits/invoke.h:98 (inlined by) decltype(auto) std::__apply_impl<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&, std::integer_sequence<unsigned long, ...>) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2920 (inlined by) decltype(auto) std::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0, std::tuple<>>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/tuple:2935 (inlined by) seastar::future<compaction::compaction_result> seastar::futurize<compaction::compaction_result>::apply<compaction::compaction::run(std::unique_ptr<compaction::compaction, 
std::default_delete<compaction::compaction>>)::$_0>(compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&, std::tuple<>&&) at ././seastar/include/seastar/core/future.hh:1930 (inlined by) seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()::operator()() const at ././seastar/include/seastar/core/thread.hh:267 (inlined by) seastar::noncopyable_function<void ()>::direct_vtable_for<seastar::futurize<std::invoke_result<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>::type>::type seastar::async<compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0>(seastar::thread_attributes, compaction::compaction::run(std::unique_ptr<compaction::compaction, std::default_delete<compaction::compaction>>)::$_0&&)::'lambda'()>::call(seastar::noncopyable_function<void ()> const) at ././seastar/include/seastar/util/noncopyable_function.hh:138 seastar::noncopyable_function<void ()>::operator()() const at ./build/release/seastar/./seastar/include/seastar/util/noncopyable_function.hh:224 (inlined by) seastar::thread_context::main() at ./build/release/seastar/./seastar/src/core/thread.cc:318 dht::partition_ranges_vector is used on the hot path, so just convert the problematic user -- cache invalidation -- to use utils::chunked_vector<dht::partition_range> instead. Fixes: SCYLLADB-121 Closes scylladb/scylladb#28855	2026-03-09 22:04:54 +02:00
Andrei Chekun	8acba40c84	test.py: fix unawaited ScyllaLogFile.grep() coroutines Fixed several places where ScyllaLogFile.grep() was called without await, resulting in checking coroutine objects for truthiness instead of actual log matches. Fixes: SCYLLADB-903	2026-03-09 19:41:07 +01:00
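Why an unawaited grep() call silently passes a truthiness check can be shown with a self-contained Python sketch. `FakeLogFile` is a hypothetical stand-in and not the real ScyllaLogFile class; the only assumption taken from the commit above is that grep() is a coroutine function returning the matches.

```python
import asyncio


class FakeLogFile:
    """Hypothetical stand-in for a log file with an async grep()."""

    def __init__(self, lines):
        self._lines = lines

    async def grep(self, pattern):
        return [line for line in self._lines if pattern in line]


async def main():
    log = FakeLogFile(["INFO started", "INFO done"])

    # Bug: without await, grep() returns a coroutine object, and any
    # coroutine object is truthy, whether or not the log has a match.
    unawaited = log.grep("ERROR")
    buggy_result = bool(unawaited)
    unawaited.close()  # suppress the "coroutine was never awaited" warning

    # Fix: await the coroutine and test the actual list of matches.
    fixed_result = bool(await log.grep("ERROR"))
    return buggy_result, fixed_result


buggy_result, fixed_result = asyncio.run(main())
```

Here the log contains no "ERROR" line, so the buggy check reports a match that does not exist while the awaited check correctly reports none.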
Andrei Chekun	224a11be65	tests: fix test_group0_recovers_after_partial_command_application Due to the fact that grep logs was not awaited this issue was masked. With adding await for log grep it started to fail. This PR fixes the test.	2026-03-09 19:41:07 +01:00
Nadav Har'El	16e7a88a02	test/alternator: fix do_test() in test_streams.py Many tests in test/alternator/test_streams.py use a do_test() function which performs a user-defined function that runs some write requests, and then verifies that the expected output appears on the stream. Because DynamoDB drops do-nothing changes from the stream - such as writing to an item a value that it already has - these tests need to write to a different item each time, so do_test() invents a random key and passes it to the user-defined function to use. But... we had a bug, the random number generation was done only once, instead of every time. The fix is to do the random number generation on every call. We never noticed this bug when each test used a brand new table. But the next patch will make the tests share the test table, and tests start to fail. It's especially visible if you run the same test twice against DynamoDB, e.g., test/alternator/run --count 2 --aws \ test_streams.py::test_streams_putitem_keys_only Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-03-09 19:21:53 +02:00
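The "generated only once" bug class from the commit above can be sketched as follows. How the key was actually computed once in test_streams.py is not stated in the commit message, so the module-level `_KEY` below is only one hypothetical way such a bug can arise; `random_key`, `do_test_buggy`, `do_test_fixed`, and `echo` are illustrative names.

```python
import random
import string


def random_key():
    return "".join(random.choices(string.ascii_lowercase, k=12))


_KEY = random_key()  # hypothetical bug: evaluated once, then reused by every call


def do_test_buggy(user_function):
    return user_function(_KEY)


def do_test_fixed(user_function):
    return user_function(random_key())  # a fresh key on every call


def echo(key):
    return key


buggy_keys = {do_test_buggy(echo) for _ in range(3)}
fixed_keys = {do_test_fixed(echo) for _ in range(3)}
```

With a brand new table per test the reused key is harmless, but on a shared table a second write of the same value to the same item is a do-nothing change that never reaches the stream.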
Łukasz Paszkowski	147b355326	replica/table: avoid computing token range side in storage_group_of() on hot path `storage_group_of()` is on the replica-side token lookup hot path but used `tablet_map::get_tablet_id_and_range_side()`, which computes both tablet id and post-split range side. Most callers only need the storage group id. Switch `storage_group_of()` to use `get_tablet_id()` via `tablet_id_for_token()`, and select the compaction group via new overloads that compute the range side only when splitting mode is active.	2026-03-09 17:59:36 +01:00
Łukasz Paszkowski	419e9aa323	replica/compaction_group: add lazy select_compaction_group() overloads Change `storage_group::select_compaction_group()` to accept a token (and tablet_map) and compute the tablet range side only when splitting_mode() is active. Add an overload for selecting the compaction group for an sstable spanning a token range.	2026-03-09 17:59:36 +01:00
Łukasz Paszkowski	3f70611504	locator/tablets: add tablet_map::get_tablet_range_side() Add `tablet_map::get_tablet_range_side(token)` to compute the post-split range side without computing the tablet id. Pure addition, no behavior change.	2026-03-09 17:59:36 +01:00
Jakub Smolar	7cdd979158	db/config: announce ms format as highest supported Uncomment the feature flag check in get_highest_supported_format() to return MS format when supported, otherwise fall back to ME.	2026-03-09 17:12:09 +01:00
Michał Chojnowski	949fc85217	db/config: enable `ms` sstable format by default Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes. Make them the new default. If we change our mind, this change can be reverted later.	2026-03-09 17:12:09 +01:00
Michał Chojnowski	6b413e3959	cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format Trie-based indexes and older indexes have a difference in metrics, and the test uses the metrics to check for bypass cache. To choose the right metrics, it uses highest_supported_sstable_format, which is inappropriate, because the sstable format chosen for writes by Scylla might be different than highest_supported_sstable_format. Use chosen_sstable_format instead.	2026-03-09 17:12:09 +01:00
Michał Chojnowski	b89840c4b9	api/system: add /system/chosen_sstable_version Returns the sstable version currently chosen for use in new sstables. We are adding it because some tests want to know what format they are writing (tests using upgradesstable, tests which check stats that only apply to one of the index types, etc). (Currently they are using `highest_supported_sstable_format` for this purpose, which is inappropriate, and will become invalid if a non-latest format is the default).	2026-03-09 17:12:09 +01:00
Michał Chojnowski	9280a039ee	test/cluster/dtest: reduce num_tokens to 16 cluster.dtest_alternator_tests.test_slow_query_logging performs a bootstrap with 768 token ranges. It works with `me` sstables, which have 2 open file descriptors per open sstable, but with `ms` sstables, which have 3 open file descriptors per open sstable, it fails with EMFILE. To avoid this problem, let's just decrease the number of vnodes in the test suite. It's appropriate anyway, because it avoids some unneeded work without weakening the tests. (Note: pylib-based tests have been setting `num_tokens` to 16 for a long time too). This breaks `bypass_cache_test`, which is written in a way that expects a certain number of token ranges. We adjust the relevant parameter accordingly.	2026-03-09 17:12:09 +01:00
Marcin Maliszkiewicz	96a2b0e634	test: add tests for global group0_batch barrier feature Runtime: 16s in dev mode	2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz	6723ced684	qos: switch service levels write paths to use global group0_batch barrier This ensures that we return auth functions only after we wait until all live nodes apply our mutations.	2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz	fe79fdf090	auth: switch write paths to use global group0_batch barrier This ensures that we return auth functions only after we wait until all live nodes apply our mutations.	2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz	4c8681a927	raft: add function to broadcast read barrier request This function ensures that all alive nodes have executed a read barrier. It will be useful for the following commits, which will eventually delay returning the response to the user until mutations are applied on other nodes, so that the user may perceive better data consistency across nodes.	2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz	cbae84a926	raft: add gossiper dependency to raft_group0_client In the following commit, raft_group0_client will send a read barrier RPC to all alive nodes; it takes the list of nodes from the gossiper.	2026-03-09 15:15:59 +01:00
Marcin Maliszkiewicz	8422fbca9f	raft: add read barrier RPC The RPC does read barrier on a destination node. It will be issued in following commits to live nodes to assure that command was applied everywhere.	2026-03-09 15:15:59 +01:00
Michał Chojnowski	ff60a5f1e5	cql3: suggest ALTER MATERIALIZED VIEW to users trying to use ALTER TABLE on a view When a user tries to use ALTER TABLE on a materialized view, the resulting error message is `Cannot use ALTER TABLE on Materialized View`. The intention behind this error is that ALTER MATERIALIZED VIEW should be used instead. But we observed that some users interpret this error message as a general "You cannot do any ALTER on this thing". This patch enhances the error message (and others similar to it) to prevent the confusion. Closes scylladb/scylladb#28831	2026-03-09 15:07:21 +01:00
Botond Dénes	1e41db5948	Merge 'service: tasks: return successful status if a table was dropped' from Aleksandra Martyniuk tablet_virtual_task::wait throws if a table on which a tablet operation was working is dropped. Treat the tablet operation as successful if a table is dropped. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-494 Needs backport to all live releases Closes scylladb/scylladb#28933 * github.com:scylladb/scylladb: test: add test_tablet_repair_wait_with_table_drop service: tasks: return successful status if a table was dropped	2026-03-09 16:04:44 +02:00
Piotr Dulikowski	23ed0d4df8	Merge 'vector_search: fix TLS server name with IP' from Karol Nowacki SNI works only with DNS hostnames. Adding an IP address causes warnings on the server side. This change adds SNI only if it is not an IP address. This change has no unit tests, as this behavior is not critical, since it causes a warning on the server side. The critical part, that the server name is verified, is already covered. This PR also adds warning logs to improve future troubleshooting of connections to the vector-store nodes. Fixes: VECTOR-528 Backports to 2025.04 and 2026.01 are required, as these branches are also affected. Closes scylladb/scylladb#28637 * github.com:scylladb/scylladb: vector_search: fix TLS server name with IP vector_search: add warn log for failed ann requests	2026-03-09 15:03:22 +01:00
Asias He	e0483f6001	test: Fix coordinator assumption in do_test_tablet_incremental_repair_merge_error The first node in the cluster is not guaranteed to be the coordinator node. Hardcoding node 0 as the coordinator causes test flakiness. This patch dynamically finds the actual coordinator node and targets it for error injection, log checking, and restarts. Additionally, inject `tablet_force_tablet_count_decrease_once` across all servers to force the tablet merge process to trigger once. Fixes SCYLLADB-865 Closes scylladb/scylladb#28945	2026-03-09 15:27:45 +02:00
Marcin Maliszkiewicz	b6a7484520	docs: note eventual visibility of auth changes Mention that role and permission changes are durable but may not be immediately visible on other nodes due to asynchronous replication. Fixes: SCYLLADB-651 Closes scylladb/scylladb#28900	2026-03-09 14:07:10 +01:00
Piotr Dulikowski	42d70baad3	db: view: mutate_MV: don't hold keyspace ref across preemption Currently, the view_update_generator::mutate_MV function acquires a reference to the keyspace relevant to the operation, then it calls max_concurrent_for_each and uses that reference inside the lambda passed to that function. max_concurrent_for_each can preempt and there is no mechanism that makes sure that the keyspace is alive until the view updates are generated, so it is possible that the keyspace is freed by the time the reference is used. Fix the issue by precomputing the necessary information based on the keyspace reference right away, and then passing that information by value to the other parts of the code. It turns out that we only need to know whether the keyspace uses tablets and whether it uses a network topology strategy. Fixes: scylladb/scylladb#28925 Closes scylladb/scylladb#28928	2026-03-09 15:04:26 +02:00
Łukasz Paszkowski	826fd5d6c3	test/storage: harden out-of-space prevention tests around restart and disk-utilization transitions The tests in test_out_of_space_prevention.py are flaky. Three issues contribute: 1. After creating/removing the blob file that simulates disk pressure, the tests immediately checked derived state (e.g., "compaction_manager - Drained") without first confirming the disk space monitor had detected the utilization change. Fix: explicitly wait for "Reached/Dropped below critical disk utilization level" right after creating/removing the blob file, before checking downstream effects. 2. Several tests called `manager.driver_connect()` or omitted reconnection entirely after `server_restart()` / `server_start()`. The pre-existing driver session can silently reconnect multiple times, causing subsequent CQL queries to fail. Fix: call `reconnect_driver()` after every node restart. Additionally, call `wait_for_cql_and_get_hosts()` where CQL is used afterward, to ensure all connection pools are established. 3. Some log assertions used marks captured before a restart, so they could match pre-restart messages or miss messages emitted in the correct post-restart window. Fix: refresh marks at the right points. Apart from that, the patch fixes a typo: autotoogle -> autotoggle. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-655 Closes scylladb/scylladb#28626	2026-03-09 14:45:09 +02:00
Calle Wilund	ef795eda5b	test_encryption: Fix test_system_auth_encryption Fixes: SCYLLADB-915 Test was quite broken; Not waiting for coro:s, as well as a bunch of checks no longer even close to valid (this is a ported dtest, and not a very good one). Closes scylladb/scylladb#28887	2026-03-09 14:38:31 +02:00
Marcin Maliszkiewicz	f177259316	Merge 'vector_search: small improvements' from Karol Nowacki vector_search: small improvements This PR addresses several minor code quality issues and style inconsistencies within the vector_search module. No backport is needed as these improvements are not visible to the end user. Closes scylladb/scylladb#28718 * github.com:scylladb/scylladb: vector_search: fix names of private members vector_search: remove unused global variable	2026-03-09 11:42:35 +01:00
Botond Dénes	6bba4f7ca1	Merge 'test: cluster: util: sleep for 0.01s between writes in do_writes' from Patryk Jędrzejczak Tests use `start_writes` as a simple write workload to test that writes succeed when they should (e.g., there is no availability loss), but not to test performance. There is no reason to overload the CPU, which can lead to test failures. I suspect this function to be the cause of SCYLLADB-929, where the failures of `test_raft_recovery_user_data` (that creates multiple write workloads with `start_writes`) indicated that the machine was overloaded. The relevant observations: - two runs failed at the same time in debug mode, - there were many reactor stalls and RPC timeouts in the logs (leading to unexpected events like servers marking each other down and group0 leader changes). I didn't prove that `start_writes` really caused this, but adding this sleep should be a good change, even if I'm wrong. The number of writes performed by the test decreases 30-50 times with the sleep. Note that some other util functions like `start_writes_to_cdc_table` have such a sleep. This PR also contains some minor updates to `test_raft_recovery_user_data`. Fixes SCYLLADB-929 No backport: - the failures were observed only in master CI, - no proof that the change fixes the issue, so backports could be a waste of time. Closes scylladb/scylladb#28917 * github.com:scylladb/scylladb: test: test_raft_recovery_user_data: replace asyncio.gather with gather_safely test: test_raft_recovery_user_data: use the exclude_node API test: test_raft_recovery_user_data: drop tablet_load_stats_cfg test: cluster: util: sleep for 0.01s between writes in do_writes	2026-03-09 12:12:04 +02:00
Nadav Har'El	47e8206482	test/alternator: test, and document, Alternator's data encoding This patch adds a test file, test/alternator/test_encoding.py, testing how Alternator stores its data in the underlying CQL database. We test how tables are named, and how attributes of different types are encoded into CQL. The test, which begins with a long comment, also doubles as developer-oriented documentation on how Alternator encodes its data in the CQL database. This documentation is not intended for end-users - we do not want to officially support reading or writing Alternator tables through CQL. But once in a while, this information can come in handy for developers. More importantly, this test will also serve as a regression test, verifying that Alternator's encoding doesn't change unintentionally. If we make an unintentional change to the way that Alternator stores its data, this can break upgrades: The new code might not be able to read or write the old table with its old encoding. So it's important to make sure we never make such unintentional changes to the encoding of Alternator's data. If we ever do make intentional changes to Alternator's data encoding, we will need to fix the relevant test, but also not forget to make sure that the new code is able to read the old encoding as well. The new tests use both "dynamodb" (Alternator) and "cql" fixtures, to test how CQL sees the Alternator tables. So naturally these tests are marked "scylla_only" and skipped on DynamoDB. Fixes #19770. Closes scylladb/scylladb#28866	2026-03-09 10:50:09 +01:00
Andrzej Jackowski	6fb5ab78eb	db/config: move guardrails config to one place and reorder The motivations for this patch are as follows: - Guardrails should follow similar conventions, e.g. for config names, metrics names, testing. Keeping guardrails together makes it easier to find and compare existing guardrails when new guardrails are implemented. - The configuration is used to auto-generate the documentation (particularly, the `configuration-parameters` page). Currently, the order of parameters in the documentation is inconsistent (e.g. `minimum_replication_factor_fail_threshold` before `minimum_replication_factor_warn_threshold` but `maximum_replication_factor_fail_threshold` after `maximum_replication_factor_warn_threshold`), which can be confusing to customers. Fixes: SCYLLADB-256 Closes scylladb/scylladb#28932	2026-03-09 10:50:00 +01:00
Patryk Jędrzejczak	46b7170347	Merge 'test/pylib: centralize timeout scaling and propagate build_mode in LWT helpers' from Alex Dathskovsky This series improves timeout handling consistency across the test framework and makes build-mode effects explicit in LWT tests. (starting with LWT test that got flaky) 1. Centralize timeout scaling Introduce scale_timeout(timeout) fixture in runner.py to provide a single, consistent mechanism for scaling test timeouts based on build mode. Previously, timeout adjustments were done in an ad-hoc manner across different helpers and tests. Centralizing the logic: Ensures consistent behavior across the test suite Simplifies maintenance and reasoning about timeout behavior Reduces duplication and per-test scaling logic This becomes increasingly important as tests run on heterogeneous hardware configurations, where different build modes (especially debug) can significantly impact execution time. 2. Make scale_timeout explicit in LWT helpers Propagate scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes. Additionally: Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() for consistent polling behavior across modes Update all LWT test call sites to pass build_mode explicitly Increase default timeout values, as the previous defaults were too short and prone to flakiness, particularly under slower configurations such as debug builds Overall, this series improves determinism, reduces flakiness, and makes the interaction between build mode and test timing explicit and maintainable. 
backport: not required, just an enhancement for test.py infra Closes scylladb/scylladb#28840 * https://github.com/scylladb/scylladb: test/auth_cluster: align service-level timeout expectations with scaled config test/lwt: propagate scale_timeout through LWT helpers; scale resize waits Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes. Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes. Adjust LWT test call sites to pass scale_timeout explicitly. Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds). test/pylib: introduce scale_timeout fixture helper	2026-03-09 10:28:19 +01:00
Patryk Jędrzejczak	4c8dba15f1	Merge 'strong_consistency/state_machine: ensure and upgrade mutations schema' from Michał Jadwiszczak This patch fixes 2 issues within strong consistency state machine: - it might happen that apply is called before the schema is delivered to the node - on the other hand, the apply may be called after the schema was changed and purged from the schema registry The first problem is fixed by doing `group0.read_barrier()` before applying the mutations. The second one is solved by upgrading the mutations using column mappings in case the version of the mutations' schema is older. Fixes SCYLLADB-428 Strong consistency is in experimental phase, no need to backport. Closes scylladb/scylladb#28546 * https://github.com/scylladb/scylladb: test/cluster/test_strong_consistency: add reproducer for old schema during apply test/cluster/test_strong_consistency: add reproducer for missing schema during apply test/cluster/test_strong_consistency: extract common function raft_group_registry: allow to drop append entries requests for specific raft group strong_consistency/state_machine: find and hold schemas of applying mutations strong_consistency/state_machine: pull necessary dependencies db/schema_tables: add `get_column_mapping_if_exists()`	2026-03-09 09:49:22 +01:00
Marcin Maliszkiewicz	4150c62f29	Merge 'test_proxy_protocol: fix flaky system.clients visibility checks' from Piotr Smaron `test_proxy_protocol_port_preserved_in_system_clients` failed because it didn't see the just created connection in system.clients immediately. The last lines of the stacktrace are: ``` # Complete CQL handshake await do_cql_handshake(reader, writer) # Now query system.clients using the driver to see our connection cql = manager.get_cql() rows = list(cql.execute( f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING" )) # We should find our connection with the fake source address and port > assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients" E AssertionError: Expected to find connection from 203.0.113.200 in system.clients E assert 0 > 0 E + where 0 = len([]) ``` Explanation: we first await for the hand-made connection to be completed, then, via another connection, we're querying system.clients, and we don't get this hand-made connection in the resultset. The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper that polls via cql.run_async() until the expected row count is reached (30 s timeout, 100 ms period). Fixes: SCYLLADB-819 The flaky test is present on master and in previous release, so backporting only there. Closes scylladb/scylladb#28849 * github.com:scylladb/scylladb: test_proxy_protocol: introduce extra logging to aid debugging test_proxy_protocol: fix flaky system.clients visibility checks	2026-03-09 08:37:57 +01:00
Yaron Kaikov	977bdd6260	.github/workflows/trigger-scylla-ci: fix heredoc injection in trigger-scylla-ci workflow Move all ${{ }} expression interpolations into env: blocks so they are passed as environment variables instead of being expanded directly into shell scripts. This prevents an attacker from escaping the heredoc in the Validate Comment Trigger step and executing arbitrary commands on the runner. The Verify Org Membership step is hardened in the same way for defense-in-depth. Refs: GHSA-9pmq-v59g-8fxp Fixes: SCYLLADB-954 Closes scylladb/scylladb#28935	2026-03-08 21:34:51 +02:00
Wojciech Mitros	0008976e2f	mv: allow skipping view updates when a collection is unmodified When we generate view updates, we check whether we can skip the entire view update if all columns selected by the view are unmodified. However, for collection columns, we only check if they were unset before and after the update. In this patch we add a check for the actual collection contents. We perform this check for both virtual and non-virtual selections. When the column is only a virtual column in the view, it would be enough to check the liveness of each collection cell, however for that we'd need to deserialize the entire collection anyway, which should be effectively as expensive as comparing all of its bytes. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-808	2026-03-08 16:23:22 +01:00
Wojciech Mitros	7d1e0a2e4d	mv: allow skipping view updates if an empty collection remains unset Currently, when we generate view updates, we skip the view update if all columns selected by the view are unchanged in the base table update. However, this does not apply for collection columns - if the base table has a collection regular column, we never allow skipping generating view updates and the reason for that is missing implementation. We can easily relax this for the case where the collection was missing before and after the update - in this commit we move the check for collections after the check for missing cells.	2026-03-08 16:22:27 +01:00
Artsiom Mishuta	fda68811e8	test.py: fix strict-config argument. The ini-level strict_config was removed/never existed as a config key in pytest 8; it's only a command-line flag (and back in pytest 9). In pytest 8.3.5, the equivalent is the --strict-config CLI flag, not an ini option. Fixes SCYLLADB-955 Closes scylladb/scylladb#28939	2026-03-08 16:09:29 +02:00
Taras Veretilnyk	739dd59ebc	docs: document components_digests subcomponent and trailing digest in Scylla.db Document the new `components_digests` subcomponent (tag 12) added to the Scylla.db metadata component, which stores CRC32 digests of all checksummed SSTable component files. Also document the trailing CRC32 digest that stores digest of the scylla metadata itself.	2026-03-06 21:58:15 +01:00
Taras Veretilnyk	2b1c37396a	sstable_compaction_test: Add tests for perform_component_rewrite Add two test cases to verify the correctness of the perform_component_rewrite functionality: - test_perform_component_rewrite_single_sstable: Tests rewriting the Statistics component of a single sstable - test_perform_component_rewrite_multiple_sstables: Tests rewriting 5 out of 10 sstables	2026-03-06 21:58:15 +01:00
Taras Veretilnyk	591d13e942	sstable_test: add verification testcases of SSTable components digests persistence Adds a generic test helper that writes a random SSTable, reloads it, and verifies that the persisted CRC32 digest for each component matches the digest computed from disk. This covers the test cases for all checksummed components.	2026-03-06 21:58:15 +01:00
Taras Veretilnyk	54af4a26ca	sstables: store digest of all sstable components in scylla metadata This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in scylla metadata component. This also extends new rewrite component mechanism, to rewrite metadata with updated digest together with the component.	2026-03-06 21:58:10 +01:00
Dawid Mędrek	5feed00caa	Merge 'raft: read_barrier: update local commit_idx to read_idx when it's safe' from Patryk Jędrzejczak When the local entry with `read_idx` belongs to the current term, it's safe to update the local `commit_idx` to `read_idx`. The motivation for this change is to speed up read barriers. `wait_for_apply` executed at the end of `read_barrier` is delayed until the follower learns that the entry with `read_idx` is committed. It usually happens quickly in the `read_quorum` message. However, non-voters don't receive this message, so they have to wait for `append_entries`. If no new entries are being added, `append_entries` can come only from `fsm::tick_leader()`. For group0, this happens once every 100ms. The issue above significantly slows down cluster setups in tests. Nodes join group0 as non-voters, and then they are met with several read barriers just after a write to group0. One example is `global_token_metadata_barrier` in `write_both_read_new` performed just after `update_topology_state` in `write_both_read_old`. I tested the performance impact of this change with the following test: ```python for _ in range(10): await manager.servers_add(3) ``` It consistently takes 44-45s with the change and 50-51s without the change in dev mode. No backport: - non-critical performance improvement mostly relevant in tests, - the change requires some soak time in master. Closes scylladb/scylladb#28891 * github.com:scylladb/scylladb: raft: server: fix the repeating typo raft: clarify the comment about read_barrier_reply raft: read_barrier: update local commit_idx to read_idx when it's safe raft: log: clarify the specification of term_for	2026-03-06 18:50:08 +01:00
Aleksandra Martyniuk	40dca578c5	test: add test_tablet_repair_wait_with_table_drop	2026-03-06 15:08:29 +01:00
Piotr Smaron	f12e4ea42b	test_proxy_protocol: introduce extra logging to aid debugging In case of an error, we want to see the contents of the system.clients table to have a better understanding of what happened - whether the row(s) are really missing or maybe they are there, but 1 digit doesn't match or the row is half-written. We'll therefore query for the whole table on the CQL side, and then filter out the rows we want to later proceed with on the python side. This way we can dump the contents of the whole system.clients table if something goes south.	2026-03-06 14:50:12 +01:00
Piotr Smaron	d8cf2c5f23	test_proxy_protocol: fix flaky system.clients visibility checks `test_proxy_protocol_port_preserved_in_system_clients` failed because it didn't see the just created connection in system.clients immediately. The last lines of the stacktrace are: ``` # Complete CQL handshake await do_cql_handshake(reader, writer) # Now query system.clients using the driver to see our connection cql = manager.get_cql() rows = list(cql.execute( f"SELECT address, port FROM system.clients WHERE address = '{fake_src_addr}' ALLOW FILTERING" )) # We should find our connection with the fake source address and port > assert len(rows) > 0, f"Expected to find connection from {fake_src_addr} in system.clients" E AssertionError: Expected to find connection from 203.0.113.200 in system.clients E assert 0 > 0 E + where 0 = len([]) ``` Explanation: we first await for the hand-made connection to be completed, then, via another connection, we're querying system.clients, and we don't get this hand-made connection in the resultset. The solution is to replace the bare cql.execute() calls with await wait_for_results(), a helper that polls via cql.run_async() until the expected row count is reached (30 s timeout, 100 ms period). Fixes: SCYLLADB-819	2026-03-06 14:49:59 +01:00
Aleksandra Martyniuk	dd634c329f	service: tasks: return successful status if a table was dropped tablet_virtual_task::wait throws if a table on which a tablet operation was working is dropped. Treat the tablet operation as successful if a table is dropped.	2026-03-06 14:37:44 +01:00
Botond Dénes	4fdc0a5316	Merge 'Relax test's check_mutation_replicas() argument list' from Pavel Emelyanov The helper accepts a long list of arguments, some of which are not really needed. Also, some callers can be relaxed so that they do not pass values for arguments that already have defaults. Improving tests, not backporting Closes scylladb/scylladb#28861 * github.com:scylladb/scylladb: test: Remove passing default "expected_replicas" to check_mutation_replicas() test: Remove scope and primary-replica-only arguments from check_mutation_replicas() helper	2026-03-06 11:25:00 +02:00
Szymon Malewski	d817e56e87	vector_similarity_fcts.cc: fix strict aliasing violation in extract_float_vector Previous code performed endian conversion by bulk-copying raw bytes into a std::vector<float> and then iterating over it via a reinterpret_cast<uint32_t> pointer. Accessing float storage through a uint32_t violates C++ strict aliasing rules, giving the compiler freedom to reorder or elide the stores, causing undefined behavior. Replace the two-pass approach with a single-pass loop using seastar::consume_be<uint32_t>() and std::bit_cast<float>(), which is both well-defined and auto-vectorizable. Follow-up #28754 Closes scylladb/scylladb#28912	2026-03-06 09:15:45 +01:00
Artsiom Mishuta	5d7a73cc5b	test.py add support for non_gating tests Add support for non_gating tests (the opposite of gating, in dtest terminology) in the test.py codebase. These tests will/should not be run by any current gating job (ci/next/nightly) Closes scylladb/scylladb#28902	2026-03-06 09:39:32 +02:00
Andrei Chekun	01498a00d5	test.py: make HostRegistry singleton HostRegistry is initialized in several places in the framework, which can lead to overlapping IPs; the possibility is low, but it's not zero. This PR makes the host registry initialized once for the master thread and pytest. To avoid communication with workers, each worker will get its own subnet that it can use solely for its own purposes. This simplifies the solution while providing a way to avoid overlapping IPs. Closes scylladb/scylladb#28520	2026-03-06 09:25:29 +02:00
Artsiom Mishuta	2be4d8074d	test.py disable XFail tests on CI run This PR disables running XFAIL tests on the CI run to speed it up. The tests will continue to run on the "nightly" job and FAIL on an unexpected pass, and will continue to run on the "NEXT" job and NOT FAIL on an unexpected pass. Closes scylladb/scylladb#28886	2026-03-06 09:12:06 +02:00
Szymon Malewski	f9d213547f	cql3: selection: fix `add_column_for_post_processing` for ORDER BY The purpose of `add_column_for_post_processing` is to add columns that are required for processing of a query, but are not part of the SELECT clause and shouldn't be returned. They are added to the final result set, but later are not serialized. Mainly it is used for filtering and grouping columns, with a special case of `WHERE primary_key IN ... ORDER BY ...` when the whole result set needs additional final sorting, and ordering columns must be added as well. There was a bug that manifested in #9435, #8100 and was actually identified in #22061. In case of selection with processing (e.g. functions involved), the result set row is formed in two stages. Initially it is a list of columns fetched from replicas - on which filtering and grouping is performed. After that the actual selection is resolved and the final number of columns can change. Ordering is performed on this final shape, but the ordering column index returned by `add_column_for_post_processing` referred to the initial shape. If the selection referred to the same column twice (e.g. `v, TTL(v)` as in #9435), the final row was longer than the initial one and ordering referred to an incorrect column. If a function in the selection referred to multiple columns (e.g. as_json(.., ..) which #8100 effectively uses), the final row was shorter and ordering tried to use a non-existing column. This patch fixes the problem by making sure that the column index of the final result set is used for ordering. The previously crashing test `cassandra_tests/validation/entities/json_test.py::testJsonOrdering` doesn't have to be skipped, but now it is failing on issue #28467. Fixes #9435 Fixes #8100 Fixes #22061 Closes scylladb/scylladb#28472	2026-03-05 19:22:34 +02:00
Patryk Jędrzejczak	c8c57850d9	test: test_raft_recovery_user_data: replace asyncio.gather with gather_safely	2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak	c3aa4ed23c	test: test_raft_recovery_user_data: use the exclude_node API The API is now available.	2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak	dd75687251	test: test_raft_recovery_user_data: drop tablet_load_stats_cfg The issue has been fixed.	2026-03-05 17:13:52 +01:00
Patryk Jędrzejczak	52940c4f31	test: cluster: util: sleep for 0.01s between writes in do_writes Tests use `start_writes` as a simple write workload to test that writes succeed when they should (e.g., there is no availability loss), but not to test performance. There is no reason to overload the CPU, which can lead to test failures. I suspect this function to be the cause of SCYLLADB-929, where the failures of `test_raft_recovery_user_data` (that creates multiple write workloads with `start_writes`) indicated that the machine was overloaded. The relevant observations: - two runs failed at the same time in debug mode, - there were many reactor stalls and RPC timeouts in the logs (leading to unexpected events like servers marking each other down and group0 leader changes). I didn't prove that `start_writes` really caused this, but adding this sleep should be a good change, even if I'm wrong. The number of writes performed by the test decreases 30-50 times with the sleep. Note that some other util functions like `start_writes_to_cdc_table` have such a sleep. Fixes SCYLLADB-929	2026-03-05 17:13:40 +01:00
Calle Wilund	ab3d3d8638	build: add slirp4netns to dependencies Needed for port forwarded podman-in-podman containers [avi: - move from Dockerfile to install-dependencies.sh so non-container builds also get it - regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz ] Closes scylladb/scylladb#28870	2026-03-05 17:44:17 +02:00
Michał Jadwiszczak	37bbbd3a27	test/cluster/test_strong_consistency: add reproducer for old schema during apply	2026-03-05 13:50:20 +01:00
Michał Jadwiszczak	6aef4d3541	test/cluster/test_strong_consistency: add reproducer for missing schema during apply	2026-03-05 13:50:16 +01:00
Michał Jadwiszczak	4795f5840f	test/cluster/test_strong_consistency: extract common function	2026-03-05 13:47:43 +01:00
Michał Jadwiszczak	3548b7ad38	raft_group_registry: allow to drop append entries requests for specific raft group Similar to `raft_drop_incoming_append_entries`, the new error injection `raft_drop_incoming_append_entries_for_specified_group` skips handler for `raft_append_entries` RPC but it allows to specify id of raft group for which the requests should be dropped. The id of a raft group should be passed in error injection parameters under `value` key.	2026-03-05 13:47:43 +01:00
Michał Jadwiszczak	b0cffb2e81	strong_consistency/state_machine: find and hold schemas of applying mutations It might happen that a strong consistency command arrives at a node: - before the node knows about the schema - after the schema was changed and the old version was removed from memory To fix the first case, it's enough to perform a read barrier on group0. In the second case, we can use the column mapping to upgrade the mutation to the newer schema. Also, we should hold pointers to the schemas until we finish `_db.apply()`, so the schema is valid the whole time. We potentially have to hold multiple pointers because commands passed to `state_machine::apply()` may contain mutations for different schema versions. This commit relies on the fact that the tablet raft group and its state machine are created only after the table is created locally on the node. Fixes SCYLLADB-428	2026-03-05 13:47:40 +01:00
Tomasz Grabiec	b90fe19a42	Merge 'service: assert that tables updated via group0 use schema commitlog' from Aleksandra Martyniuk Set enable_schema_commitlog for each group0 table. Assert that group0 tables use schema commitlog in ensure_group0_schema (per each command). Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914. Needs backport to all live releases as all are vulnerable Closes scylladb/scylladb#28876 * github.com:scylladb/scylladb: test: add test_group0_tables_use_schema_commitlog db: service: remove group0 tables from schema commitlog schema initializer service: ensure that tables updated via group0 use schema commitlog db: schema: remove set_is_group0_table param	2026-03-05 13:28:13 +01:00
Botond Dénes	509f2af8db	Merge 'repair: Fix rwlock in compaction_state and lock holder lifecycle' from Raphael Raph Carvalho Consider this: - repair takes the lock holder - tablet merge fiber destroys the compaction group and the compaction state - repair fails - repair destroys the lock holder This is observed in the test: ``` repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28] ... repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551] table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found) compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge 
compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge .... scylla[10793] Segmentation fault on shard 28, in scheduling group streaming ``` The rwlock in compaction_state could be destroyed before the lock holder of the rwlock is destroyed. This causes a use-after-free when the lock holder is destroyed. To fix it, users of the repair lock are now waited for when a compaction group is being stopped. That way, the compaction group - which controls the lifetime of the rwlock - cannot be destroyed while the lock is held. Additionally, the merge completion fiber - which might remove groups - is properly serialized with incremental repair. The issue reproduces consistently with a sanitize build and no longer reproduces after the fix. Fixes #27365 Closes scylladb/scylladb#28823 * github.com:scylladb/scylladb: repair: Fix rwlock in compaction_state and lock holder lifecycle repair: Prevent repair lock holder leakage after table drop	2026-03-05 14:18:25 +02:00
Patryk Jędrzejczak	f1978d8a22	raft: server: fix the repeating typo	2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak	5a43695f6a	raft: clarify the comment about read_barrier_reply The comment could be misleading. It could suggest that the returned index is already safe to read. That's not necessarily true. The entry with the returned index could, for example, be dropped by the leader if the leader's entry with this index had a different term.	2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak	1ae2ae50a6	raft: read_barrier: update local commit_idx to read_idx when it's safe When the local entry with `read_idx` belongs to the current term, it's safe to update the local `commit_idx` to `read_idx`. The argument for safety is in the new comment above `maybe_update_commit_idx_for_read`. The motivation for this change is to speed up read barriers. `wait_for_apply` executed at the end of `read_barrier` is delayed until the follower learns that the entry with `read_idx` is committed. It usually happens quickly in the `read_quorum` message. However, non-voters don't receive this message, so they have to wait for `append_entries`. If no new entries are being added, `append_entries` can come only from `fsm::tick_leader()`. For group0, this happens once every 100ms. The issue above significantly slows down cluster setups in tests. Nodes join group0 as non-voters, and then they are met with several read barriers just after a write to group0. One example is `global_token_metadata_barrier` in `write_both_read_new` performed just after `update_topology_state` in `write_both_read_old`. Writing a test for this change would be difficult, so we trust the nemesis tests to do the job. They have already found consistency issues in read barriers. See #10578.	2026-03-05 13:06:08 +01:00
Patryk Jędrzejczak	1cbd0da519	raft: log: clarify the specification of term_for When `idx > last_idx()`, the function does an out-of-bounds access to `_log`. This may look contradictory to the current specification.	2026-03-05 13:06:07 +01:00
Michał Jadwiszczak	33a16940be	strong_consistency/state_machine: pull necessary dependencies Both the migration manager and the system keyspace will be used in the next commit. The first one is needed to execute a group0 read barrier, and we need the system keyspace to get column mappings.	2026-03-05 12:33:17 +01:00
Alex	b32ef8ecd5	test/auth_cluster: align service-level timeout expectations with scaled config Use scale_timeout_by_mode() in make_scylla_conf() to derive request_timeout_in_ms in test/pylib/scylla_cluster.py. Update test_connections_parameters_auto_update in test/cluster/auth_cluster/test_raft_service_levels.py to expect the mode-specific timeout string returned by the REST endpoint after this scaling change.	2026-03-05 13:32:15 +02:00
Alex	a66565cc42	test/lwt: propagate scale_timeout through LWT helpers; scale resize waits Pass scale_timeout explicitly through BaseLWTTester and Worker, validating it at construction time instead of relying on implicit pytest fixture injection inside helper classes. Update wait_for_phase_ops() and wait_for_tablet_count() to use scale_timeout_by_mode() so polling behavior remains consistent across build modes. Adjust LWT test call sites to pass scale_timeout explicitly. Increase default timeout values, as the previous defaults were too short and prone to flakiness under slower configurations (notably debug/dev builds).	2026-03-05 13:07:09 +02:00
Alex	73f1a65203	test/pylib: introduce scale_timeout fixture helper Introduce scale_timeout(mode) to centralize test timeout scaling logic based on build mode. The function returns a callable that scales a timeout according to the mode. This ensures consistent timeout behavior across test helpers and eliminates ad-hoc per-test scaling adjustments. Centralizing the logic improves maintainability and makes timeout behavior easier to reason about. This becomes increasingly important as we run tests on heterogeneous hardware configurations. Different build modes (especially debug) can significantly affect execution time, and having a single scaling mechanism helps keep test stability predictable across environments. No functional change beyond unifying existing timeout scaling behavior.	2026-03-05 13:07:09 +02:00
Anna Stuchlik	855c503c63	doc: fix the unified installer instructions This commit updates the documentation for the unified installer. - The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS). - The info about CentOS 7 is removed (no longer supported). - Java 8 is removed. - The example for cassandra-stress is removed (as it was already removed on other installation pages). Fixes https://github.com/scylladb/scylladb/issues/28150 Closes scylladb/scylladb#28152	2026-03-05 12:57:06 +02:00
Michał Jadwiszczak	d25be9e389	db/schema_tables: add `get_column_mapping_if_exists()` In scenarios where we want to first check whether a column mapping exists and we don't want to do flow control with exceptions, it is very wasteful to do ``` if (column_mapping_exists()) { get_column_mapping(); } ``` especially in a hot path like `state_machine::apply()` because this will execute 2 internal queries. This commit introduces the `get_column_mapping_if_exists()` function, which simply wraps the result of `get_column_mapping()` in an optional and doesn't throw an exception if the mapping doesn't exist.	2026-03-05 11:55:57 +01:00
Artsiom Mishuta	7b30a3981b	test.py: enable strict_config,xfail_strict,strict-markers This commit enables 3 strict pytest options: strict_config - any warnings encountered while parsing the pytest section of the configuration file raise errors. strict-markers - any markers not registered in the markers section of the configuration file raise errors. xfail_strict - tests marked with @pytest.mark.xfail that actually succeed fail the test suite by default. It also fixes the errors that occur after enabling these options. Closes scylladb/scylladb#28859	2026-03-05 12:54:26 +02:00
Dawid Mędrek	7564a56dc8	Merge 'tombstone_gc: allow using repair-mode tombstone gc with RF=1 tables' from Botond Dénes Currently, repair-mode tombstone-gc cannot be used on tables with RF=1. We want to make repair-mode the default for all tablet tables (and more, see https://github.com/scylladb/scylladb/issues/22814), but currently a keyspace created with RF=1 and later altered to RF>1 will end up using timeout-mode tombstone gc. This is because the repair-mode tombstone-gc code relies on repair history to determine the gc-before time for keys/ranges. RF=1 tables cannot run repairs so they will have empty repair history and consequently won't be able to purge tombstones. This PR solves this by keeping a registry of RF=1 tables and consulting this registry when creating `tombstone_gc_state` objects. If the table is RF=1, tombstone-gc will work as if the table used immediate-mode tombstone-gc. The registry is updated on each replication update. As soon as the table is not RF=1 anymore, the tombstone-gc reverts to the natural repair-mode behaviour. After this PR, tombstone-gc defaults to repair-mode for all tables, regardless of RF and tablets/vnodes. Fixes: SCYLLADB-106. New feature, no backport required. 
Closes scylladb/scylladb#22945 * github.com:scylladb/scylladb: test/{boost,cluster}: add test for tombstone gc mode=repair with RF=1 tombstone_gc: allow use of repair-mode for RF=1 tables replica/table: update rf=1 table registry in shared tombstone-gc state tombstone_gc: tombstone_gc_before_getter: consider RF when getting gc before time tombstone_gc: unpack per_table_history_maps tombstone_gc: extract _group0_gc_time from per_table_history_map tombstone_gc: drop tombstone_gc_state(nullptr) ctor and operator bool() test/lib/random_schema: use timeout-mode tombstone_gc tombstone_gc_options: add C++ friendly constructor test: move away from tombstone_gc_state(nullptr) ctor treewide: move away from tombstone_gc_state(nullptr) ctor sstable: move away from tombstone_gc_mode::operator bool() replica/table: add get_tombstone_gc_state() compaction: use tombstone_gc_state with value semantics db/row_cache: use tombstone_gc_state with value semantics tombstone_gc: introduce tombstone_gc_state::for_tests()	2026-03-05 11:50:31 +01:00
Piotr Dulikowski	a2669e9983	test: test_mv_merge_allowed: add mistakenly omitted awaits The test test_mv_merge_allowed asserts in two places that the tablet count is 2. It does so by calling an async function but, mistakenly, the returned coroutine was not awaited. The coroutine is, apparently, truthy so the assertions always passed. Fix the test to properly await the coroutines in the assertions. Fixes: SCYLLADB-905 Closes scylladb/scylladb#28875	2026-03-05 11:29:23 +01:00
Avi Kivity	5ae40caa6d	dist: tune tcp_mem to 3% of total memory in scylla-kernel-conf package tcp_mem defaults to 9% of total memory. ScyllaDB defaults to 93%. The sum is more than 100%. Fix by tuning tcp_mem to 3% of total memory. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-734 Closes scylladb/scylladb#28700	2026-03-05 12:51:04 +03:00
Patryk Jędrzejczak	bb1a798c2c	Merge 'raft: Throw stopped_error if server aborted' from Dawid Mędrek This PR solves a series of similar problems related to executing methods on an already aborted `raft::server`. They materialize in various ways: * For `add_entry` and `modify_config`, a `raft::not_a_leader` with a null ID will be returned IF forwarding is disabled. This wasn't a big problem because forwarding has always been enabled for group0, but it's something that's nice to fix. It might be relevant for strong consistency, which will heavily rely on this code. * For `wait_for_leader` and `wait_for_state_change`, the calls may hang and never resolve. A more detailed scenario is provided in a commit message. For the last two methods, we also extend their descriptions to indicate the new possible exception type, `raft::stopped_error`. This change is correct since either we enter the functions and throw the exception immediately (if the server has already been aborted), or it will be thrown upon the call to `raft::server::abort`. We fix both issues. A few reproducer tests have been included to verify that the calls finish and throw the appropriate errors. Fixes SCYLLADB-841 Backport: Although the hanging problems haven't been spotted so far (at least to the best of my knowledge), it's best to avoid running into a problem like that, so let's backport the changes to all supported versions. They're small enough. Closes scylladb/scylladb#28822 * https://github.com/scylladb/scylladb: raft: Make methods throw stopped_error if server aborted raft: Throw stopped_error if server aborted test: raft: Introduce get_default_cluster	2026-03-05 10:47:39 +01:00
Botond Dénes	cd13a911cc	test/cluster/test_data_resurrection_in_memtable.py: dump rows before check So that if the check of expected rows fails, we have a dump to look at and see what is different.	2026-03-05 11:44:02 +02:00
Botond Dénes	f375aae257	replica/database: consolidate the two database_apply error injections Into a single database_apply one. Add three parameters: * ks_name and cf_name to filter the tables to be affected * what - what to do: throw or wait This leads to smaller footprint in the code and improved filtering for table names at the cost of some extra error injection params in the tests.	2026-03-05 11:44:02 +02:00
Marcin Maliszkiewicz	c3f59e4fa1	Merge 'cql3: implement write_consistency_levels guardrails' from Andrzej Jackowski This patch series implements `write_consistency_levels_warned` and `write_consistency_levels_disallowed` guardrails, allowing the configuration of which consistency levels are unwanted for writes. The motivation for these guardrails is to forbid writing with consistency levels that don't provide high durability guarantees (like CL=ANY, ONE, or LOCAL_ONE). Neither guardrail is enabled by default, so as not to disrupt clusters that are currently using any of the CLs for writes. The warning guardrail may seem harmless, as it only adds a warning to the CQL response; however, enabling it can significantly increase network traffic (as a warning message is added to each response) and also decrease throughput due to additional allocations required to prepare the warning. Therefore, both guardrails should be enabled with care. The newly added `writes_per_consistency_level` metric, which is incremented unconditionally, can help decide whether a guardrail can be safely enabled in an existing cluster. This commit adds additional `if` instructions on the critical path. However, based on the `perf_simple_query` benchmark for writes, the difference is marginal (~40 additional instructions, which is a relative difference smaller than 0.001). 
BEFORE: ``` 291443.35 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48067 insns/op, 18885 cycles/op, 0 errors) throughput: mean= 289743.07 standard-deviation=6075.60 median= 291424.69 median-absolute-deviation=1702.56 maximum=292498.27 minimum=261920.06 instructions_per_op: mean= 48072.30 standard-deviation=21.15 median= 48074.49 median-absolute-deviation=12.07 maximum=48119.87 minimum=48019.89 cpu_cycles_per_op: mean= 18884.09 standard-deviation=56.43 median= 18877.33 median-absolute-deviation=14.71 maximum=19155.48 minimum=18821.57 ``` AFTER: ``` 290108.83 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48121 insns/op, 18988 cycles/op, 0 errors) throughput: mean= 289105.08 standard-deviation=3626.58 median= 290018.90 median-absolute-deviation=1072.25 maximum=291110.44 minimum=274669.98 instructions_per_op: mean= 48117.57 standard-deviation=18.58 median= 48114.51 median-absolute-deviation=12.08 maximum=48162.18 minimum=48087.18 cpu_cycles_per_op: mean= 18953.43 standard-deviation=28.76 median= 18945.82 median-absolute-deviation=20.84 maximum=19023.93 minimum=18916.46 ``` Fixes: SCYLLADB-259 Refs: SCYLLADB-739 No backport, it's a new feature Closes scylladb/scylladb#28570 * github.com:scylladb/scylladb: scylla.yaml: add write CL guardrails to scylla.yaml scylla.yaml: reorganize guardrails config to be in one place test: add cluster tests for write CL guardrails test: implement test_guardrail_write_consistency_level cql3: start using write CL guardrails cql3/query_processor: implement metrics to track CL of writes db: cql3/query_processor: add write_consistency_levels enum_sets config: add write_consistency_levels_* guardrails configuration	2026-03-05 09:55:38 +01:00
Botond Dénes	44b8cad3df	service/storage_proxy: add name of table to error message for write errors It is useful to know what table the failed write belongs to.	2026-03-05 10:51:12 +02:00
Yauheni Khatsianevich	aa85f5a9c3	test: migrating alternator ttl tests to scylla repo migrating alternator_ttl_tests.py to scylla repo as part of deprecating dtest framework migrated tests: - test_ttl_with_load_and_decommission Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-869 Closes scylladb/scylladb#28858	2026-03-05 10:04:14 +02:00
Nadav Har'El	8e32d97be6	test/alternator: fix run script The test/alternator/run script currently fails: Scylla fails to boot, complaining that "--alternator-ttl-period-in-seconds" is specified twice (which is, unfortunately, not allowed). The problem is that recently we started to set this option in test/cqlpy/run.py, for CQL's new row-level TTL, so now it is no longer needed in test/alternator/run - and in fact not allowed and we must remove it. This patch only affects the script test/alternator/run, and has no effect on running tests through test.py or Jenkins. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28868	2026-03-05 10:06:38 +03:00
Botond Dénes	9b2242c752	test/cluster/test_repair.py: fix test_repair_timtestamp_difference This test forgot to await its check() calls, which is the pass-condition of the test. Once the await was added, the test started failing. Turns out, the test was broken, but this was never discovered, because due to the missing await, the errors were not propagated. This patch adds the missing await and fixes the discovered problems: * Use cql.run_async() instead of cql.execute() * Fix json path for timestamp * Missing flush/compact Fixes: SCYLLADB-911 Closes scylladb/scylladb#28883	2026-03-05 10:04:49 +03:00
Nadav Har'El	af07718fff	test/cqlpy: fix "run --release" for versions 5.4 or older Recently we started to rely on the options "--auth-superuser-name" and "--auth-superuser-salted-password" to ensure that a cassandra/cassandra user exists for tests - without those options a default superuser no longer exists. This broke "test/cqlpy/run --release" for old releases, earlier than 5.4 (in the enterprise stream, 2024.1 or earlier), because those old releases didn't have these options. So in this patch we fix the "--release" logic that removes these options from the command line when running these old versions. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28894	2026-03-05 09:59:46 +03:00
Botond Dénes	5e7b966d37	Merge 'Remove prepare_snapshot_for_backup() helper from backup/restore tests' from Pavel Emelyanov The helper in question duplicates the functionality of `take_snapshot()` one from the same file. The only difference is that it additionally creates keyspace:table with yet another helper, but that helper is also going to be removed (as continuation of #28600 and #28608) Enhancing tests, not backporting Closes scylladb/scylladb#28834 * github.com:scylladb/scylladb: test_backup: Remove prepare_snapshot_for_backup() test_backup: Patch test_simple_backup_and_restore to use take_snapshot() test_backup: Patch backup tests to use take_snapshot() test_backup: Add helper to take snapshot on a single server	2026-03-05 06:54:07 +02:00
Dani Tweig	25fc8ef14c	Add RELENG to milestone-to-Jira sync project keys Closes scylladb/scylladb#28889	2026-03-05 06:51:21 +02:00
Calle Wilund	35aab75256	test_internode_compression: Add await for "run" coro:s Fixes: SCYLLADB-907 Closes scylladb/scylladb#28885	2026-03-05 06:50:33 +02:00
Patryk Jędrzejczak	2a3476094e	storage_service: raft_topology_cmd_handler: fix use-after-free `8e9c7397c5` made `rs` a reference, which can lead to use-after-free. The `normal_nodes` map containing the referenced value can be destroyed before the last use of `rs` when the topology state is reloaded after a context switch on some `co_await`. The following move assignment in `storage_service::topology_state_load` causes this: ``` _topology_state_machine._topology = co_await _sys_ks.local().load_topology_state(tablet_hosts); ``` This issue has been discovered in next-2026.1 CI after queueing the backport of #28558. `test_truncate_during_topology_change` failed after ASan reported a heap-use-after-free in ``` co_await _repair.local().bootstrap_with_repair(get_token_metadata_ptr(), rs.ring.value().tokens, session); ``` This test enables `delay_bootstrap_120s`, which makes the bug much more likely to reproduce, but it could happen elsewhere. No backport needed, as the only backport of #28558 hasn't been merged yet. The backport PR will cherry-pick this commit. Closes scylladb/scylladb#28772	2026-03-05 01:50:22 +01:00
Aleksandra Martyniuk	156c29f962	test: add test_group0_tables_use_schema_commitlog	2026-03-04 17:25:06 +01:00
Aleksandra Martyniuk	5306e26b83	db: service: remove group0 tables from schema commitlog schema initializer Remove group0 tables from schema commitlog schema initializer. The schema commitlog of group0 tables is ensured by set_is_group0_table.	2026-03-04 17:25:06 +01:00
Aleksandra Martyniuk	690b2c4142	service: ensure that tables updated via group0 use schema commitlog Set enable_schema_commitlog for each group0 table. Assert that group0 tables use schema commitlog in ensure_group0_schema (per each command). Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.	2026-03-04 17:25:04 +01:00
Aleksandra Martyniuk	6b3b174704	db: schema: remove set_is_group0_table param set_is_group0_table takes an enabled flag, based on which it decides whether it's a group0 table. The method is called only with enabled = true. Drop the param. For non-group0 tables nothing should be set.	2026-03-04 17:24:34 +01:00
Aleksandra Martyniuk	57f1e46204	test: cluster: tasks: await set_task_ttl Await set_task_ttl in test_tablet_repair_task_children. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-912 Closes scylladb/scylladb#28882	2026-03-04 17:58:37 +02:00
Dawid Mędrek	d44fc00c4c	raft: Make methods throw stopped_error if server aborted After the previous changes in `raft::server::{add_entry, modify_config}` (cf. SCYLLADB-841), we also go through other methods of `raft::server` and verify that they handle the aborted state properly. I found two methods that do not: (A) `wait_for_leader` (B) `wait_for_state_change` What happened before these changes? In case (A), the dangerous scenario occurred when `_leader_promise` was empty on entering the function. In that case, we would construct the promise and wait on the corresponding future. However, if the server had been already aborted before the call, the future would never resolve and we'd be effectively stuck. Case (B) is fully analogous: instead of `_leader_promise`, we'd work with `_state_change_promise`. There's probably a more proper solution to this problem, but since I'm not familiar with the internal code of Raft, I fix it this way. We can improve it further in the future. We provide two simple validation tests. They verify that after aborting a `raft::server`, the calls: * do not hang (the tests would time out otherwise), * throw raft::stopped_error. Fixes SCYLLADB-841	2026-03-04 16:28:11 +01:00
Dawid Mędrek	c200d6ab4f	raft: Throw stopped_error if server aborted Before the change, calling `add_entry` or `modify_config` on an already aborted Raft server could result in an error `not_a_leader` containing a null server ID. It was possible precisely when forwarding was disabled in the server configuration. `not_a_leader` is supposed to return the ID of the current leader, so that was wrong. Furthermore, the description of the function specified that if a server is aborted, then it should throw `stopped_error`. We fix that issue. A few small reproducer tests were provided to verify that the functions behave correctly with and without forwarding enabled. Refs SCYLLADB-841	2026-03-04 16:28:08 +01:00
Marcin Maliszkiewicz	9697b6013f	Merge 'test: add missing awaits in test_client_routes_upgrade' from Andrzej Jackowski Two calls in test_client_routes_upgrade were missing `await`, so they were never actually executed. This caused Python to emit RuntimeWarning about unawaited coroutines, and more importantly, the test skipped important verification steps, which could mask real bugs or cause flakiness. Additionally, increase 10s timeouts to 60s to avoid flakiness in slow environments. Although these tests haven't failed so far, similar issues have already been observed in other tests with too-short timeouts. Fixes: [SCYLLADB-909](https://scylladb.atlassian.net/browse/SCYLLADB-909) Backport to 2026.1, as the test is also there. [SCYLLADB-909]: https://scylladb.atlassian.net/browse/SCYLLADB-909?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#28877 * github.com:scylladb/scylladb: test: increase timeouts in test_client_routes.py test: add missing awaits in test_client_routes_upgrade	2026-03-04 15:26:34 +01:00
Szymon Malewski	212bd6ae1a	vector: Vectorize loops in similarity functions The main loops iterating over vector components were not vectorized due to: - "cannot prove it is safe to reorder floating-point operations" - "Cannot vectorize early exit loop with more than one early exit" The first issue is fixed by adding `#pragma clang fp contract(fast) reassociate(on)`, which allows the compiler to optimize floating point operations. The second issue is solved by refactoring the operations in the affected loop. Additionally, using float operations instead of double increases throughput; numerical accuracy is not the main consideration in vector search scenarios. Performance measured: - scylla built using dbuild - using https://github.com/zilliztech/VectorDBBench (modified to call `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...` without ANN search): - client concurrency 20 before: ~2250 QPS `float` operations: ~2350 QPS `compute_cosine_similarity` vectorization: ~2500 QPS `extract_float_vector` vectorization: ~3000 QPS Follow-up https://github.com/scylladb/scylladb/pull/28615 Ref https://scylladb.atlassian.net/browse/SCYLLADB-764 Closes scylladb/scylladb#28754	2026-03-04 15:14:53 +01:00
Andrzej Jackowski	221b78cb81	test: increase timeouts in test_client_routes.py Increase 10s timeouts to 60s to avoid flakiness in slow environments. Although these tests haven't failed so far, similar issues have already been observed in other tests with too-short timeouts. Test execution time is unaffected; the entire suite in `dev` takes ~30s before and after this change.	2026-03-04 13:40:30 +01:00
Andrzej Jackowski	527c4141da	test: add missing awaits in test_client_routes_upgrade Two calls in test_client_routes_upgrade were missing `await`, so they were never actually executed. This caused Python to emit RuntimeWarning about unawaited coroutines, and more importantly, the test skipped important verification steps, which could mask real bugs or cause flakiness. Fixes: SCYLLADB-909	2026-03-04 13:34:37 +01:00
Piotr Smaron	a31cb18324	db: fix UB in system.clients row sorting The comparator used to sort per-IP client rows was not a strict weak ordering (it could return true in both directions for some pairs), which makes `std::ranges::sort` behavior undefined. A concrete pair that breaks it (and is realistic in system.clients): a = (port=9042, client_type="cql") b = (port=10000, client_type="alternator") With the current comparator: cmp(a,b) = (9042 < 10000) || ("cql" < "alternator") = true || false = true cmp(b,a) = (10000 < 9042) || ("alternator" < "cql") = false || true = true So both directions are true, meaning there is no valid ordering that sort can achieve. The fix is to sort lexicographically by (port, client_type) to match the table's clustering key and ensure deterministic ordering. Closes scylladb/scylladb#28844	2026-03-04 14:10:49 +03:00
Avi Kivity	c331796d28	Merge 'Support Min < Precision for approx_exponential_histogram' from Amnon Heiman This series closes a gap in the approx_exponential_histogram implementation to cover integer values starting from small Min values. While the original implementation was focused on durations, where this limitation was not an issue, over time, there has been a growing need for histograms that cover smaller values, such as the number of SSTables or the number of items in a batch. The reason for the original limitation is inherent to the exponential histogram math. The previous code required Min to be at least Precision to avoid negative bit shifts in the exponential calculations. After this series, approx_exponential_histogram allows Min to be smaller than Precision by scaling values during indexing. The value is shifted left by log2 Precision minus log2 Min or zero whichever is larger, and the existing exponential math is applied. Bucket limits are then scaled back to the original units. This keeps insertion and retrieval O(1) without runtime branching, at the cost of repeated bucket limits for some values in the Min to Precision range. Additional tests cover the new behavior. Relates to #2785 New feature, no need to backport. Closes scylladb/scylladb#28371 * github.com:scylladb/scylladb: estimated_histogram_test.cc: add to_metrics_histogram test histogram_metrics_helper.hh: Support Min < Precision estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision estimated_histogram.hh: support Min less than Precision histograms	2026-03-04 12:43:26 +02:00
Szymon Malewski	4c4673e8f9	test: vector_similarity: Fix similarity value checks `isclose` function checks if returned similarity floats are close enough to expected value, but it doesn't `assert` by itself. Several tests missed that `assert`, effectively always passing. With this patch similarity values checks are wrapped in helper function `assert_similarity` with predefined tolerance. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-877 Closes scylladb/scylladb#28748	2026-03-04 09:53:32 +01:00
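The bug class fixed above is easy to miss in review. A minimal sketch, assuming the check is built on `math.isclose` (the commit does not say which `isclose` is used) and using an assumed tolerance value; only the name `assert_similarity` comes from the commit message, its signature here is a guess.

```python
import math

TOLERANCE = 1e-6  # assumed value; the commit does not state the tolerance

def check_without_assert(actual: float, expected: float) -> None:
    # The boolean result is discarded, so this "check" can never fail.
    math.isclose(actual, expected, rel_tol=TOLERANCE)

def assert_similarity(actual: float, expected: float,
                      rel_tol: float = TOLERANCE) -> None:
    # Wrapping the comparison in an assert makes a mismatch fail the test.
    assert math.isclose(actual, expected, rel_tol=rel_tol), (
        f"similarity {actual} not close to expected {expected}"
    )
```

Calling `check_without_assert(1.0, 2.0)` returns silently, while `assert_similarity(1.0, 2.0)` raises `AssertionError`.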
Marcin Maliszkiewicz	c7d3f80863	Merge 'auth: do not create default 'cassandra:cassandra' superuser' from Dario Mirovic This patch series removes creation of default 'cassandra:cassandra' superuser on system start. Disable creation of a superuser with default 'cassandra:cassandra' credentials to improve security. The current flow requires clients to create another superuser and then drop the default `cassandra:cassandra` role. For those who do, there is a time window where the default credentials exist. For those who do not, that role stays. We want to improve security by forcing the client to either use config to specify default values for default superuser name and password or use cqlsh over maintenance socket connection to explicitly create/alter a superuser role. The patch series: - Enable role modification over the maintenance socket - Stop using default 'cassandra' value for default superuser, skipping creation instead Design document: https://scylladb.atlassian.net/wiki/spaces/RND/pages/165773327/Drop+default+cassandra+superuser Fixes scylladb/scylla-enterprise#5657 This is an improvement. It does not need a backport. 
Closes scylladb/scylladb#27215 * github.com:scylladb/scylladb: config: enable maintenance socket in workdir by default docs: auth: do not specify password with -p option docs: update documentation related to default superuser test: maintenance socket role management test: cluster: add logs to test_maintenance_socket.py test: pylib: fix connect_driver handling when adding and starting server auth: do not create default 'cassandra:cassandra' superuser auth: remove redundant DEFAULT_USER_NAME from password authenticator auth: enable role management operations via maintenance socket client_state: add has_superuser method client_state: add _bypass_auth_checks flag auth: let maintenance_socket_role_manager know if node is in maintenance mode auth: remove class registrator usage auth: instantiate auth service with factory functors auth: add service constructor with factory functors auth: add transitional.hh file service: qos: handle special scheduling group case for maintenance socket service: qos: use _auth_integration as condition for using _auth_integration	2026-03-04 09:43:57 +01:00
Piotr Dulikowski	85dcbfae9a	Merge 'hint: Don't switch group in database::apply_hint()' from Pavel Emelyanov The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is either called directly by the hint sender, which already runs in the streaming group, or via the RPC HINT_MUTATION handler, which uses index 1 that negotiates the streaming group as well. To be sure, add a debugging check for current group being the expected one. Code cleanup, not backporting Closes scylladb/scylladb#28545 * github.com:scylladb/scylladb: hint: Don't switch group in database::apply_hint() hint_sender: Switch to sender group on stop either	2026-03-04 09:36:38 +01:00
Pavel Emelyanov	5793e305b5	test_backup: Remove prepare_snapshot_for_backup() It's now unused Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-04 11:33:43 +03:00
Pavel Emelyanov	ffbd9a3218	test_backup: Patch test_simple_backup_and_restore to use take_snapshot() This change is a bit more careful, as the test collects files from snapshot directory several times. Before patching it to use the helper, it collected _all_ the files. Now the helper only provides TOC-s, but that's fine -- the only check that relies on that may also re-collect TOC-s and compare new set with old set. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-04 11:33:43 +03:00
Pavel Emelyanov	c1b0ac141b	test_backup: Patch backup tests to use take_snapshot() Some of those tests need to update the hard-coded 'backup' snapshot name to use the one provided by take_snapshot() helper. Other than that, the patching is pretty straightforward. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-04 11:33:43 +03:00
Pavel Emelyanov	ea17c26fd9	test_backup: Add helper to take snapshot on a single server The take_snapshot() helper returns a dict(server: list[string]). When there's only one server to work with, it's more handy to just get a single list of sstables. Next patches will make use of that helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-04 11:31:39 +03:00
Botond Dénes	e7487c21e4	test/{boost,cluster}: add test for tombstone gc mode=repair with RF=1	2026-03-04 09:45:38 +02:00
Botond Dénes	5998a859f7	tombstone_gc: allow use of repair-mode for RF=1 tables Modify the methods which calculate the default gc mode as well as the one which validates whether repair-mode can be used at all, so both accept the use of repair-mode on RF=1 tables. This de-facto changes the default tombstone-gc to repair-mode for all tables. Documentation is updated accordingly. Some tests need adjusting: * cqlpy/test_select_from_mutation_fragments.py: disable GC for some test cases because this patch makes tombstones they write subject to GC when using defaults. * test/cluster/test_mv.py::test_mv_tombstone_gc_not_inherited used repair-mode as a non-default for the base table and expected the MV to revert to default. Another mode has to be used as the non-default (immediate). * test/cqlpy/test_tools.py::test_scylla_sstable_dump_schema: don't compare tombstone_gc schema extension when comparing dumped schema vs. original. The tool's schema loader doesn't have access to the keyspace definition so it will come up with different defaults for tombstone-gc. * test/boost/row_cache_test.cc::test_populating_cache_with_expired_and_nonexpired_tombstones sets tombstone expiry assuming the tombstone-gc timeout-mode default. Change the CREATE TABLE statement to set the expected mode.	2026-03-04 09:44:24 +02:00
Andrzej Jackowski	c0e94828de	scylla.yaml: add write CL guardrails to scylla.yaml Disabled by default. This change is introduced only to document the guardrail. Refs: SCYLLADB-259	2026-03-04 08:00:17 +01:00
Andrzej Jackowski	038f89ede4	scylla.yaml: reorganize guardrails config to be in one place Also change the format of the section header and add "#" to empty lines, so that in the future no one splits the section by adding new configs.	2026-03-04 08:00:17 +01:00
Andrzej Jackowski	ec42fdfd01	test: add cluster tests for write CL guardrails Most of the functionality is tested in cqlpy tests located in `test_guardrail_write_consistency_level.py`. Add two tests that require the cluster framework: - `test_invalid_write_cl_guardrail_config` checks the node startup path when incorrect `write_consistency_levels_warned` and `write_consistency_levels_disallowed` values are used. - `test_write_cl_default` checks the behavior of the default configuration using a multi-node cluster. Tests execution time: - Dev: 10s - Debug: 18s Refs: SCYLLADB-259	2026-03-04 08:00:17 +01:00
Andrzej Jackowski	446539f12f	test: implement test_guardrail_write_consistency_level Implement basic tests for write consistency level guardrails, verifying that they work for each type of write request (inserts, updates, deletes, logged batches, unlogged batches, conditional batches, and counter operations). All tests are marked as Scylla-only because they currently don't pass with Cassandra due to differences in handling superusers (see: SCYLLADB-882). Tests execution time: - Dev: 3s - Debug: 14s Refs: SCYLLADB-259 Refs: SCYLLADB-882	2026-03-04 08:00:13 +01:00
Avi Kivity	85bd6d0114	Merge 'Add multiple-shard persistent metadata storage for strongly consistent tables' from Wojciech Mitros In this series we introduce new system tables and use them for storing the raft metadata for strongly consistent tables. In contrast to the previously used raft group0 tables, the new tables can store data on any shard. The tables also allow specifying the shard where each partition should reside, which enables the tablets of strongly consistent tables to have their raft group metadata co-located on the same shard as the tablet replica. The new tables have almost the same schemas as the raft group0 tables. However, they have an additional column in their partition keys. The additional column is the shard that specifies where the data should be located. While a tablet and its corresponding raft group server resides on some shard, it now writes and reads all requests to the metadata tables using its shard in addition to the group_id. The extra partition key column is used by the new partitioner and sharder which allow this special shard routing. The partitioner encodes the shard in the token and the sharder decodes the shard from the token. This approach for routing avoids any additional lookups (for the tablet mapping) during operations on the new tables and it also doesn't require keeping any state. It also doesn't interact negatively with resharding - as long as tablets (and their corresponding raft metadata) occupy some shard, we do not allow starting the node with a shard count lower than the id of this shard. When increasing the shard count, the routing does not change, similarly to how tablet allocation doesn't change. To use the new tables, a new implementation of `raft::persistence` is added. Currently, it's almost an exact copy of the `raft_sys_table_storage` which just uses the new tables, but in the future we can modify it with changes specific to metadata (or mutation) storage for strongly consistent tables. 
The new storage is used in the `groups_manager`, which combined with the removal of some `this_shard_id() == 0` checks, allows strongly consistent tables to be used on all shards. This approach for making sure that the reads/writes to the new tables end up on the correct shards won in the balance of complexity/usability/performance against a few other approaches we've considered. They include: 1. Making the Raft server read/write directly to the database, skipping the sharder, on its shard, while using the default partitioner/sharder. This approach could let us avoid changing the schema and there should be no problems for reads and writes performed by the Raft server. However, in this approach we would input data in tables conflicting with the placement determined by the sharder. As a result, any read going through the sharder could miss the rows it was supposed to read. Even when reading all shards to find a specific value, there is a risk of polluting the cache - the rows loaded on incorrect shards may persist in the cache for an unknown amount of time. The cache may also mistakenly remember that a row is missing, even though it's actually present, just on an incorrect shard. Some of the issues with this approach could be worked around using another sharder which always returns this_shard_id() when asked about a shard. It's not clear how such a sharder would implement a method like `token_for_next_shard`, and how much simpler it would be compared to the current "identity" sharder. 2. Using a sharder depending on the current allocation of tablets on the node. This approach relies on the knowledge of group_id -> shard mapping at any point in time in the cluster. For this approach we'd also need to either add a custom partitioner which encodes the group_id in the token, or we'd need to track the token(group_id) -> shard mapping. This approach has the benefit over the one used in the series of keeping the partition key as just group_id. 
However, it requires more logic, and the access to the live state of the node in the sharder, and it's not static - the same token may be sharded differently depending on the state of the node - it shouldn't occur in practice, but if we changed the state of the node before adjusting the table data, we would be unable to access/fix the stale data without artificially also changing the state of the node. 3. Using metadata tables co-located to the strongly consistent tables. This approach could simplify the metadata migrations in the future, however it would require additional schema management of all co-located metadata tables, and it's not even obvious what could be used as the partition key in these tables - some metadata is per-raft-group, so we couldn't reuse the partition key of the strongly consistent table for it. And finding and remembering a partition key that is routed to a specific shard is not a simple task. Finally, splits and merges will most likely need special handling for metadata anyway, so we wouldn't even make use of co-located table's splits and merges. Fixes [SCYLLADB-361](https://scylladb.atlassian.net/browse/SCYLLADB-361) [SCYLLADB-361]: https://scylladb.atlassian.net/browse/SCYLLADB-361?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#28509 * github.com:scylladb/scylladb: docs: add strong consistency doc test/cluster: add tests for strongly-consistent tables' metadata persistence raft: enable multi-shard raft groups for strongly consistent tablets test/raft: add unit tests for raft_groups_storage raft: add raft_groups_storage persistence class db: add system tables for strongly consistent tables' raft groups dht: add fixed_shard_partitioner and fixed_shard_sharder raft: add group_id -> shard mapping to raft_group_registry schema: add with_sharder overload accepting static_sharder reference	2026-03-04 08:55:43 +02:00
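The routing idea in the merge description above (the partitioner encodes the shard in the token, the sharder decodes it) can be sketched as below. The bit layout, the hash, and the names `token_for` and `shard_of` are all hypothetical; the actual `fixed_shard_partitioner` and `fixed_shard_sharder` may use a different encoding, and this sketch uses unsigned 64-bit values for simplicity.

```python
import hashlib

SHARD_BITS = 16              # hypothetical: top bits of the token hold the shard
HASH_BITS = 64 - SHARD_BITS  # remaining bits hold a hash of the group id

def token_for(shard: int, group_id: str) -> int:
    """Partitioner side: build a 64-bit token that carries the shard."""
    if not 0 <= shard < (1 << SHARD_BITS):
        raise ValueError("shard does not fit in the reserved bits")
    digest = hashlib.sha256(group_id.encode()).digest()
    low = int.from_bytes(digest[:8], "big") & ((1 << HASH_BITS) - 1)
    return (shard << HASH_BITS) | low

def shard_of(token: int) -> int:
    """Sharder side: recover the shard from the token alone."""
    return token >> HASH_BITS
```

Because `shard_of` needs only the token, routing requires no lookup of the tablet mapping and keeps no state, which matches the properties the description attributes to this approach.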
Piotr Dulikowski	2fb981413a	Merge 'vector_search: test: fix HTTPS client test flakiness' from Karol Nowacki The default 100ms timeout for client readiness in tests is too aggressive. In some test environments, this is not enough time for client creation, which involves address resolution and TLS certificate reading, leading to flaky tests. This commit increases the default client creation timeout to 10 seconds. This makes the tests more robust, especially in slower execution environments, and prevents similar flakiness in other test cases. Fixes: VECTOR-547, SCYLLADB-802, SCYLLADB-825, SCYLLADB-826 Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well. Closes scylladb/scylladb#28846 * github.com:scylladb/scylladb: vector_search: test: include ANN error in assertion vector_search: test: fix HTTPS client test flakiness	2026-03-04 08:55:43 +02:00
Wojciech Mitros	38f02b8d76	mv: remove dead code in view_updates::can_skip_view_updates When we create a materialized view, we consider 2 cases: 1. the view's primary key contains a column that is not in the primary key of the base table 2. the view's primary key doesn't contain such a column In the 2nd case, we add all columns from the base table to the schema of the view (as virtual columns). As a result, all of these columns are effectively "selected" in view_updates::can_skip_view_updates. Same thing happens when we add new columns to the base table using ALTER. Because of this, we can never have !column_is_selected and !has_base_non_pk_columns_in_view_pk at the same time. And thus, the check (!column_is_selected && _base_info.has_base_non_pk_columns_in_view_pk) is always the same as (!column_is_selected). Because we immediately return after this check, the tail of this function is also never reached - all checks after the (column_is_selected) are affected by this. Also, the condition (!column_is_selected && base_has_nonexpiring_marker) is always false at the point it is called. And this in turn makes the `base_has_nonexpiring_marker` unused, so we delete it as well. It's worth considering, why did we even have `base_has_nonexpiring_marker` if it's effectively unused. We initially introduced it in `bd52e05ae2` and we (incorrectly) used it to allow skipping view updates even if the liveness of virtual columns changed. Soon after, in `5f85a7a821`, we started categorizing virtual columns as column_is_selected == true and we moved the liveness checks for virtual columns to the `if (column_is_selected)` clause, before the `base_has_nonexpiring_marker` check. We changed this because even if we have a nonexpiring marker right now, it may be changed in the future, in which case the liveness of the view row will depend on liveness of the virtual columns and we'll need to have the view updates from the time the row marker was nonexpiring. 
Closes scylladb/scylladb#28838	2026-03-04 08:55:43 +02:00
Geoff Montee	0eb5603ebd	Docs: describe the system tables Fixes issue #12818 with the following docs changes: docs/dev/system_keyspace.md: Added missing system tables, added table of contents (TOC), added categories Closes scylladb/scylladb#27789	2026-03-04 08:55:43 +02:00
Botond Dénes	f156bcddab	Merge 'test: decrease strain in test_startup_response' from Marcin Maliszkiewicz For 2025.3 and 2025.4 this test runs an order of magnitude slower in debug mode. Potentially due to passwords::check running in alien thread and overwhelming the CPU (this is fixed in newer versions). Decreasing the number of connections in the test makes it fast again, without breaking reproducibility. As an additional measure we double the timeout. The fix is now cherry-picked to master as sometimes the test fails there too. (cherry picked from commit `1f1fc2c2ac`) Fixes https://scylladb.atlassian.net/browse/SCYLLADB-795 backport: 2026.1, already on other stable branches Closes scylladb/scylladb#28848 * github.com:scylladb/scylladb: test: add more logs to test_startup_no_auth_response test: decrease strain in test_startup_response	2026-03-04 08:55:43 +02:00
Andrzej Jackowski	bb359b3b78	cql3: start using write CL guardrails Enable verification of write consistency level guardrails in `modification_statement` and `batch_statement`. Neither guardrail is enabled by default, so as not to disrupt clusters that are currently using any of the CLs for writes. The warning guardrail may seem harmless, as it only adds a warning to the CQL response; however, enabling it can significantly increase network traffic (as a warning message is added to each response) and also decrease throughput due to additional allocations required to prepare the warning. Therefore, both guardrails should be enabled with care. The newly added `writes_per_consistency_level` metric, which is incremented unconditionally, can help decide whether a guardrail can be safely enabled in an existing cluster. This commit adds additional `if` instructions on the critical path. However, based on the `perf_simple_query` benchmark for writes, the difference is marginal (~40 additional instructions, which is a relative difference smaller than 0.001). 
BEFORE: ``` 291443.35 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48067 insns/op, 18885 cycles/op, 0 errors) throughput: mean= 289743.07 standard-deviation=6075.60 median= 291424.69 median-absolute-deviation=1702.56 maximum=292498.27 minimum=261920.06 instructions_per_op: mean= 48072.30 standard-deviation=21.15 median= 48074.49 median-absolute-deviation=12.07 maximum=48119.87 minimum=48019.89 cpu_cycles_per_op: mean= 18884.09 standard-deviation=56.43 median= 18877.33 median-absolute-deviation=14.71 maximum=19155.48 minimum=18821.57 ``` AFTER: ``` 290108.83 tps ( 53.3 allocs/op, 16.0 logallocs/op, 14.2 tasks/op, 48121 insns/op, 18988 cycles/op, 0 errors) throughput: mean= 289105.08 standard-deviation=3626.58 median= 290018.90 median-absolute-deviation=1072.25 maximum=291110.44 minimum=274669.98 instructions_per_op: mean= 48117.57 standard-deviation=18.58 median= 48114.51 median-absolute-deviation=12.08 maximum=48162.18 minimum=48087.18 cpu_cycles_per_op: mean= 18953.43 standard-deviation=28.76 median= 18945.82 median-absolute-deviation=20.84 maximum=19023.93 minimum=18916.46 ``` Fixes: SCYLLADB-259	2026-03-04 07:26:00 +01:00
Asias He	225b10b683	repair: Fix rwlock in compaction_state and lock holder lifecycle Consider this: - repair takes the lock holder - tablet merge fiber destroys the compaction group and the compaction state - repair fails - repair destroys the lock holder This is observed in the test: ``` repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036] Repair 1 out of 1 tablets: table=sec_index.users range=(432345564227567615,504403158265495551] replicas=[0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea:15, 498e354c-1254-4d8d-a565-2f5c6523845a:9, 5208598c-84f0-4526-bb7f-573728592172:28] ... repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: Started to repair 1 out of 1 tables in keyspace=sec_index, table=users, table_id=ea2072d0-ccd9-11f0-8dba-c5ab01bffb77, repair_reason=repair repair - Enable incremental repair for table=sec_index.users range=(432345564227567615,504403158265495551] table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Disabled compaction for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair table - Got unrepaired compaction and repair lock for range=(432345564227567615,504403158265495551] session_id=a13a72cc-cd2d-11f0-8e9b-76d54580ab09 for incremental repair repair - repair[5d73d094-72ee-4570-a3cc-1cd479b2a036]: get_sync_boundary: got error from node=0e9d51a5-9c99-4d6e-b9db-ad36a148b0ea, keyspace=sec_index, table=users, range=(432345564227567615,504403158265495551], error=seastar::rpc::remote_verb_error (Compaction state for table [0x60f008fa34c0] not found) compaction_manager - Stopping 1 tasks for 1 ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge compaction_manager - Stopping 1 tasks for 1 
ongoing compactions for table sec_index.users compaction_group=238 due to tablet merge .... scylla[10793] Segmentation fault on shard 28, in scheduling group streaming ``` The rwlock in compaction_state could be destroyed before the lock holder of the rwlock is destroyed. This causes a use-after-free when the lock holder is destroyed. To fix it, users of the repair lock are now waited for when a compaction group is being stopped. That way, the compaction group - which controls the lifetime of the rwlock - cannot be destroyed while the lock is held. Additionally, the merge completion fiber - that might remove groups - is properly serialized with incremental repair. The issue can be reproduced consistently using a sanitize build and cannot be reproduced after the fix. Fixes #27365 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2026-03-03 21:05:15 -03:00
Raphael S. Carvalho	1d8903d9f7	repair: Prevent repair lock holder leakage after table drop Prevent repair lock holder from being leaked in repair_service when table is dropped midway. The leakage might result in use-after-free later, since the repair lock itself will be gone after table drop. The RPC verb that removes the lock on success path will not be called by coordinator after table was dropped. Refs #27365. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-896. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2026-03-03 21:05:10 -03:00
Dario Mirovic	06af4480ea	config: enable maintenance socket in workdir by default We want to enable maintenance socket by default. This will prevent users from having to reboot a server to enable it. Also, there is little point in having maintenance socket that is turned off, and we want users to use it. After this patch series, they will have to use it. Note that while config seeding exists, we do not encourage it for production deployments. This patch changes default maintenance_socket value from ignore to workdir. This enables maintenance socket without specifying an explicit path. Refs SCYLLADB-409	2026-03-04 00:01:07 +01:00
Dario Mirovic	6e83fb5029	docs: auth: do not specify password with -p option Specifying password with -p option is considered unsafe. The password will be saved in bash history. The preferred approach is to enter the password when prompted. Any approach that passes the password via command line arguments makes that password visible in process options (ps command), no matter if the password is passed directly or as an environment variable. Refs SCYLLADB-409	2026-03-04 00:01:07 +01:00
Dario Mirovic	afafb8a8fa	docs: update documentation related to default superuser Update create superuser procedure: - Remove notes about default `cassandra` superuser - Add create superuser using existing superuser section - Update create superuser by using `scylla.yaml` config - Add create superuser using maintenance socket Update password reset procedure: - Add maintenance socket approach - Remove the old approach with deleting all the roles Update enabling authentication with downtime and during runtime: - Mention creating new superuser over the maintenance socket - Remove default superuser usage Update enable authorization: - Mention creating new superuser over the maintenance socket - Remove mention of default superuser Reasoning for deletion of the old approach: - [old] Needs cluster downtime, removes all roles, needs recreation of roles, needs maintenance socket anyways, if config values are not used for superuser - [new] No cluster downtime, possibly one node restart to enable maintenance socket, faster Refs SCYLLADB-409	2026-03-04 00:01:07 +01:00
Dario Mirovic	3db74aaf5f	test: maintenance socket role management Introduce a test that covers: - Server startup without credentials config seeding with no roles created - Await maintenance socket role management to be enabled - `CREATE ROLE`, `ALTER ROLE`, and `DROP ROLE` statement execution success All the tests in the test_maintenance_socket.py module take 2-3 seconds to execute. Explicitly shut down Cluster objects to prevent 'RuntimeError: cannot schedule new futures after shutdown'. Refs SCYLLADB-409	2026-03-03 23:57:50 +01:00
Dario Mirovic	f74fe22386	test: cluster: add logs to test_maintenance_socket.py Add logs to test_maintenance_socket.py test test_maintenance_socket. This approach offers additional visibility in case of test failure. Such logs will be added to new tests in a follow up patch in this patch series. Refs SCYLLADB-409	2026-03-03 23:42:25 +01:00
Dario Mirovic	0e5ddec2a8	test: pylib: fix connect_driver handling when adding and starting server When connect_driver=False, the expected server up state should be capped to HOST_ID_QUERIED. This is to avoid waiting for CQL readiness, which requires a superuser to be present. This logic was only in ScyllaCluster.server_start. ManagerClient.server_add with start=True and connect_driver=False would still wait for CQL and hang if no superuser is present. The workaround was to call ManagerClient.server_add(start=False, connect_driver=False) followed by ManagerClient.server_start(connect_driver=False). This patch moves the capping from ScyllaCluster.server_start to ManagerClient.server_add and ManagerClient.server_start, where connect_driver is processed. ScyllaCluster only receives the already resolved expected_server_up_state value. Refs SCYLLADB-409	2026-03-03 23:42:25 +01:00
Dario Mirovic	fd17dcbec8	auth: do not create default 'cassandra:cassandra' superuser Changes the behavior of default superuser creation. Previously, without configuration 'cassandra:cassandra' credentials were used. Now default superuser creation is skipped if not configured. The two ways to create default superuser are: - Config file - auth_superuser_name and auth_superuser_salted_password fields - Maintenance socket - connect over maintenance socket and CREATE/ALTER ROLE ... Behavior changes: Old behavior: - No config - 'cassandra:cassandra' created - auth_superuser_name only - <name>:cassandra created - auth_superuser_salted_password only - 'cassandra:<password>' created - Both specified - '<name>:<password>' created New behavior: - No config - no default superuser - Requires maintenance socket setup - auth_superuser_name only - '<name>:' created WITHOUT password - Requires maintenance socket setup - auth_superuser_salted_password only - no default superuser - Both specified - '<name>:<password>' created Fixes SCYLLADB-409	2026-03-03 23:42:25 +01:00
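The new behavior listed in the commit above reduces to a small decision function. This sketch only restates the four cases from the commit message; `default_superuser` is a hypothetical name, and the parameters stand in for the `auth_superuser_name` and `auth_superuser_salted_password` config fields.

```python
from typing import Optional, Tuple

def default_superuser(
    name: Optional[str],
    salted_password: Optional[str],
) -> Optional[Tuple[str, Optional[str]]]:
    """Return (name, salted_password) for the superuser to create, or None.

    - no config                -> None (use the maintenance socket)
    - name only                -> (name, None): role created without a password
    - salted_password only     -> None
    - both                     -> (name, salted_password)
    """
    if name is None:
        # Without a configured name nothing is created, password or not.
        return None
    return (name, salted_password)
```

Under the old behavior every one of the four cases produced a role, with `cassandra` filling in whichever field was missing.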
Dario Mirovic	9dc1deccf3	auth: remove redundant DEFAULT_USER_NAME from password authenticator Remove redundant DEFAULT_USER_NAME from password_authenticator.cc file. It is just a copy of meta::DEFAULT_SUPERUSER_NAME. Refs SCYLLADB-409	2026-03-03 23:42:25 +01:00
Dario Mirovic	45628cf041	auth: enable role management operations via maintenance socket Introduce maintenance_socket_authenticator and rework maintenance_socket_role_manager to support role management operations. Maintenance auth service uses allow_all_authenticator. To allow role modification statements over the maintenance socket connections, we need to treat the maintenance socket connections as superusers and give them proper access rights. Possible approaches are: 1. Modify allow_all_authenticator with conditional logic that password_authenticator already does 2. Modify password_authenticator with conditional logic specific for the maintenance socket connections 3. Extend password_authenticator, overriding the methods that differ Option 3 is chosen: maintenance_socket_authenticator extends password_authenticator with authentication disabled. The maintenance_socket_role_manager is reworked to lazily create a standard_role_manager once the node joins the cluster, delegating role operations to it. In maintenance mode role operations remain disabled. Refs SCYLLADB-409	2026-03-03 23:41:05 +01:00
Dario Mirovic	6a1edab2ac	client_state: add has_superuser method Encapsulate the superuser check in client_state so that it respects _bypass_auth_checks. Connections that bypass auth (internal callers and the maintenance socket) are always considered superusers. Migrate existing call sites from auth::has_superuser(service, user) to client_state.has_superuser(). Also add _bypass_auth_checks handling to ensure_not_anonymous(). Refs SCYLLADB-409	2026-03-03 22:31:35 +01:00
Dario Mirovic	d765b5b309	client_state: add _bypass_auth_checks flag Authorization checks were previously skipped based on the _is_internal flag. This couples two concerns: marking client state as internal and bypassing authorization. Introduce _bypass_auth_checks to handle only the authorization bypass. Internal client state sets it to true, preserving current behavior. External client state accepts it as a constructor parameter, defaulting to false. This will allow maintenance socket connections to skip authorization without being marked as internal. Refs SCYLLADB-409	2026-03-03 22:31:35 +01:00
Dario Mirovic	b68656b59f	auth: let maintenance_socket_role_manager know if node is in maintenance mode This patch is part of preparations for dropping 'cassandra:cassandra' default superuser. When that is implemented, maintenance_socket_role_manager will have two modes of work: 1. in maintenance mode, where role operations are forbidden 2. in normal mode, where role operations are allowed To execute the role operations, the node has to join a cluster. In maintenance mode the node does not join a cluster. This patch lets maintenance_socket_role_manager know if it works under maintenance mode and returns appropriate error message when role operations execution is requested. Refs SCYLLADB-409	2026-03-03 22:31:35 +01:00
Dario Mirovic	3bef493a35	auth: remove class registrator usage This patch removes class registrator usage in auth module. It is not used after switching to factory functor initialization of auth service. Several role manager, authenticator, and authorizer name variables are removed as well, and hardcoded inside qualified_java_name method, since that is the only place they are ever used. Refs SCYLLADB-409	2026-03-03 22:31:35 +01:00
Dario Mirovic	eab24ff3b0	auth: instantiate auth service with factory functors Auth service is instantiated with the constructor that accepts service_config, which then uses class registrator to instantiate authorizer, authenticator, and role manager. This patch switches to instantiating auth service via the constructor that accepts factory functors. This is a step towards removing usage of class registrator. Refs SCYLLADB-409	2026-03-03 22:31:35 +01:00
Dario Mirovic	bfff07eacb	auth: add service constructor with factory functors Auth service can be initialized: - [current] by passing instantiated authorizer, authenticator, role manager - [current] by passing service_config, which then uses class registrator to instantiate authorizer, authenticator, role manager - This approach is easy to use with sharded services - [new] by passing factory functors which instantiate authorizer, authenticator, role manager - This approach is also easy to use with sharded services Refs SCYLLADB-409	2026-03-03 22:31:35 +01:00
Dario Mirovic	e8e00c874b	auth: add transitional.hh file In a follow-up patch in this patch series class registrator will be removed. Adding transitional.hh file will be necessary to expose the authenticator and authorizer. Refs SCYLLADB-409	2026-03-03 22:31:35 +01:00
Dario Mirovic	e5218157de	service: qos: handle special scheduling group case for maintenance socket service_level_controller has special handling for maintenance socket connections. If the current user is not a named user, it should use the default scheduling group. The reason is that the maintenance socket can communicate with Scylla before auth_integration is registered. The guard is already present, but it was omitted in get_cached_user_scheduling_group. This also fixes flakiness in test_maintenance_socket.py tests. Refs SCYLLADB-409	2026-03-03 22:31:35 +01:00
Dario Mirovic	dc9a90d7cb	service: qos: use _auth_integration as condition for using _auth_integration Maintenance socket connections can be established before _auth_integration is initialized. The fix introduced with the scylladb/scylladb#26856 PR checks the value of the user variable. For maintenance socket connections it will be an anonymous user, and will fall back to using default scheduling group. This patch changes the criteria for using default scheduling group from the user variable to checking the _auth_integration variable itself: - If _auth_integration is not initialized, use default scheduling group - If _auth_integration is initialized, let it choose the scheduling group Refs SCYLLADB-409	2026-03-03 22:31:35 +01:00
Andrzej Jackowski	371cdb3c81	cql3/query_processor: implement metrics to track CL of writes Add `write_consistency_levels_disallowed_violations` and `write_consistency_levels_warned_violations` metrics to track violations of write_consistency_levels guardrails. Add `writes_per_consistency_level` to track what CL is used by writes, regardless of the guardrails configuration. Data gathered by this metric can be used to decide whether enabling a particular write consistency level guardrail in a particular existing cluster is safe. Refs: SCYLLADB-259	2026-03-03 21:18:11 +01:00
Andrzej Jackowski	3606934458	db: cql3/query_processor: add write_consistency_levels enum_sets Add enum_sets to query_processor that track the configuration values of `write_consistency_levels_warned` and `write_consistency_levels_disallowed`. Refs: SCYLLADB-259	2026-03-03 20:28:57 +01:00
Dawid Mędrek	7fd083e329	test: raft: Introduce get_default_cluster We introduce a function creating a Raft cluster with parameters usually used by the tests. This will avoid code duplication, especially after introducing new tests in the following commits. Note that the test `test_aborting_wait_for_state_change` has changed: the previous complex configuration was unnecessary for it (I wrote it).	2026-03-03 18:50:21 +01:00
Pavel Emelyanov	b768753c0f	test: Remove passing default "expected_replicas" to check_mutation_replicas() The value of None is default, callers don't need to specify it explicitly Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-03 17:28:21 +03:00
Pavel Emelyanov	b8ae9ede63	test: Remove scope and primary-replica-only arguments from check_mutation_replicas() helper These two are only used to print into logs on error. However, their values can be found from previous logs and test execution context. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-03 17:26:25 +03:00
Karol Nowacki	45477d9c6b	vector_search: test: include ANN error in assertion When the test fails, the assertion message does not include the error from the ANN request. This change enhances the assertion to include the specific ANN error, making it easier to diagnose test failures.	2026-03-03 14:19:20 +01:00
Karol Nowacki	ab6c222fc4	vector_search: test: fix HTTPS client test flakiness The default 100ms timeout for client readiness in tests is too aggressive. In some test environments, this is not enough time for client creation, which involves address resolution and TLS certificate reading, leading to flaky tests. This commit increases the default client creation timeout to 10 seconds. This makes the tests more robust, especially in slower execution environments, and prevents similar flakiness in other test cases. Fixes: VECTOR-547, SCYLLADB-802	2026-03-03 14:19:20 +01:00
Botond Dénes	4f5310bc72	replica/table: update rf=1 table registry in shared tombstone-gc state On every update of the ERM, update the state of the current table in the registry of RF=1 tables in shared tombstone gc state. Ensures that tombstone gc stops collection of tombstones in immediate mode as soon as the table starts transitioning away from RF=1.	2026-03-03 14:09:28 +02:00
Botond Dénes	7c2c63ab43	tombstone_gc: tombstone_gc_before_getter: consider RF when getting gc before time Currently we cannot use repair-mode tombstone gc on RF=1 tables, because such tables don't need repair and so there won't be repair history to use to produce gc_before times. Introduce shared_tombstone_gc_state::_rf_one_tables which will keep a registry of RF=1 tables. Keeping this up to date is left to outside code (table.cc). Consult the registry to determine whether a table is RF=1 or not, so the repair history check can be elided for RF=1 tables. Not wired in yet into the table code.	2026-03-03 14:09:28 +02:00
Botond Dénes	074006749c	tombstone_gc: unpack per_table_history_maps It is now a class with a single member, replace usage with that of the member (through an alias to reduce churn).	2026-03-03 14:09:28 +02:00
Botond Dénes	d6e2d44759	tombstone_gc: extract _group0_gc_time from per_table_history_map Doesn't belong there. Also, having it as a separate member of shared_tombstone_gc_state makes updating _group0_gc_time cheaper, as the update doesn't have to do a copy-mutate-swap of the history maps.	2026-03-03 14:09:28 +02:00
Botond Dénes	5fd9fc3056	tombstone_gc: drop tombstone_gc_state(nullptr) ctor and operator bool() Both are ambiguous and all users were migrated away to more meaningful alternatives. They are now unused, drop them.	2026-03-03 14:09:28 +02:00
Botond Dénes	a785c0cf41	test/lib/random_schema: use timeout-mode tombstone_gc This is the current de-facto default for all tests using random schema and some are apparently relying on this. Make this explicit to avoid upsetting tests, by the impending change of this default to repair.	2026-03-03 14:09:28 +02:00
Botond Dénes	d10e622a3b	tombstone_gc_options: add C++ friendly constructor So one can create options with strong types, instead of from a map of strings.	2026-03-03 14:09:28 +02:00
Botond Dénes	6004e84f18	test: move away from tombstone_gc_state(nullptr) ctor Use for_tests() instead (or no_gc() where appropriate).	2026-03-03 14:09:28 +02:00
Botond Dénes	3c34598d88	treewide: move away from tombstone_gc_state(nullptr) ctor It is ambiguous; use the appropriate no-gc or gc-all factories instead. A special note for mutation::compacted(): according to the comment above it, it doesn't drop expired tombstones but as it is currently, it actually does. Change the tombstone gc param for the underlying call to compact_for_compaction() to uphold the comment. This is used in tests mostly, so no fallout expected. Tests are handled in the next commit, to reduce noise. Two tests in mutation_test.cc have to be updated: * test_compactor_range_tombstone_spanning_many_pages has to be updated in this commit, as it uses mutation_partition::compact_for_query() as well as compact_for_query(). The test passes default constructed tombstone_gc() to the latter while the former now uses no-gc creating a mismatch in tombstone gc behaviour, resulting in test failure. Update the test to also pass no-gc to compact_for_query(). * test_query_digest similarly uses mutation_partition::query_mutation() and another compaction method, having to match the no-gc now used in query_mutation().	2026-03-03 14:09:28 +02:00
Botond Dénes	04b001daa6	sstable: move away from tombstone_gc_mode::operator bool() It is ambiguous, use tombstone_gc_mode::is_gc_enabled() instead. Note that the two have slightly different meanings: operator bool() returned true when repair-history related functionality was enabled. This is fine, because the only two users are logs, where the two meanings are close enough. All other users were eliminated or migrated already, taking the change in meaning into account.	2026-03-03 14:09:28 +02:00
Botond Dénes	6364e35403	replica/table: add get_tombstone_gc_state() Shorthand for get_compaction_manager().get_shared_tombstone_gc_state().get_tombstone_gc_state().	2026-03-03 14:09:28 +02:00
Botond Dénes	f3ee6a0bd1	compaction: use tombstone_gc_state with value semantics Instead of passing around references to it, pass around values. This object is now designed to be used as a value-type, after recent refactoring.	2026-03-03 14:09:27 +02:00
Botond Dénes	83e20d920e	db/row_cache: use tombstone_gc_state with value semantics Instead of keeping a pointer to it. Replace nullptr with tombstone_gc_state::no_gc(). This object is now designed to be used as a value-type, after recent refactoring.	2026-03-03 14:09:27 +02:00
Botond Dénes	041ab593c7	tombstone_gc: introduce tombstone_gc_state::for_tests() To replace the usage of tombstone_gc_state(nullptr) in tests specifically. This more verbose factory method will hopefully convey that this is not to be used in production code. The nullptr constructor doesn't convey this and in fact it was used in production code here-and-there.	2026-03-03 14:09:27 +02:00
Artsiom Mishuta	5c84a76b28	test.py: setup pytest logger This commit introduces pure pytest logging into a file. Previously, test.py used pytest as a script (not a framework) and just captured pytest stdout and logged this data by itself. This commit sets up the log file format to additionally display the Python processName, threadName and taskName, because test.py test cases use them and it is currently hard to investigate issues connected with parallelism inside the test cases themselves. In addition, the commit splits the logging of different pytest workers (xdist) into different files. If a pytest worker has no failed tests, the log file for that worker will be deleted. There is also additional logging for failures: a separate file per test failure containing the error itself (stack trace) and all captured logs from stdout and stderr during the test run. With --save-log-on-success there will be a separate file per test on pass as well. All this new functionality works with the new xdist scheduler (--test-py-init=True). Fixes SCYLLADB-713 Closes scylladb/scylladb#28705	2026-03-03 11:49:01 +01:00
Dimitrios Symonidis	80b74d7df2	tablet options: Add max_tablet_count tablet option to enforce tablet count upper bounds Introduced a new max_tablet_count tablet option that caps the maximum number of tablets a table can have. This feature is designed primarily for backup and restore workflows. During backup, when load balancing is disabled for snapshot consistency, the current tablet count is recorded in the backup manifest. During restore, max_tablet_count is set to this recorded value, ensuring the restored table's tablet count never exceeds the original snapshot's tablet distribution. This guarantee enables efficient file-based SSTable streaming during restore, as each SSTable remains fully contained within a single tablet boundary. Closes scylladb/scylladb#28450	2026-03-03 11:19:24 +03:00
Calle Wilund	69f8e722bf	table::snapshot_table_on_all_shards: Use set to keep track of tablets in manifest Fixes: SCYLLADB-828 Avoid iterating linear set of tablets when building manifest. Reduces complexity. Closes scylladb/scylladb#28851	2026-03-03 08:09:33 +02:00
Karol Nowacki	30487e8854	index: fix vector index with filtering target column The secondary index mechanism is currently used to determine the target column. This mechanism works incorrectly for vector indexes with filtering because it returns the last specified column as the target (vectors) column. However, the syntax for a vector index requires the first column to be the target: ``` CREATE CUSTOM INDEX ON t(vectors, users) USING 'vector_index'; ``` This discrepancy eventually leads to the following exception when performing an ANN search on a vector index with filtering columns: ```` ANN ordering by vector requires the column to be indexed using 'vector_index' ```` This commit fixes the issue by introducing dedicated logic for vector indexes to correctly identify the target(vectors) column. Fixes: SCYLLADB-635 Closes scylladb/scylladb#28740	2026-03-02 18:47:58 +02:00
Sergey Zolotukhin	33923578eb	Update DROP INDEX statement documentation Clarify behavior of DROP INDEX during ongoing builds. Closes scylladb/scylladb#28659	2026-03-02 17:31:23 +02:00
Botond Dénes	ab532882db	tools/scylla-sstable: introduce scylla sstable split Split input sstable(s) into multiple output sstables based on the provided token boundaries. The input sstable(s) are divided according to the specified split tokens, creating one output sstable per token range. Fixes: SCYLLADB-10 Closes scylladb/scylladb#28741	2026-03-02 15:19:17 +01:00
Marcin Maliszkiewicz	d95939d69a	test: add more logs to test_startup_no_auth_response When test fails with assert connections_observed we would like to know if it was unable to connect or execute query in attempt_good_connection	2026-03-02 14:53:46 +01:00
Marcin Maliszkiewicz	91126eb2fb	test: decrease strain in test_startup_response For 2025.3 and 2025.4 this test runs order of magnitude slower in debug mode. Potentially due to passwords::check running in alien thread and overwhelming the CPU (this is fixed in newer versions). Decreasing the number of connections in test makes it fast again, without breaking reproducibility. As additional measure we double the timeout. The fix is now cherry-picked to master as sometimes test fails there too. (cherry picked from commit `1f1fc2c2ac`)	2026-03-02 14:46:51 +01:00
Botond Dénes	bf3edaf220	tools/scylla-sstable: filter_operation(): use deferred_close() to close reader Manual closing is bypassed with exceptions, promoting an exception to a crash due to unclosed reader. Closes scylladb/scylladb#28797	2026-03-02 14:16:08 +01:00
Marcin Maliszkiewicz	6bf706ef1b	Merge 'scylla-sstable: query: handle nested UDTs' from Botond Dénes The query (and in certain modes the write) operations uses virtual table facility inside `cql_test_env`. The schema of the sstable is created as a table in `cql_test_env`. This involves registering all UDTs with the keyspace, so they are available for lookups. This was done with a flat loop over all column types, but this is not enough. UDTs might be nested in other types, like collections. One has to do a traversal of the type tree and register every UDT on the way. This PR changes the flat loop to a recursive traversal of the type tree. The query operation now works with UDTs, no matter how deeply nested they are. Backport: Implements missing functionality of a tool, no backport. Closes scylladb/scylladb#28798 * github.com:scylladb/scylladb: tools/scylla-sstable: create_table_in_cql_env(): register UDTs recursively tools/scylla-sstable: generalize dump_if_user_type tools/scylla-sstable: move dump_if_user_type() definition	2026-03-02 14:14:43 +01:00
dependabot[bot]	f5fa77ac9d	build(deps): bump sphinx-multiversion-scylla in /docs Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.4 to 0.3.7. --- updated-dependencies: - dependency-name: sphinx-multiversion-scylla dependency-version: 0.3.7 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#28833	2026-03-02 14:13:03 +01:00
Karol Nowacki	647172d4b8	vector_search: fix names of private members According to coding style in Scylla, member variables are prefixed with underscore.	2026-03-02 14:08:16 +01:00
Karol Nowacki	f2308b000f	vector_search: remove unused global variable	2026-03-02 14:08:07 +01:00
Marcin Maliszkiewicz	a83ee6cf66	Merge 'db/batchlog_manager: re-add v1 support for mixed clusters' from Botond Dénes `3f7ee3ce5d` introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective. It did not introduce a cluster feature for the new table, because it is a node-local table, so the cluster can switch to the new table gradually, one node at a time. However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgraded nodes means that on rollback, the batches saved into the v2 table are lost. This PR re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus remain rollback-compatible. The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem. 
Fixes: https://github.com/scylladb/scylladb/issues/27886 Needs backport to 2026.1, to fix upgrades for clusters using batches Closes scylladb/scylladb#28736 * github.com:scylladb/scylladb: test/boost/batchlog_manager_test: add tests for v1 batchlog test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2 test/boost/batchlog_manager_test: fix indentation test/boost/batchlog_manager_test: extract prepare_batches() method test/lib/cql_assertions: is_rows(): add dump parameter tools/scylla-sstable: extract query result printers tools/scylla-sstable: add std::ostream& arg to query result printers repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log db/batchlog_manager: re-add v1 support db/batchlog_manager: return all_replayed from process_batch() db/batchlog_manager: process_bath() fix indentation db/batchlog_manager: make batch() a standalone function db/batchlog_manager: make structs stats public db/batchlog_manager: allocate limiter on the stack db/batchlog_manager: add feature_service dependency gms/feature_service: add batchlog_v2 feature	2026-03-02 12:09:10 +01:00
Patryk Jędrzejczak	ba7f314cdc	test: test_full_shutdown_during_replace: retry replace after the replacing node is removed from gossip The test is currently flaky with `reuse_ip = True`. The issue is that the test retries replace before the first replace is rolled back and the first replacing node is removed from gossip. The second replacing node can see the entry of the first replacing node in gossip. This entry has a newer generation than the entry of the node being replaced, and both replacing nodes have the same IP as the node being replaced. Therefore, the second replacing node incorrectly considers this entry as the entry of the node being replaced. This entry is missing rack and DC, so the second replace fails with ``` ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed: std::runtime_error (Cannot replace node 8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on a different data center or rack. Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2) ``` Fixes SCYLLADB-805 Closes scylladb/scylladb#28829	2026-03-02 10:26:57 +02:00
Yaron Kaikov	ab02486ce8	.github/workflows/trigger-scylla-ci.yaml: fix org membership check in trigger-scylla-ci workflow Following `becb48b586` it seems we have a regression with trigger CI logic The Verify Org Membership step used gh api /orgs/scylladb/members/$AUTHOR with GITHUB_TOKEN to check if the user is an org member. However, GITHUB_TOKEN does not have read:org scope, so the API call fails for all users — even actual scylladb org members — causing CI triggers to be silently skipped. Replace the API call with the author_association field from the GitHub event payload, which is set by GitHub itself and requires no special token permissions. This allows any scylladb org member (MEMBER or OWNER) to trigger CI via comment, regardless of whether they authored the PR. Closes scylladb/scylladb#28837	2026-03-02 10:23:14 +02:00
Michael Litvak	8c4bc33e51	test: remove test_view_building_with_tablet_move remove the test since it's not relevant anymore, it's not testing what it's supposed to test and it's unstable. the purpose of the test was to reproduce an issue in the legacy view builder where a view starts to build at token T2 and then all tokens [T1, end) with T1<T2 migrate to another node while it's still building, exposing an issue when the view builder wraparounds the token ring. this is not relevant anymore because now view building with tablets is done via the view building coordinator for tablets, and all views start to build from the first token with no wraparound. besides, the test is unstable due to relying too much on specific timing, which was useful for investigating and fixing the original issue but not anymore. Fixes SCYLLADB-842 Closes scylladb/scylladb#28842	2026-03-02 07:42:08 +01:00
Marcin Maliszkiewicz	8c2da76fde	test/cqlpy: remove xfail from test_constant_function_parameter The issue was fixed by commit `cc03f5c89d` ("cql3: support literals and bind variables in selectors"), so the xfail marker is no longer needed. Closes scylladb/scylladb#28776	2026-03-01 20:03:42 +02:00
Jenkins Promoter	fb6eebc383	Update pgo profiles - aarch64	2026-03-01 05:17:44 +02:00
Jenkins Promoter	8edd532c05	Update pgo profiles - x86_64	2026-03-01 04:31:57 +02:00
Botond Dénes	1f09fcfb26	Merge 'Use standard ks/cf/data creation methods in test_restore_with_streaming_scopes' from Pavel Emelyanov The test uses create_dataset helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities. Also the PR simplifies the 3-level nested loops used to combine several sets of restoration parameters by using itertools.product facility. Continuation of #28600. Cleaning tests, not backporting Closes scylladb/scylladb#28608 * github.com:scylladb/scylladb: test/object_store: Use itertools.product() for deeply nested loops test/object_store: Replace dataset creation usage with standard methods test/object_store: Shift indentation right for test_restore_with_streaming_scopes	2026-02-27 16:15:55 +02:00
Avi Kivity	450a09b152	test: tools: restrict embedded perf tests from taking over host The perf-simple-query tests were not restricted on CPU count, so on a 96-CPU machine, they would run on 96 CPUs, and time out in debug mode. All restrict memory usage and add --overprovisioned so that pinning is disabled. Apply that to all tests. Closes scylladb/scylladb#28821	2026-02-27 16:06:22 +02:00
Botond Dénes	d3a3921487	Merge 'Re-use and improve the take_snapshot() helper in backup tests' from Pavel Emelyanov The helper is very simple yet generic -- it takes a snapshot of a keyspace on all servers and collects the resulting sstables from workdirs. Re-using it in all test cases saves some lines of code. Also, the method is "sequential", making it "parallel" reduces the waiting time a bit. Will help generalizing existing backup/restore tests to support clustered snapshot/backup/restore API (see #28525) later. Cleaning up tests, not backporting. Closes scylladb/scylladb#28660 * github.com:scylladb/scylladb: test/backup: Run keyspace flush and snapshot taking API in parallel test/backup: Re-use take_snapshot() helper in do_abort_restore() test/backup: Move take_snapshot() helper up	2026-02-27 15:58:18 +02:00
Łukasz Paszkowski	fb40d1e411	compaction_manager: improve readability of maybe_wait_for_sstable_count_reduction() Split the chained inject_parameter().value_or() expression into two separate named variables for clarity. Use condition_variable::when() instead of wait(). when() is the coroutine-native API, designed specifically for co_await contexts — it avoids the heap allocation of a promise_waiter that wait() uses, and instead uses a stack-based awaiter. Closes scylladb/scylladb#28824	2026-02-27 15:51:30 +02:00
Patryk Jędrzejczak	9a9202c909	Merge 'Remove gossiper topology code' from Gleb Natapov The PR removes most of the code that assumes that group0 and raft topology is not enabled. It also makes sure that joining a cluster in no raft mode or upgrading a node in a cluster that not yet uses raft topology to this version will fail. Refs #15422 No backport needed since this removes functionality. Closes scylladb/scylladb#28514 * https://github.com/scylladb/scylladb: group0: fix indentation after previous patch raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream raft_group0: remove unused code from raft_group0 node_ops: remove topology over node ops code topology: fix indentation after the previous patch topology: drop topology_change_enabled parameter from raft_group0 code storage_service: remove unused handle_state_* functions gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option storage_service: fix indentation after the last patch storage_service: remove gossiper bootstrapping code storage_service: drop get_group_server_if_raft_topolgy_enabled storage_service: drop is_topology_coordinator_enabled and its uses storage_service: drop run_with_api_lock_in_gossiper_mode_only topology: remove code that assumes raft_topology_change_enabled() may return false test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode test: schema_change_test: drop schema tests relevant for no raft mode only topology: remove upgrade to raft topology code group0: remove upgrade to group0 code group0: refuse to boot if a cluster is still is not in a raft topology mode storage_service: refuse to join a cluster in legacy mode	2026-02-27 14:43:41 +01:00
Anna Stuchlik	dfd46ad3fb	doc: add the upgrade guide from 2025.x to 2026.1 This commit adds the upgrade guide for version 2026.1. According to the new upgrade policy, the user can now upgrade to the major version (2026.1) from any previous minor version. So instead of adding a separate guide from 2025.4 to 2026.1, we need a guide from 2025.x to 2026.1. In addition, this commit: - Updates the upgrade policy to reflect the above change. - Removes the upgrade guides for the previous version. Fixes https://github.com/scylladb/scylladb/issues/28533 Fixes https://github.com/scylladb/scylladb/issues/28532 Closes scylladb/scylladb#28789	2026-02-27 15:36:34 +02:00
Botond Dénes	fcc570c697	Merge 'Exorcise assertions from Alternator, using a new throwing_assert() macro' from Nadav Har'El assert(), and SCYLLA_ASSERT() are evil (Refs #7871) because they can cause the entire Scylla cluster to crash mysteriously instead of cleanly failing the specific request that encountered a serious problem of failed pre-requisite. In this two-patch series, in the first patch we introduce a new macro throwing_assert(), a convenient drop-in replacement for SCYLLA_ASSERT() but which has all the benefits of on_internal_error() instead of the dangers of SCYLLA_ASSERT(). In the second patch we use the new function to replace every call to SCYLLA_ASSERT() in Alternator by the new throwing_assert(). Here is an example from the second patch to demonstrate the power of this approach: The Alternator code uses the attrs_column() function to retrieve the ":attrs" column of a schema. Since every Alternator table always has an ":attrs" column in its schema, we felt safe to SCYLLA_ASSERT() that this column exists. However, imagine that one day because of a bug, one Alternator table is missing this column. Or maybe not a bug - maybe a malicious user on a shared cluster found a way to deliberately delete this column (e.g, with a CQL command!) and this check fails. Before this patch, the entire Scylla node will crash. If the same request is sent to all nodes - the entire cluster will crash. The user might not even know which request caused this crash. In contrast, after this patch, the specific operation - e.g., PutItem - will get an exception. Only this operation, and nothing else, will be aborted, and the user who sent this request will even get an "Internal Server Error" with the assertion-failure message, alerting them that this specific query is causing problems, while other queries might work normally. 
There's no need to backport this patch - unless it becomes annoying that other branches don't have the throwing_assert() function and we want it to ease other backports. Fixes #28308. Closes scylladb/scylladb#28445 * github.com:scylladb/scylladb: alternator: replace SCYLLA_ASSERT with throwing_assert utils: introduce throwing_assert(), a safe replacement for assert	2026-02-27 15:35:36 +02:00
Roy Dahan	822c1597c9	install.sh: fix REST API paths for nonroot installations In nonroot installations, the install.sh script was hardcoding the api_ui_dir and api_doc_dir paths to /opt/scylladb/ in scylla.yaml, even though the actual files were installed to a different location (typically ~/scylladb). This caused REST API endpoints like /api-doc/failure_detector/ to fail with "transfer closed with outstanding read data remaining" error because Scylla couldn't find the API documentation files at the configured paths. Fix this by using the $prefix variable instead of hardcoded /opt/scylladb/ paths. This ensures that: - In regular installations: $prefix = /opt/scylladb (no change) - In nonroot installations: $prefix = ~/scylladb (paths now correct) Fixes: SCYLLADB-721 Backport: The hardcoded paths in install.sh have been present since the nonroot installation feature was introduced, making REST API endpoints non-functional in all nonroot installations across all live versions of Scylla. Closes scylladb/scylladb#28805	2026-02-27 15:32:54 +02:00
Botond Dénes	9521a51e4c	Merge 'generic_server: scale connection concurrency semaphore by listener count' from Marcin Maliszkiewicz The concurrency semaphore gates uninitialized connections across all do_accepts loops, but was initialized to a fixed value regardless of how many listeners exist. With multiple listeners competing for the same units, each effectively gets less than the configured concurrency. Initialize the semaphore to concurrency - 1 and signal 1 per listen() call, so total capacity is concurrency - 1 + nr_listeners. This guarantees each listener's accept loop can have at least one unit available. It mainly fixes problem when setting uninitialized_connections_semaphore_cpu_concurrency config value to 1 would result in not being able to process connections, as only 1 out of 2 listeners got the semaphore. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-762 Backport: no, it's a minor problem Closes scylladb/scylladb#28747 * github.com:scylladb/scylladb: test: add test_uninitialized_conns_semaphore generic_server: fix waiters count in shed log generic_server: scale connection concurrency semaphore by listener count	2026-02-27 15:06:50 +02:00
Taras Veretilnyk	5bbc44ed12	sstables: replace rewrite_statistics with new rewrite component mechanism This commit migrates all callers that used rewrite_statistics to the new rewrite component mechanism.	2026-02-26 22:38:55 +01:00
Taras Veretilnyk	51c345aaf6	sstables: add new rewrite component mechanism for safe sstable component rewriting Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allows crash recovery by simply removing the temporary file on startup. However, this approach won't work once component digests are stored in scylla_metadata, as replacing a component like Statistics will require atomically updating both the component and scylla_metadata with the new digest—impossible with POSIX rename. The new mechanism creates a clone sstable with a fresh generation: - Hard-links all components from the source except the component being rewritten and scylla metadata if update_sstable_id is true - Copies original sstable components pointer and recognized components from the source - Invokes a modifier callback to adjust the new sstable before rewriting - Writes the modified component. If update_sstable_id is true, reads scylla metadata, generates new sstable_id and rewrites it. - Seals the new sstable with a temporary TOC - Replaces the old sstable atomically, the same way as it is done in compaction This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair). In case of any failure during the whole process, sstable will be automatically deleted on the node startup due to temporary toc persistence. This prepares the infrastructure for component digests. Once digests are introduced in scylla_metadata this mechanism will be extended to also rewrite scylla metadata with the updated digest alongside the modified component, ensuring atomic updates of both.	2026-02-26 22:38:55 +01:00
Taras Veretilnyk	4aa0a3acf9	compaction: add compaction_group_view method to specify sstable version Add make_sstable() overload that accepts sstable_version_types parameter to compaction_group_view interface and all implementations. This will be useful in rewrite component mechanism, as we need to preserve sstable version when creating the new one for the replacement.	2026-02-26 22:38:55 +01:00
Taras Veretilnyk	16ea7a8c1c	sstables: add null_data_sink and serialized_checksum for checksum-only calculation Introduce a null_data_sink and make_digest_calculator implementation that discards all writes, enabling checksum calculation without file I/O. This allows the existing checksummed_file_writer to be used for digest computation without writing data to disk. This will be used in a future commit to calculate the scylla metadata component checksum before writing it to disk, allowing the component to store its own checksum.	2026-02-26 22:38:51 +01:00
Łukasz Paszkowski	bb57b0f3b7	compaction_manager: fix maybe_wait_for_sstable_count_reduction() hanging forever The futurization refactoring in `9d3755f276` ("replica: Futurize retrieval of sstable sets in compaction_group_view") changed maybe_wait_for_sstable_count_reduction() from a single predicated wait: ``` co_await cstate.compaction_done.wait([..] { return num_runs_for_compaction() <= threshold || !can_perform_regular_compaction(t); }); ``` to a while loop with a predicated wait: ``` while (can_perform_regular_compaction(t) && co_await num_runs_for_compaction() > threshold) { co_await cstate.compaction_done.wait([this, &t] { return !can_perform_regular_compaction(t); }); } ``` This was necessary because num_runs_for_compaction() became a coroutine (returns future<size_t>) and can no longer be called inside a condition_variable predicate (which must be synchronous). However, the inner wait's predicate — !can_perform_regular_compaction(t) — only returns true when compaction is disabled or the table is being removed. During normal operation, every signal() from compaction_done wakes the waiter, the predicate returns false, and the waiter immediately goes back to sleep without ever re-checking the outer while loop's num_runs_for_compaction() condition. This causes memtable flushes to hang forever in maybe_wait_for_sstable_count_reduction() whenever the sstable run count exceeds the threshold, because completed compactions signal compaction_done but the signal is swallowed by the predicate. Fix by replacing the predicated wait with a bare wait(), so that any signal (including from completed compactions) causes the outer while loop to re-evaluate num_runs_for_compaction(). Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610 Closes scylladb/scylladb#28801	2026-02-26 20:13:50 +02:00
Marcin Maliszkiewicz	a03ebe1a29	Merge 'cql: implement a new per-row TTL feature' from Nadav Har'El This series implements a new per-row TTL feature for CQL. The per-row TTL feature was requested in issue #13000. It is a feature that does not exist in Cassandra, and was inspired by DynamoDB's TTL feature - and under the hood uses the same implementation that we used in Alternator to implement this DynamoDB feature. The new per-row TTL feature is completely separate from CQL's existing per-write (and per-cell) TTL, and both will be available to users. In the per-row TTL feature, one column in the table is designated as the "TTL" column, and its value for a row is the expiration time for that row. The TTL column can be designated at table creation time, e.g.: ```cql CREATE TABLE tab ( id int PRIMARY KEY, t text, expiration timestamp TTL ); ``` Or after the table already exists with: ```cql ALTER TABLE tab TTL expiration ``` Expiration can also be disabled, with: ```cql ALTER TABLE tab TTL NULL ``` The new per-row TTL feature has two features that users have been asking for: 1. A user can change the value of just the TTL column - without rewriting the entire row - to change the expiration time of the entire row. 2. When an expired row is finally deleted, a CDC event about this deletion appears in the CDC log (if CDC is enabled), including - if a preimage is enabled - the content of the deleted row. To achieve the second goal (CDC events), a row is not guaranteed to disappear at exactly its expiration time (as CQL's original TTL feature guarantees). Rather, the row is deleted some time later, depending on `alternator_ttl_period_in_seconds`; Until the actual deletion, the row is still readable (and even writable). But we are guaranteed that when the row is finally deleted, the CDC event will come too. The implementation uses the same background thread used by Alternator to periodically scan for expired items and delete them. 
The expiration thread keeps the same metrics as it did for Alternator: * `scylla_expiration_scan_passes` * `scylla_expiration_scan_table` * `scylla_expiration_items_deleted` * `scylla_expiration_secondary_ranges_scanned` The series begins with a few small preparation patches, followed by the main part of the feature (which isn't big, since we are just enabling the pre-existing Alternator expiration machinery for CQL) and finally 30 tests (single-node and multi-node tests) and documentation. This series is a new feature, so traditionally would not be backported. However, I wouldn't be surprised if we are asked to backport it so that customers will not need to wait for a new major release. Fixes #13000 Closes scylladb/scylladb#28320 * github.com:scylladb/scylladb: test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY docs/cql: document the new CQL per-row TTL feature test/cluster: tests for the new CQL per-row TTL feature test/cqlpy: tests for the new CQL per-row TTL feature test: set low alternator_ttl_period_in_seconds in CQL tests cql ttl: fix ALTER TABLE to disable TTL if column is dropped cql ttl: add setting/unsetting of TTL column to ALTER TABLE cql ttl: add TTL column support to CREATE TABLE and DESC TABLE ttl: add CQL support to Alternator's TTL expiration service alternator ttl: move TTL_TAG_KEY to a header file alternator ttl: remove unnecessary check of feature flag cql: add "cql_row_ttl" cluster feature alternator: fix error message if UpdateTimeToLive is not supported	2026-02-26 15:29:12 +01:00
Ferenc Szili	f22d75a57e	load_stats: add filtering for tablet sizes This patch adds filtering for tablet sizes collected in load_stats. This is needed to improve the chances that the balancer will have all the tablet sizes for the node, and that way avoid having the node ignored during balancing.	2026-02-26 15:17:39 +01:00
Marcin Maliszkiewicz	4d0f1bf5c9	conf: improve rf_rack_valid_keyspaces documentation in scylla.yaml Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-761 Closes scylladb/scylladb#28738	2026-02-26 14:34:28 +01:00
Yaniv Michael Kaul	ead9961783	cql: vector: fix vector dimension type Switch vector dimension handling to fixed-width `uint32_t` type, update parsing/validation, and add boundary tests. The dimension is parsed as `unsigned long` at first which is guaranteed to be at least 32-bit long, which is safe to downcast to `uint32_t`. Move `MAX_VECTOR_DIMENSION` from `cql3_type::raw_vector` to `cql3_type` to ensure public visibility for checks outside the class. Add tests to verify the type boundaries. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-223 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com> Closes scylladb/scylladb#28762	2026-02-26 14:46:53 +02:00
Ferenc Szili	d0a5a1d5d0	load_stats: move tablet filtering for table size computation This patch moves the table size tablet filtering code from a lambda in storage_service::load_stats_for_tablet_based_tables() to the code section where it will be used: tablet_storage_group_manager::table_load_stats() This is needed to better accommodate the next commit, which will add code for filtering tablets for tablet sizes.	2026-02-26 11:07:53 +01:00
Ferenc Szili	9cd2a04e79	load_stats: bring the comment and code in sync Table sizes are collected in load stats, and are filtered according to the migration stage, so as to avoid double accounting of tablet sizes. The comment which explains the cut-off migration stage (cleanup) and the cut-off stage in the code (streaming) are not the same. This patch fixes that.	2026-02-26 11:03:33 +01:00
Michael Litvak	4a60ee28a2	test/cqlpy/test_materialized_view.py: increase view build timeout The test test_build_view_with_large_row creates a materialized view and expects the view to be built with a timeout of 5 seconds. It was observed to fail because the timeout is too short on slow machines. Increase the timeout to 60 seconds to make the test less flaky on slow machines. Similarly for the other tests in the file that have a timeout for view build, increase the timeout to 60 seconds to be consistent and safer. Fixes SCYLLADB-769 Closes scylladb/scylladb#28817	2026-02-26 11:27:51 +02:00
Michał Hudobski	579ed6f19f	secondary_index_manager: fix double registration bug We have observed a bug that caused Scylla to crash due to metrics double registration. This bug is really difficult to reproduce and was seen only once in the wild. We think that it may be caused by an in-flight request keeping a reference to the stats object, so that it is not deregistered when the index is dropped, which causes a double registration when we recreate the index; however, we are not 100% sure. This patch makes it so the metrics always get deregistered when we drop the index, which should fix the double registration bug. Fixes: #27252 Closes scylladb/scylladb#28655	2026-02-26 09:39:53 +01:00
Marcin Maliszkiewicz	30f18a91fd	Merge 'dtest: wait_for speedup' from Dario Mirovic Audit tests have been slow. They rely on wait_for function. This function first sleeps for the duration of the time step specified, and then calls the given function. The audit tests need 0.02-0.03 seconds for the given function, but the operation lasts around 1.02-1.03 seconds, since step is 1 second. This patch modifies wait_for dtest function so it first executes the given function, and afterwards calls time.sleep(step). This reduces time needed for the given function from 1.03 to 0.03 seconds. Total audit tests suite speedup is 3x. On the developer machine the time is reduced from 13+ minutes to 4 minutes. This patch also improves performance of some alternator tests that use the same wait_for dtest function. `wait_for` in dtest framework has default time step reduced to make the environment more responsive and test execution faster. Refs SCYLLADB-573 This is a performance improvement of testing framework. No need to backport. Closes scylladb/scylladb#28590 * github.com:scylladb/scylladb: dtest: shorten default sleep step in wait_for dtest: wait_for speedup	2026-02-26 09:33:38 +01:00
Amnon Heiman	5db971c2f9	estimated_histogram_test.cc: add to_metrics_histogram test Add a test that exercises to_metrics_histogram when Min is smaller than Precision. The test verifies duplicate integer bounds are collapsed, counts remain cumulative, and native histogram metadata is still present with the expected schema and min id. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-02-26 09:00:52 +02:00
Amnon Heiman	0b4f28ae21	histogram_metrics_helper.hh: Support Min < Precision to_metrics_histogram now collapses duplicate integer bucket bounds caused by Min less than Precision scaling while always keeping native histogram metadata. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-02-26 09:00:38 +02:00
Amnon Heiman	af6371c11f	estimated_histogram_test.cc: Add tests for approx_exponential_histogram with Min<Precision Add four test cases to verify the hybrid linear/exponential bucketing: - test_histogram_min_1_bucket_limits: Validates bucket lower limits - test_histogram_min_1_basic: Tests value insertion and bucket distribution - test_histogram_min_1_statistics: Tests min(), max(), quantile(), and mean() - test_histogram_min_2_precision_4: Tests Min == 2 with Precision 4. These tests cover the new Min<Precision mode with Precision=4, verifying both the linear range and exponential range. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-02-26 08:53:06 +02:00
Amnon Heiman	6c21e5f80c	estimated_histogram.hh: support Min less than Precision histograms approx_exponential_histogram is a pseudo-exponential histogram implementation that can insert and retrieve values into and from buckets in O(1) time. The implementation uses power-of-two ranges and splits them linearly into buckets. The number of buckets per power-of-two range is called Precision. The original implementation, aimed at covering large value ranges, had a limitation: the histogram Min value had to be greater than or equal to Precision. As a result, code that needs histograms for small integer values could not use this implementation efficiently. This change addresses that gap by handling the case where Min is less than Precision. For Min smaller than Precision, the value is scaled by a power-of-two factor during indexing so the existing exponential math can be reused without runtime branching. Bucket limits are scaled back to the original units, which can lead to repeated bucket limits in the Min to Precision range for integer values. Example with Min 2 and Precision 4: Buckets 2, 2, 3, 3, 4, 5, 6, 7, 8, 10, 12, 14, and so on. Implementation details: - Introduce SHIFT based on log2(Precision) minus log2(Min) when positive - Scale Min and Max by SHIFT for all exponential calculations - Compute NUM_BUCKETS using the standard log2(Max / Min) formula - Use the scaled value in find_bucket_index to avoid fractional bucket steps - Return bucket limits by scaling back to original units - Constraint relaxed from Min greater than or equal to Precision to allow any Min less than Max, still a power of two. This change maintains backward compatibility with existing histograms while enabling efficient tracking of small integer values. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-02-26 00:46:14 +02:00
Amnon Heiman	aca5284b13	test(alternator): add per-table latency coverage for item and batch ops Add missing tests for per-table Alternator latency metrics to ensure recent per-table latency accounting is actually validated. Changes in this patch: Refactor latency assertion helper into check_sets_latency_by_metric(), parameterized by metric name. Keep existing behavior by implementing check_sets_latency() as a wrapper over scylla_alternator_op_latency. Add test_item_latency_per_table() to verify scylla_alternator_table_op_latency_count increases for: PutItem, GetItem, DeleteItem, UpdateItem, BatchWriteItem, and BatchGetItem. This closes a test gap where only global latency metrics were checked, while per-table latency metrics were not covered. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-02-25 20:51:18 +02:00
Amnon Heiman	29e0b4e08c	alternator: track per-table latency for batch get/write operations Batch operations were updating only global latency histograms, which left table-level latency metrics incomplete. This change computes request duration once at the end of each operation and reuses it to update both global and per-table latency stats. Latencies are stored per table used. This aligns batch read/write metric behavior with other operations and improves per-table observability. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-02-25 20:51:18 +02:00
Yaron Kaikov	b211590bc0	.github/workflows: enable automatic backport PR creation with Jira sub-issue integration This workflow calls the reusable backport-with-jira workflow from scylladb/github-automation to enable automatic backport PR creation with Jira sub-issue integration. The workflow triggers on: - Push to master/next-/branch- branches (for promotion events) - PR labeled with backport/X.X pattern (for manual backport requests) - PR closed/merged on version branches (for chain backport processing) Features enabled by calling the shared workflow: - Creates Jira sub-issues under the main issue for each backport version - Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2) - Cherry-picks from previous version branch to avoid repeated conflicts - On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR Closes scylladb/scylladb#28804	2026-02-25 16:39:17 +02:00
Nadav Har'El	1d265e7d6d	test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY While adding the new syntax "TTL" to CREATE TABLE, I noticed that the parser actually allows a column to be defined as "STATIC PRIMARY KEY". So I add here a small test to verify that this is not really allowed: The syntax "c int STATIC PRIMARY KEY" is accepted, but then rejected by a later check. The syntax "c int PRIMARY KEY STATIC" is rejected as a syntax error. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:45 +02:00
Nadav Har'El	34c0c64d9d	docs/cql: document the new CQL per-row TTL feature Add user-facing documentation for the new CQL per-row TTL feature, in docs/cql/cql-extensions.md. Also mention (and link) the new alternative TTL feature in a few relevant documents about the old (per-write) TTL, about CDC, and about the CREATE TABLE and ALTER TABLE commands. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:44 +02:00
Nadav Har'El	23ad0be034	test/cluster: tests for the new CQL per-row TTL feature The previous patch added single-node functional tests (in test/cqlpy) for everything which was possible to test on a single node. In this patch we add four tests that we couldn't test on a single node, using the test/cluster test framework: 1. Test that the TTL expiration work - both the scanning threads and the actual deletion work on all nodes - happens on the "streaming" scheduling group. 2. Test that even if one of the cluster's nodes is down, still all the items get expired - another node "takes over" the dead node's work. 3. Test that rolling upgrade works as designed for the CQL per-row TTL feature: Before every single node in the cluster is upgraded to support this feature, a TTL column cannot be enabled on a table. And as soon as the last node of the cluster is upgraded, the TTL feature begins to work completely (you don't need to reboot all the nodes again). 4. Test that expiration works correctly on a multi-DC setup. The test doesn't check the efficiency of this process - i.e., that today each DC scans part of the data, reading with LOCAL_QUORUM, and writing the deletions across the entire cluster. Rather, the test only verifies the correctness - that expired rows do get deleted - for the usual case the data across the DCs is consistent. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:44 +02:00
Nadav Har'El	7a1351c6cf	test/cqlpy: tests for the new CQL per-row TTL feature This patch contains 27 functional tests (in the test/cqlpy framework) for the new CQL per-row TTL feature. The tests cover the TTL column configuration statements (CREATE TABLE, ALTER TABLE) as well as the actual item expiration or non-expiration depending on the value of the expiration-time column - and also CDC events generated on expiration and the metrics generated by the expiration process. These tests were written together with the code, as in "test-driven development", so they aim to cover every corner case considered during the development, and they reproduce every bug and misstep seen during the development process. As a result, they hopefully achieve very high code coverage - but since we don't have a working code-coverage tool, I can't report any specific code coverage numbers. These tests check everything which we can check on single-node cluster. The next patch will add additional multi-node tests for things we can't check here with a single node - such as the scheduling group used by the distributed work, the effect of dead nodes on the TTL functionality, and the process of rolling upgrade. The tests in this patch do NOT try to stress the background expiration scanning threads, or to check how they handle topology changes, large amounts of data or clusters spanning multiple DCs. These tests also don't test the performance impact of these scanning threads. Because the expiration scanning thread is identical to the one already used by Alternator TTL, we assume that many of these aspects were already tested for Alternator TTL and did not change when the same implementation is used for the new CQL feature. All new tests pass on ScyllaDB. Because the per-row TTL feature is a new ScyllaDB feature that does not exist on Cassandra, all these tests are skipped on Cassandra. Because some of these tests involve waiting for expiration, they can't be very quick. 
Still, because we set alternator_ttl_period_in_seconds to 0.5 seconds in the test framework, all 27 tests running sequentially finish in roughly 6 seconds total, which we consider acceptable. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:44 +02:00
Nadav Har'El	154cecda71	test: set low alternator_ttl_period_in_seconds in CQL tests In test/alternator/run we set alternator_ttl_period_in_seconds to a very low number (0.5 seconds) to allow TTL tests to expire items very quickly and finish quickly. Until now, we didn't need to do this for CQL tests, because they weren't using this Alternator-only feature. Now that CQL uses the same expiration feature with its original configuration parameter, we need to set it in CQL tests too. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:43 +02:00
Nadav Har'El	eebd7b0fbc	cql ttl: fix ALTER TABLE to disable TTL if column is dropped If "ALTER TABLE tab DROP x" is done to delete column x, and column x was the designated TTL column, then the per-row TTL feature should be disabled on this table. If we don't do this, the expiration scanner will continue to scan the table trying to read the dropped column - which will be wasteful or worse. A test for this case is also included in test/cqlpy/test_ttl_row.py in a later patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:43 +02:00
Nadav Har'El	acbdf637b6	cql ttl: add setting/unsetting of TTL column to ALTER TABLE The previous patch added the ability in CREATE TABLE to designate one of the regular columns as a "TTL column", to be used by the per-row TTL feature (Refs #13000). In this patch we add to ALTER TABLE the ability to enable per-row TTL on an existing table with a given column as the TTL column: ALTER TABLE tab TTL colname and also the ability to disable per-row TTL with: ALTER TABLE tab TTL NULL As in CREATE TABLE, the designated TTL column must be a regular column (it can't be a primary key column or a static column), and must have one of the types timestamp, bigint or int. You can't enable per-row TTL if it is already enabled, or disable it if it is already disabled. To change the TTL column on an existing table, you must first disable TTL, and then re-enable it with the new column. A large collection of functional tests (in test/cqlpy), for every detail of this patch, will come in a later patch in this series. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:43 +02:00
Nadav Har'El	22c79b6af8	cql ttl: add TTL column support to CREATE TABLE and DESC TABLE This patch enables the per-row TTL feature in CQL (Refs #13000). This patch allows the user to create a new table with one of its columns designated as the TTL column with a syntax like: CREATE TABLE tab ( id int PRIMARY KEY, t text, expiration timestamp TTL ); The column marked "TTL" must have the "timestamp", "bigint" or "int" types (the choice of these types was explained in the previous patch), and there can only be one such column. We decided not to allow a column to be both a primary key column and a TTL column - although it would have worked (it's supported in Alternator), I considered this non-useful and confusing, and decided not to allow it in CQL. A TTL column also can't be a static column. We save the information of which column is the TTL column in a tag which is read by the "expiration service" - originally a part of Alternator's TTL implementation. After the previous patch, the expiration service is running and knows how to understand CQL tables, so the CQL per-row TTL feature will start to work. This patch also implements DESC TABLE, printing the word "TTL" in the right place of the output. This patch doesn't yet implement ALTER TABLE that should allow enabling or disabling the TTL column setting on an existing table - we'll do that in the next patch. A large collection of functional tests (in test/cqlpy), for every detail of this feature will be added in a later patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:42 +02:00
Nadav Har'El	e636bc39ad	ttl: add CQL support to Alternator's TTL expiration service The Alternator TTL feature uses an "expiration service", a background thread on each shard which periodically scans for expired items and deletes them. When writing the expiration service, we already anticipated that the day will come that we'll want to use it for CQL too. Well, now that we want to use it for CQL, we only need to make two changes: 1. Before this patch, the expiration service was only started if Alternator was enabled. Now we need to start it unconditionally, as both Alternator and CQL will need to use it. The performance impact of the new background threads, when not needed, should be minimal: These threads will wake up every alternator_ttl_period_in_seconds (by default - once a day) and just check if any table has per-row TTL enabled, and if not, do nothing. 2. Before this patch, the expiration-time column had to be of type "decimal" - a variable-precision floating-point type. This made sense in Alternator - where all numbers are of this type, but CQL offers better and more efficient types for this purpose. In this patch we add support for two additional types for the expiration time column: The "timestamp" type (which uses millisecond precision, which our implementation truncates to whole seconds) and for the "bigint" type storing a number of seconds since the UNIX epoch. We also support the smaller "int" type for compatibility with existing data, but it is not recommended because a signed 32-bit integer counting time from 1970 will break in 2038. After this patch, the expiration service supports CQL tables, but there is nothing yet that can enable it on CQL tables - i.e., nothing that sets the appropriate tag on the table to tell the expiration service which column is the expiration-time column. We'll add new syntax to do this in the next patch. At the moment, we leave the expiration service implementation in its existing location - alternator/ttl.cc. 
This is despite the fact that we now start it and use it also for CQL. For better modularity, we should probably later move the expiration service implementation to a separate module (directory). Similarly, the expiration service's period is still configured via alternator_ttl_period_in_seconds, which is now a misnomer because it also affects CQL. Later we can rename this configuration parameter, or alternatively, consider different scan periods for different tables and table types, and have separate configuration for Alternator TTL and CQL per-row TTL. The metrics kept by the expiration service are the same metrics existing for Alternator TTL, and fortunately do not have the name "alternator" in their name: * scylla_expiration_scan_passes * scylla_expiration_scan_table * scylla_expiration_items_deleted * scylla_expiration_secondary_ranges_scanned Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:42 +02:00
Nadav Har'El	2823780557	alternator ttl: move TTL_TAG_KEY to a header file TTL_TAG_KEY stores the name of the tag in which we store the name of the table's expiration-time column, for Alternator's TTL feature. We already need this name in two source files, and soon we'll need it in more files - as we want to use the same implementation also for a new per-row TTL feature in CQL. So it's time to move the declaration of this variable to a new header file - alternator/ttl_tag.hh. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:42 +02:00
Nadav Har'El	5e16c59312	alternator ttl: remove unnecessary check of feature flag Every node that supports the Alternator TTL feature should start its background expiration-checking thread, without checking if other nodes support this feature. This patch removes the unnecessary check. Indeed, until all other nodes enable this feature, the background thread will have nothing to do, but when finally all nodes have this feature - we need this thread to already be on - without requiring another reboot of all nodes to start this thread. In practice, this change won't change anything on modern installations because this feature is already three years old and always enabled on modern clusters. But I don't want to repeat the same mistake for the new CQL per-row TTL feature, so better fix it in Alternator too. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:41 +02:00
Nadav Har'El	4f4e93b695	cql: add "cql_row_ttl" cluster feature This patch adds a new cluster feature "CQL_ROW_TTL", for the new CQL per-row TTL feature. With this patch, this node reports supporting this feature, but the CQL per-row TTL feature can only be used once all the nodes in the cluster support the feature. In other words, user requests to enable per-row TTL on a table should check this feature flag (on the whole cluster) before proceeding. This is needed because the implementation of the per-row-TTL expiration requires the cooperation of all nodes to participate in scanning for expired items, so the feature can't be trusted until all nodes participate in it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:41 +02:00
Nadav Har'El	0d6b7a6211	alternator: fix error message if UpdateTimeToLive is not supported Since commit `2dedb5ea75`, the Alternator TTL feature is no longer experimental. It is still a "cluster feature" meaning it cannot be used on a partially-upgraded cluster until the entire cluster supports this feature. The error message we printed when the cluster doesn't support this feature was outdated, referring to the no-longer-existing experimental feature. So this patch fixes the error message. Since this feature is already three years old, nobody is likely to ever see this error message (it can be seen only by someone upgrading an even older cluster, during the rolling upgrade), but better not have wrong error messages in the code, even if it's not seen by users. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:59:41 +02:00
Nadav Har'El	b78bb914d7	alternator: replace SCYLLA_ASSERT with throwing_assert Replace all calls to SCYLLA_ASSERT() in Alternator with the better and safer throwing_assert() introduced in the previous patch. As a result of this patch, if one of the call sites for these asserts is buggy and ever fails, only the involved operation will be killed by an exception, instead of crashing the whole server - and often the entire cluster (as the same buggy request reaches all nodes and crashes them all). Additionally, this patch replaces a few existing uses in Alternator of on_internal_error() with a non-interesting message with a more-or-less equivalent, but shorter, throwing_assert(). The idea is to convert the verbose idiom: if (!condition) { on_internal_error(logger, "some error message"); } into the shorter throwing_assert(condition); Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:58:47 +02:00
Nadav Har'El	d876e7cd0a	utils: introduce throwing_assert(), a safe replacement for assert This patch introduces throwing_assert(cond), a better and safer replacement for assert(cond) or SCYLLA_ASSERT(cond). It aims to eventually replace all assertions in Scylla and provide a real solution to issue #7871 ("exorcise assertions from Scylla"). throwing_assert() is based on the existing on_internal_error() and inherits all its benefits, but brings with it the convenience of assert() and SCYLLA_ASSERT(): No need for a separate if(), new strings, etc. For example, you can write just one line of throwing_assert(): throwing_assert(p != nullptr); Instead of the much more verbose on_internal_error: if (p == nullptr) { utils::on_internal_error("assertion failed: p != nullptr"); } Like assert() and SCYLLA_ASSERT(), in our tests throwing_assert() dumps core on failure. But its advantage over the other assertion functions becomes clear in production: * assert() is compiled-out in release builds. This means that the condition is not checked, and the code after the failed condition continues to run normally, with potentially disastrous consequences. In contrast, throwing_assert() continues to check the condition even in release builds, and if the condition is false it throws an exception. This ensures that the code following the condition doesn't run. * SCYLLA_ASSERT() in release builds checks the condition and crashes Scylla if the condition is not met. In contrast, throwing_assert() doesn't crash, but throws an exception. This means that the specific operation that encountered the error is aborted, instead of the entire server. It often also means that the user of this operation will see this error somehow and know which operation failed - instead of encountering a mysterious server (or even whole-cluster) crash without any indication which operation caused it. Another benefit of throwing_assert() is that it logs the error message (and also a backtrace!) 
to Scylla's usual logging mechanisms - not to stderr, where assert and SCYLLA_ASSERT write and where users sometimes can't see the output. Fixes #28308. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-25 14:58:47 +02:00
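The behaviour described in the throwing_assert() commit above can be sketched in a few lines. This is an illustration only, not Scylla's implementation: `sketch_on_internal_error()`, `SKETCH_THROWING_ASSERT` and `deref_or_throw()` are hypothetical names, and the stand-in error function only throws, whereas the real on_internal_error() also logs the message and a backtrace.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for on_internal_error(); the real function also logs
// the message and a backtrace. Here it only throws.
[[noreturn]] inline void sketch_on_internal_error(const std::string& msg) {
    throw std::runtime_error(msg);
}

// Checked in every build type. On failure it throws instead of aborting,
// so only the current operation is unwound, not the whole process.
#define SKETCH_THROWING_ASSERT(cond)                                            \
    do {                                                                        \
        if (!(cond)) {                                                          \
            sketch_on_internal_error(std::string("assertion failed: ") + #cond); \
        }                                                                       \
    } while (0)

// Example call site: one line replaces the explicit if() + error string.
inline int deref_or_throw(const int* p) {
    SKETCH_THROWING_ASSERT(p != nullptr);
    return *p;
}
```

A caller that passes a null pointer gets an exception whose text names the failed condition, and code after the assertion never runs.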
Wojciech Mitros	d1ff8f1db3	docs: add strong consistency doc Add a new docs/dev document for the strongly consistent tables feature. For now, it only contains information about the Raft metadata persistence, but it should be updated as more of the strong-consistency components are added.	2026-02-25 12:34:58 +01:00
Wojciech Mitros	97dc88d6b6	test/cluster: add tests for strongly-consistent tables' metadata persistence In this patch we add various tests for checking how strongly consistent tables work while allowing their tablets to reside on non-0 shards and while using the new persistent storage for their raft metadata. The tests verify that: - strongly consistent tables' tablets can be allocated on different shards and we can write/read from them - the raft metadata is persistent across restarts even with disruptions - the sharder correctly routes metadata queries to specified shards - we can correctly perform multi-shard reads from the metadata tables - we can read using just the group_id (without shard) using ALLOW FILTERING For the tests we add logging to the sharder and partitioner and we add some extra logs for observability.	2026-02-25 12:34:58 +01:00
Wojciech Mitros	f841c0522d	raft: enable multi-shard raft groups for strongly consistent tablets In this patch we allow strongly consistent tables to have tablets on shards different than 0. For that, we remove the checks for shard 0 for the non-group0 raft groups, and we allow the tablet allocator to place tablets of strongly consistent tables on shards different than 0. We also start using the new storage (raft::persistence) for strongly consistent tables, added in the preceding commits.	2026-02-25 12:34:58 +01:00
Wojciech Mitros	ffe32e8e4d	test/raft: add unit tests for raft_groups_storage Most functions of the new storage for raft groups for strongly consistent tables are the same as for the system raft table storage, so we reuse the tests for them to test the new storage. We add additional tests for checking the new raft groups partitioner and sharder, and for verifying that writes using storages for different shards do not affect the data read on different shards. We also add a test for checking the snapshot_descriptor present after the storage bootstrap - for both system and strongly consistent storages we check that the storage contains the initial descriptor.	2026-02-25 12:34:58 +01:00
Wojciech Mitros	16977d7aa0	raft: add raft_groups_storage persistence class Add raft_groups_storage, a raft::persistence implementation for strongly consistent tablet groups. Currently, it's almost an exact copy of the raft_sys_table_storage that uses the new raft tables for strongly consistent tables (raft_groups, raft_groups_snapshots, raft_groups_snapshot_config) which have a (shard, group_id) partition key. In the future, the mutation, term and commit_idx data will be stored differently for strongly consistent tables than for group0, which will differentiate this class from the original raft_sys_table_storage. The storage is created for each raft group server and it takes a shard parameter at construction time to ensure all queries target the correct partition (and thus shard).	2026-02-25 12:34:58 +01:00
Wojciech Mitros	654fe4b1ca	db: add system tables for strongly consistent tables' raft groups Add three new system tables for storing raft state for strongly consistent tablets, corresponding to the tables for group0: - system.raft_groups: Stores the raft log, term/vote, snapshot_id, and commit_idx for each tablet's raft group. - system.raft_groups_snapshots: Stores snapshot descriptors (index, term) for each group. - system.raft_groups_snapshot_config: Stores the raft configuration (current and previous voters) for each snapshot. These tables use a (shard, group_id) composite partition key with the newly added raft_groups_partitioner and raft_groups_sharder, ensuring data is co-located with the tablet replica that owns the raft group. The tables are only created when the STRONGLY_CONSISTENT_TABLES experimental feature is enabled.	2026-02-25 12:34:58 +01:00
Wojciech Mitros	cb0caea8bf	dht: add fixed_shard_partitioner and fixed_shard_sharder Add a custom partitioner and sharder that will be used for Raft tables for strongly consistent tables. These tables will have partition keys of the form (shard, group_id) and the partitioner creates tokens that encode the target shard in the high 16 bits. Token layout: [shard: 16 bits][partition key hash: 48 bits] This encoding guarantees that raft group data will be located on the same shard as the tablet replica corresponding to that raft group as long as we use the tablet replica's shard as the value in the partition key. Storing the shard directly in the partition key avoids additional lookups for request routing to the incoming new raft tables. For even more simplicity, we avoid biasing between uint64_t and int64_t by limiting the acceptable shard ids to at most 32767 (leaving the top bit 0), which results in the same value of the token when interpreting it either as uint64_t or int64_t. The sharder decodes the shard by extracting the high bits, which is shard-count independent. This allows the partition key:shard mapping to remain the same even during smp changes (only increases are allowed, the same limitation as for tablets).	2026-02-25 12:34:51 +01:00
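The token layout in the commit above ([shard: 16 bits][partition key hash: 48 bits], shard ids capped at 32767) can be illustrated with plain bit arithmetic. This is a sketch under stated assumptions, not the fixed_shard_partitioner code: `encode_token()`, `decode_shard()` and the constants are hypothetical names, and the key hash is taken as an already-computed 64-bit value.

```cpp
#include <cstdint>
#include <stdexcept>

constexpr unsigned hash_bits = 48;                              // low 48 bits: key hash
constexpr uint64_t hash_mask = (uint64_t(1) << hash_bits) - 1;
constexpr uint64_t max_shard = 32767;                           // keeps the top bit 0

// Hypothetical encoder: shard in the high 16 bits, truncated hash below it.
inline int64_t encode_token(uint64_t shard, uint64_t key_hash) {
    if (shard > max_shard) {
        throw std::out_of_range("shard id exceeds 32767");
    }
    // The top bit is 0, so the value is identical as uint64_t and int64_t.
    return static_cast<int64_t>((shard << hash_bits) | (key_hash & hash_mask));
}

// Hypothetical decoder: depends only on the token, not on the shard count.
inline uint64_t decode_shard(int64_t token) {
    return static_cast<uint64_t>(token) >> hash_bits;
}
```

Because decoding never consults the current shard count, the key-to-shard mapping survives an smp increase, which is the property the commit message relies on.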
Botond Dénes	99244179f7	Merge 'CQL transport: Add histogram-based request/response size tracking' from Amnon Heiman This series closes a gap in how CQL request and response sizes are reported. Previously, request_size and response_size were tracked as simple counters, providing only cumulative totals per shard. This made it difficult to understand the distribution of message sizes and identify potential issues with very large or very small requests. After this series, the CQL transport reports detailed histogram metrics showing the distribution of request and response sizes. These histograms are tracked per-instance, per-type (per ops), and per-scheduling-group, providing much better visibility into CQL traffic patterns. The histograms are collected for QUERY, EXECUTE, and BATCH operations, which are the primary data path operations where message size distribution is most relevant. This data can help identify: - Clients sending unexpectedly large requests - Operations with oversized result sets - Scheduling group differences in traffic patterns To support this, the series extends the approx_exponential_histogram template to handle accurate sum, adds a bytes_histogram type alias optimized for byte-range measurements (1KB to 1GB). The existing per-shard counter metrics are maintained for backward compatibility. 
Metrics example: ``` scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808 scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409 scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631 scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809 scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079 scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98 scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432 scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079 scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57 
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57 ``` The field sees it as an important issue Fixes #14850 Closes scylladb/scylladb#28419 * github.com:scylladb/scylladb: test/boost/estimated_histogram_test.cc: Switch to real Sum transport/server: to bytes_histogram approx_exponential_histogram: Add sum() method for accurate value tracking utils/estimated_histogram.hh: Add bytes_histogram	2026-02-25 13:05:18 +02:00
Yaron Kaikov	98494e08eb	ci: harden trigger-scylla-ci workflow against credential leaks and untrusted PRs refs: https://github.com/scylladb/scylladb/security/advisories/GHSA-wrqg-xx2q-r3fv - Remove -v and -i flags from curl to prevent credentials from being logged in workflow output - Move PR_NUMBER and PR_REPO_NAME into the env block with proper quoting to prevent shell injection via crafted PR metadata - Add org membership verification step for pull_request_target events so that only PRs from scylladb org members can trigger Jenkins CI Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-796 Closes scylladb/scylladb#28785	2026-02-25 12:42:18 +02:00
Avi Kivity	511fab1f28	gossiper: exit failure detector sleep faster When running unit tests, there's a visible ~1-second sleep when gossip exits the failure detector loop. Improve this by adding a condition variable for exiting the loop and signaling it when any of the exit conditions are satisfied: the abort_source is pulled, the gossiper is shut down, or the sleep is complete. We can't just use the abort_source because gossip can be shut down independently of the rest of the system. To see the improvement, I ran cql_query_test in dev mode: Before: $ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2 > /dev/null 2>&1 real 2m26.904s user 0m24.307s sys 0m13.402s After: $ time ./build/dev/test/boost/combined_tests -t cql_query_test -- --smp 2 > /dev/null 2>&1 real 0m26.579s user 0m24.671s sys 0m13.636s Two minutes of real-time saved. Real-life improvement in test.py will be lower, because of the overhead of launching pytest for each test case. Closes scylladb/scylladb#28649	2026-02-25 11:41:02 +02:00
Andrzej Jackowski	e2c4b0a733	config: add write_consistency_levels_* guardrails configuration Add guardrails configuration that can be used later in this patch series. Refs: SCYLLADB-259	2026-02-25 10:30:03 +01:00
Avi Kivity	5baf16005f	build: install antlr3 from maven + source, not rpm packages Fedora removed the C++ backend from antlr3 [1], citing incompatible license. The license in question (the Unicode license) is fine for us. To be able to continue using antlr3, build it ourselves. The main executable can be used as is from Maven, since we don't need any patches for the parser. The runtime needs to be patched, so we download the source and patch it. Regenerated frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz Fixes https://scylladb.atlassian.net/browse/SCYLLADB-773 Closes scylladb/scylladb#28765	2026-02-25 11:03:19 +02:00
Andrei Chekun	729bad77b1	test.py: add possibility to run downloaded Scylla binary Add the possibility to run a stored Scylla binary or to download the relocatable Scylla package. Closes scylladb/scylladb#28787	2026-02-25 10:23:19 +02:00
Łukasz Paszkowski	9ade0b23da	reader_concurrency_semaphore: set _ex in on_preemptive_abort() When a permit is preemptively aborted, store the corresponding exception in permit's member: `reader_permit::impl::_ex`. This makes preemptively-aborted permits consistently report aborted() and prevents them from being treated as eligible for inactive registration in `register_inactive_read()`, avoiding assertion failures on unexpected permit state. Closes scylladb/scylladb#28591	2026-02-25 10:20:06 +02:00
Grzegorz Burzyński	b4f0eb666f	packaging: add systemctl command to dependencies scylladb/scylla container image doesn't include systemctl binary, while it is used by perftune.py script shipped within the same image. Scylla Operator runs this script to tune Scylla nodes/containers, expecting its all dependencies to be available in the container's PATH. Without systemctl, the script fails on systems that run irqbalance (e.g., on EKS nodes) as the script tries to reconfigure irqbalance and restart it via systemctl afterwards. Fixes: scylladb/scylla-operator#3080 Closes scylladb/scylladb#28567	2026-02-25 10:19:32 +02:00
Botond Dénes	56cc7bbeec	Merge 'Allow "global" snapshot using topology coordinator + add tablet metadata to manifest' from Calle Wilund Refs: SCYLLADB-193 Adds a "snapshot_table" topology operation and associated data structure/table columns to support dispatching a snapshot operation as a topo coordinator op. Logic is similar, and thus broken out and semi-shared with, truncation. Also adds optional tablet metadata to manifest, listing all tablets present in a given snapshot, as well as tablet sstable ownership, repair status, and token ranges. As per description in SCYLLADB-193, the alternative snapshot mechanism is in a separate namespace under 'tablets', which while dubious is the desired destination. The API is accessed via `nodetool cluster snapshot`, which more or less mirrors `nodetool snapshot`, but using topo op. TTL is added to message propagation as a separate patch here, since it is not (yet) used from API (or nodetool). Requires a syntax for both API and command line. Closes scylladb/scylladb#28525 * github.com:scylladb/scylladb: topology::snapshot: Add expiry (ttl) to RPC/topo op test_snapshot_with_tablets: Extend test to check manifest content table::manifest: Add tablet info to manifest.json test::test_snapshot_with_tablets: Add small test for topo coordinated snapshot scylla-nodetool: Add "cluster snapshot" command api::storage_service: Add tablets/snapshots command for cluster level snapshot db::snapshot-ctl: Add method to do snapshot using topo coordinator storage_proxy: Add snapshot_keyspace method topology_coordinator: Add handler for snapshot_tables storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS messaging_service: Add SNAPSHOT_WITH_TABLETS verb feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature topology_mutation: Add setter for snapshot part of row system_keyspace::topology_requests_entry: Add snapshot info to table topology_state_machine: Add snapshot_tables operation topology_coordinator: Break out logic from 
handle_truncate_table storage_proxy: Break out logic from request_truncate_with_tablets test/object_store: Remove create_ks_and_cf() helper test/object_store: Replace create_ks_and_cf() usage with standard methods test/object_store: Shift indentation right for test cases	2026-02-25 10:17:53 +02:00
Botond Dénes	166e245097	Merge 'test.py: Topology test pytest integration' from Andrei Chekun Migrate cluster tests directory to be handled by pytest. This is the next step in the process of unifying the tests and migrating to pytest. With this PR, cluster tests will be executed with the full path to the file instead of the `suite/test` paradigm. Backport is not needed because it is a framework enhancement. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-46 Closes scylladb/scylladb#27618 * github.com:scylladb/scylladb: test.py: remove setsid from the framework test.py: rename suite.yaml to test_config.yaml test.py: add cluster tests to be executed by pytest test.py: add random seed for topology tests reproducibility test.py: add explicit default values to pytest options test.py: replace SCYLLA env var with build_mode fixture	2026-02-25 10:17:20 +02:00
Botond Dénes	9dff9752b4	Merge 'Fix regression in Alternator TTL with tablets and node going down' from Nadav Har'El Recently we suffered a regression in how Alternator TTL behaves when a node goes down while tablets are used. Usually, expiration of data in a particular tablet is handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expirations until the primary replica comes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same, the expiration of that tablet will never be done. It turns out that recently, in commits `817fdad` and `d88036d`, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica. Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function. So this series, in addition to fixing the bug, adds two tests that reproduce this bug (fail before the fix, pass with the fix): 1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica() 2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason). Fixes SCYLLADB-777. 
Closes scylladb/scylladb#28771 * github.com:scylladb/scylladb: test: add unit test for tablet_map::get_secondary_replica() test, alternator: add test for TTL expiration with a node down locator: fix get_secondary_replica() to match get_primary_replica()	2026-02-25 10:13:55 +02:00
Gleb Natapov	0f8cdd81f3	group0: fix indentation after previous patch	2026-02-25 10:08:32 +02:00
Gleb Natapov	7d7cbae763	raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more No need for locking any more so the function may just return a value and be synchronous.	2026-02-25 10:08:32 +02:00
Gleb Natapov	0689fb5ab2	raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream	2026-02-25 10:08:32 +02:00
Gleb Natapov	cd76604c79	raft_group0: remove unused code from raft_group0 Also do not pass raft_replace_info into setup_group0 since it has not been used there for a long time now.	2026-02-25 10:08:32 +02:00
Gleb Natapov	6173ea476b	node_ops: remove topology over node ops code The code is no longer called.	2026-02-25 10:08:32 +02:00
Gleb Natapov	758d1c9c39	topology: fix indentation after the previous patch	2026-02-25 10:08:31 +02:00
Gleb Natapov	67cd5755b2	topology: drop topology_change_enabled parameter from raft_group0 code Since the parameter is always true there is no point to pass it everywhere. Just assume it is true at the point of use.	2026-02-25 10:08:31 +02:00
Gleb Natapov	50da142e77	storage_service: remove unused handle_state_* functions The handlers are no longer called.	2026-02-25 10:08:31 +02:00
Gleb Natapov	1a57f2b22d	gossiper: drop wait_for_gossip_to_settle and deprecate corresponding option The function is unused now and the option that allows skipping the wait is no longer needed as well.	2026-02-25 10:08:31 +02:00
Gleb Natapov	aa0f103eb9	storage_service: fix indentation after the last patch	2026-02-25 10:08:31 +02:00
Gleb Natapov	be6cced978	storage_service: remove gossiper bootstrapping code Remove code that is responsible for bootstrapping a node in gossiper mode since the mode is no longer supported.	2026-02-25 10:08:31 +02:00
Gleb Natapov	8776d00c44	storage_service: drop get_group_server_if_raft_topolgy_enabled Raft topology is always enabled now so the function does not make sense any longer.	2026-02-25 10:08:30 +02:00
Gleb Natapov	5fa4f5b08f	storage_service: drop is_topology_coordinator_enabled and its uses The code can now assume that topology coordinator is enabled.	2026-02-25 10:08:30 +02:00
Gleb Natapov	1e8d4436c7	storage_service: drop run_with_api_lock_in_gossiper_mode_only Since gossiper mode no longer exists the function is no longer needed.	2026-02-25 10:08:30 +02:00
Gleb Natapov	a8a167623a	topology: remove code that assumes raft_topology_change_enabled() may return false The patch removes the code protected by !raft_topology_change_enabled() since it is no longer reachable. Drop test_lwt_for_tablets_is_not_supported_without_raft since non-raft mode is no longer supported.	2026-02-25 10:08:30 +02:00
Botond Dénes	8dbcd8a0b3	tools/scylla-sstable: create_table_in_cql_env(): register UDTs recursively It is not enough to go over all column types and register the UDTs. UDTs might be nested in other types, like collections. One has to do a traversal of the type tree and register every UDT on the way. That is what this patch does. This function is used by the query and write operations, which should now both work with nested UDTs. Add a test which fails before and passes after this patch.	2026-02-25 08:51:25 +02:00
Botond Dénes	cf39a5e610	tools/scylla-sstable: generalize dump_if_user_type Rename to invoke_on_user_type() and make the action taken on user types a function parameter. Enables reuse of the traverse logic by other code.	2026-02-25 08:51:25 +02:00
Botond Dénes	80049c88e9	tools/scylla-sstable: move dump_if_user_type() definition So it can be used by create_table_in_cql_env() code.	2026-02-25 08:51:25 +02:00
Dario Mirovic	3222a1a559	dtest: shorten default sleep step in wait_for Default sleep step of 1s is too long. Reduce it to make the test environment more responsive and faster. Refs SCYLLADB-573	2026-02-25 03:17:47 +01:00
Dario Mirovic	51e7c2f8d9	dtest: wait_for speedup Audit tests have been slow. They rely on wait_for function. This function first sleeps for the duration of the time step specified, and then calls the given function. The audit tests need 0.02-0.03 seconds for the given function, but the operation lasts around 1.02-1.03 seconds, since step is 1 second. This patch modifies wait_for dtest function so it first executes the given function, and afterwards calls time.sleep(step). This reduces time needed for the given function from 1.03 to 0.03 seconds. Total audit tests suite speedup is 3x. On the developer machine the time is reduced from 13+ minutes to 4 minutes. This patch also improves performance of some alternator tests that use the same wait_for dtest function. Refs SCYLLADB-573	2026-02-25 03:17:46 +01:00
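The reordering described in the wait_for commit above (call the check first, sleep only after a failed attempt) can be sketched as follows. The dtest helper itself is Python; this C++ version only mirrors the control flow, and `wait_for_sketch()` and its parameters are hypothetical names rather than the dtest signature.

```cpp
#include <chrono>
#include <functional>
#include <thread>

// Polls `check` until it returns true or `timeout` elapses.
// The check runs before any sleep, so a condition that already holds
// returns immediately instead of costing one full `step`.
inline bool wait_for_sketch(const std::function<bool()>& check,
                            std::chrono::milliseconds timeout,
                            std::chrono::milliseconds step) {
    const auto deadline = std::chrono::steady_clock::now() + timeout;
    while (true) {
        if (check()) {
            return true;
        }
        if (std::chrono::steady_clock::now() >= deadline) {
            return false;
        }
        std::this_thread::sleep_for(step);  // sleep only after a failed attempt
    }
}
```

With a sleep-first loop, a check that takes about 0.03 seconds and succeeds on the first call still costs a whole step; with the order above it costs only the check itself.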
Andrei Chekun	1b92b140ee	test.py: improve stdout output for boost test The current way of checking the boost's stdout can have a race condition when pytest will try to read the file before it was really flushed. So this PR should eliminate this possibility. Closes scylladb/scylladb#28783	2026-02-25 00:50:25 +01:00
Ferenc Szili	f70ca9a406	load_stats: implement the optimized sum of tablet sizes PR #28703 was merged into master but not with the latest version of the changes. This patch is an incremental fix for this. Currently, the elements of the tablet_sizes_per_shard vector are incremented in separate shards. This is prone to false sharing of cache lines, and ping-pong of memory, which leads to reduced performance. In this patch, in order to avoid cache line collisions while updating the sum of tablet sizes per shard, we align the counter to 64 bytes. Fixes: SCYLLADB-678 Closes scylladb/scylladb#28757	2026-02-24 22:19:25 +01:00
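The 64-byte alignment mentioned in the load_stats commit above can be illustrated with a padded counter type. This is a generic sketch of the technique, not the patch itself: `padded_counter` and `total()` are hypothetical names, and 64 bytes is assumed to be the cache-line size, matching the figure in the commit message.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Each counter occupies its own 64-byte slot, so updates made by different
// shards never touch the same cache line.
struct alignas(64) padded_counter {
    uint64_t value = 0;
};

static_assert(alignof(padded_counter) == 64, "counter must be cache-line aligned");
static_assert(sizeof(padded_counter) == 64, "counter must fill exactly one slot");

// Sums the per-shard counters once all shards have finished updating them.
inline uint64_t total(const std::vector<padded_counter>& per_shard) {
    uint64_t sum = 0;
    for (const auto& c : per_shard) {
        sum += c.value;
    }
    return sum;
}
```

The cost is memory: each 8-byte counter now takes 64 bytes, which is usually acceptable for a vector with one entry per shard.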
Marcin Maliszkiewicz	aa7816882e	test: add test_uninitialized_conns_semaphore Runtime in dev mode: 2s	2026-02-24 17:28:51 +01:00
Alex	5557770b59	test_mv_build_during_shutdown started two async CREATE MATERIALIZED VIEW operations and never awaited them (asyncio.gather(...) without await). This PR adds an await for each of the tasks, so that the test waits for the MV schema to be added successfully before starting the server shutdown. With this change we will not get the shutdown races. Closes scylladb/scylladb#28774	2026-02-24 17:25:05 +01:00
Anna Stuchlik	64b1798513	doc: remove redundant Java-related information This commit removes: - Instructions to install scylla-jmx (and all references) - The Java 11 requirement for Ubuntu. Fixes https://github.com/scylladb/scylladb/issues/28249 Fixes https://github.com/scylladb/scylladb/issues/28252 Closes scylladb/scylladb#28254	2026-02-24 14:37:39 +01:00
Anna Stuchlik	e2333a57ad	doc: remove the tablets limitation for Alternator This commit removes the information that Alternator doesn't support tablets. The limitation is no longer valid. Fixes SCYLLADB-778 Closes scylladb/scylladb#28781	2026-02-24 14:24:30 +02:00
Andrzej Jackowski	cd4caed3d3	test: fix configuration of test_autoretrain_dict `test_autoretrain_dict` sporadically fails because the default compression algorithm was changed after the test was written. `9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by changing the compression configuration during node startup. However, the configuration change had an incorrect YAML format and was ignored by ScyllaDB. This commit fixes it. Fixes: scylladb/scylladb#28204 Closes scylladb/scylladb#28746	2026-02-24 12:08:44 +01:00
Botond Dénes	067bb5f888	test/scylla_gdb: skip coroutine tests if coroutine frame is not found For a while, we have seen coroutine related tests (those that use the coroutine_task fixture) fail occasionally, because no coroutine frame is found. Multiple attempts were made to make this problem self-diagnosing and dump enough information to be able to debug this post-mortem. To no avail so far. A lot of time was invested into this benign issue: see the long discussion at https://github.com/scylladb/scylladb/issues/22501. It is not known if the bug is in gdb, or the gdb script trying to find the coroutine frame. In any case, both are only used for debugging, so we can tolerate occasional failures -- we are forced to do so when working with gdb anyway. Instead of piling on more effort there, just skip these tests when the problem occurs. This solves the CI flakiness. Fixes: #22501 Closes scylladb/scylladb#28745	2026-02-24 10:12:03 +01:00
Marcin Maliszkiewicz	d5684b98c8	test: cluster: add continue-after-error to perf tool tests Add --continue-after-error true to perf-cql-raw and perf-alternator tests, and --stop-on-error false to perf-simple-query test, so that tests don't abort on the first error. The reason for this is that the tests are flaky, with this example failure: Perf test failed: std::runtime_error (server returned ERROR to EXECUTE) When CPU is starved on CI we can return timeouts and/or other errors. The change should make tests more robust at the expense of a smaller test scope. But those tests were written mostly to test the startup sequence, as it differs from Scylla's startup. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-759 Closes scylladb/scylladb#28767	2026-02-24 11:08:34 +02:00
Andrei Chekun	d9ce2db1a3	test.py: remove setsid from the framework With the previous architecture, scylla servers were handled by test.py and if pytest failed, test.py was responsible for stopping scylla processes. Now, with only pytest handling them, there is no such mechanism. That's why I'm removing the setsid, so that when the parent pytest process closes it will automatically close all children, including any process started during testing. This avoids leaving any scylla process behind in case pytest is killed.	2026-02-24 09:48:38 +01:00
Andrei Chekun	d3f5f7468c	test.py: rename suite.yaml to test_config.yaml Switch of discovery of the tests by test.py	2026-02-24 09:48:38 +01:00
Andrei Chekun	e439ec9d67	test.py: add cluster tests to be executed by pytest cluster tests are now executed by pytest also Run pytest in an executor to avoid blocking the event loop, allowing resource monitoring to run concurrently Logic for passing the arguments to pytest changed due to the fact that almost all directories now executed by pytest and directories that are not handled excluded in pytest.ini Modify the threads count for debug mode, because with the default logic CI agents die	2026-02-24 09:48:38 +01:00
Andrei Chekun	edf7154fee	test.py: add random seed for topology tests reproducibility Set TOPOLOGY_RANDOM_FAILURES_TEST_SHUFFLE_SEED environment variable during pytest configuration to ensure that all xdist workers will discover the same scope of the tests. This is a known limitation of the xdist plugin where the discovered tests should be consistent across master and workers.	2026-02-24 09:48:38 +01:00
Andrei Chekun	4a7d8cd99d	test.py: add explicit default values to pytest options Add explicit default values to pytest command line options to prevent issues when running tests with pytest's parallel execution where options are not present on upper conftest, so they're just not set at all.	2026-02-24 09:48:38 +01:00
Andrei Chekun	99234f0a83	test.py: replace SCYLLA env var with build_mode fixture Replace direct usage of SCYLLA environment variable with the build_mode pytest fixture and path_to helper function. This makes tests more flexible and consistent with the test framework. Also this allows to use tests with xdist, where environment variable can be left in the master process and will not be set in the workers Add using the fixture to get the scylla binary from the suite, this will align with getting relocatable Scylla exe.	2026-02-24 09:48:38 +01:00
Avi Kivity	0add130b9d	lua: avoid undefined behavior when converting doubles to integers Lua doesn't have separate integer and floating point numbers, so we check if a number can fit in an integer and if so convert it to an integer. The conversion routine invokes undefined behavior (and even acknowledges it!). More recent compilers changed their behavior when casting infinities, breaking test_user_function_double_return which tests this conversion. Fix by tightening the conversion to not invoke undefined behavior. Closes scylladb/scylladb#28503	2026-02-24 10:41:21 +02:00
Botond Dénes	1d5b8cc562	Merge 'Fix use after free in auth cache' from Marcin Maliszkiewicz This patchset: - ensures the loading semaphore is acquired in cross-shard callbacks - fixes iterator invalidation problem when reloading all cached permissions Fixes https://scylladb.atlassian.net/browse/SCYLLADB-780 Backport: no, affected code not released yet Closes scylladb/scylladb#28766 * github.com:scylladb/scylladb: auth: cache: fix permissions iterator invalidation in reload_all_permissions auth/cache: acquire _loading_sem in cross-shard callbacks	2026-02-24 10:35:46 +02:00
Pavel Emelyanov	5a5eb67144	vector_search/dns: Use newer seastar get_host_by_name API The hostent::addr_list is deprecated in favor of address_entry::addr field that contains the very same addresses. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28565	2026-02-23 21:28:43 +02:00
Pavel Emelyanov	6b02b50e3d	Merge 'object_storage: add retryable machinery to object storage' from Ernest Zaslavsky - add an overload to the rest http client to accept retry strategy instance as an argument - remove hand rolled error handling from object storage client and replace with common machinery that supports handling and retrying when appropriate No backport needed since it is only refactoring Closes scylladb/scylladb#28161 * github.com:scylladb/scylladb: object_storage: add retryable machinery to object storage rest_client: add `simple_send` overload	2026-02-23 21:28:51 +03:00
Wojciech Mitros	c1b3fec11a	raft: add group_id -> shard mapping to raft_group_registry To handle RPC from other nodes, we need to be able to redirect the requests for each raft group to the shard that owns it. We need to be able to do the redirection on all shards, so to achieve that, on all shards we need to store the information about which shard is occupied by each Raft group server. For that we add a group_id -> shard mapping to the raft_group_registry. The mapping is filled out when starting raft servers, it's emptied when we abort raft servers. We use it when registering RPC verb handlers, so that regardless of the shard handling the RPC, the work on the raft group can be performed on the corresponding shard.	2026-02-23 15:34:56 +01:00
Nadav Har'El	e463d528fe	test: add unit test for tablet_map::get_secondary_replica() This patch adds a unit test for tablet_map::get_secondary_replica(). It was never officially defined how the "primary" and "secondary" replicas were chosen, and their implementation changed over time, but the one invariant that this test verifies is that the secondary replica and the primary replica must be different nodes. This test reproduces issue SCYLLADB-777, where we discovered that get_primary_replica() changed without a corresponding change to get_secondary_replica(). So before the previous patch, this test failed, and after the previous patch - it passes. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-23 16:19:43 +02:00
Nadav Har'El	0c7f499750	test, alternator: add test for TTL expiration with a node down We have many single-node functional tests for Alternator TTL in test/alternator/test_ttl.py. This patch adds a multi-node test in test/cluster/test_alternator.py. The new test verifies that: 1. Even though Alternator TTL splits the work of scanning and expiring items between nodes, all the items get correctly expired. 2. When one node is down, all the items still expire because the "secondary" owner of each token range takes over expiring the items in this range while the "primary" owner is down. This new test is actually a port of a test we already had in dtest (alternator_ttl_tests.py::test_multinode_expiration). This port is faster and smaller than the original (fewer nodes, fewer rows), but it still found a regression (SCYLLADB-777) that dtest missed - the new test failed when running with tablets and in release build mode. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-23 16:19:43 +02:00
Nadav Har'El	9ab3d5b946	locator: fix get_secondary_replica() to match get_primary_replica() The function tablet_map::get_secondary_replica() is used by Alternator TTL to choose a node different from get_primary_replica(). Unfortunately, recently (commits `817fdad` and d88037d) the implementation of the latter function changed, without changing the former. So this patch changes the former to match. The next two patches will have two tests that fail before this patch, and pass with it: 1. A unit test that checks that get_secondary_replica() returns a different node than get_primary_replica(). 2. An Alternator TTL test that checks that when a node is down, expirations still happen because the secondary replica takes over the primary replica's work. Fixes SCYLLADB-777 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-23 16:19:30 +02:00
Botond Dénes	dcd8de86ee	Merge 'docs: update a documentation of adding/removing DC and rebuilding a node' from Aleksandra Martyniuk Describe a procedure to convert tablet keyspace replication factor to rack list. Update the procedures of adding and removing a node to consider tablet keyspaces. Fixes: [SCYLLADB-398](https://scylladb.atlassian.net/browse/SCYLLADB-398) Fixes: https://github.com/scylladb/scylladb/issues/28306. Fixes: https://github.com/scylladb/scylladb/issues/28307. Fixes: https://github.com/scylladb/scylladb/issues/28270. Needs backport to all live branches as they all include tablets. [SCYLLADB-398]: https://scylladb.atlassian.net/browse/SCYLLADB-398?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#28521 * github.com:scylladb/scylladb: docs: update nodetool rebuild docs docs: update a procedure of decommissioning a DC docs: update a procedure of adding a DC docs: describe upgrade to enforce_rack_list option docs: describe conversion to rack-list RF	2026-02-23 16:15:16 +02:00
Andrei Chekun	6ae58c6fa6	test.py: move storage tests to cluster subdirectory Move the storage test suite from test/storage/ to test/cluster/storage/ to consolidate related cluster-based tests. This removes the standalone test/storage/suite.yaml as the tests will use the cluster's test configuration. Initially these tests were in cluster, but to use unshare at first iteration they were moved outside. Now they are using another way to handle volumes without unshare, they should be in cluster Closes scylladb/scylladb#28634	2026-02-23 16:14:15 +02:00
Gleb Natapov	e23af998e1	test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode They were running in recovery to reuse existing system tables without group0 id, but since we want to remove recovery mode we need to re-generate the tables.	2026-02-23 14:54:24 +02:00
Gleb Natapov	f589740a39	test: schema_change_test: drop schema tests relevant for no raft mode only They were running in no longer supported recovery mode to force gossip topology.	2026-02-23 14:54:24 +02:00
Gleb Natapov	92049c3205	topology: remove upgrade to raft topology code We no longer need this code since we expect the cluster to be upgraded before moving to this version.	2026-02-23 14:54:24 +02:00
Gleb Natapov	4a9cf687cc	group0: remove upgrade to group0 code This patch removes ability of a cluster to upgrade from not having group0 to having one. This ability is used in gossiper based recovery procedure that is deprecated and removed in this version. Also remove tests that use the procedure.	2026-02-23 14:54:24 +02:00
Gleb Natapov	dcafb5c083	group0: refuse to boot if a cluster is still not in raft topology mode We are going to drop legacy topology mode (gossiper mode) and no longer allow ScyllaDB to start in this mode. This patch refuses to boot if a cluster is not in raft topology mode yet. It may happen if a node of a cluster that is not yet in a raft topology is upgraded to a newer version. If this happens the node has to be downgraded. Raft topology has to be enabled on a cluster and then the node can be upgraded again.	2026-02-23 14:54:24 +02:00
Gleb Natapov	ed52d1a292	storage_service: refuse to join a cluster in legacy mode We are going to drop legacy topology mode (gossiper mode) and no longer allow ScyllaDB to start in this mode. This patch disallows a node to join a cluster that is still in legacy mode. A cluster needs to enable raft mode first.	2026-02-23 14:54:24 +02:00
Marcin Maliszkiewicz	c5dc086baf	Merge 'vector_search: return NaN for similarity_cosine with all-zero vectors' from Dawid Pawlik The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine. When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour. To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2]. If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles. It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results. Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour. Fixes: SCYLLADB-456 Backport to 2026.1 needed, as it fixes the bug for ANN vector queries using rescoring introduced there. Closes scylladb/scylladb#28609 * github.com:scylladb/scylladb: test/vector_search: add reproducer for rescoring with zero vectors vector_search: return NaN for similarity_cosine with all-zero vectors	2026-02-23 13:10:44 +01:00
Aleksandra Martyniuk	9ccc95808f	docs: update nodetool rebuild docs Update nodetool rebuild docs to mention that the command does not work for tablet keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/28270.	2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk	e4c42acd8f	docs: update a procedure of decommissioning a DC Update a procedure of decommissioning a DC for tablet keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/28307.	2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk	1c764cf6ea	docs: update a procedure of adding a DC Update a procedure of adding a DC for tablet keyspaces. Fixes: https://github.com/scylladb/scylladb/issues/28306.	2026-02-23 12:45:01 +01:00
Aleksandra Martyniuk	e08ac60161	docs: describe upgrade to enforce_rack_list option	2026-02-23 12:44:57 +01:00
Aleksandra Martyniuk	eefe66b2b2	docs: describe conversion to rack-list RF Fixes: SCYLLADB-398	2026-02-23 12:41:33 +01:00
Marcin Maliszkiewicz	54dca90e8c	Merge 'test: move dtest/guardrails_test.py to test_guardrails.py' from Andrzej Jackowski This patch series moves `test/cluster/dtest/guardrails_test.py` to `test/cluster/test_guardrails.py`, and migrates it from `cluster/dtest/` to `cluster/` framework. There are two motivations for moving the test: - Execution time reduction (from 12s to 9s in 'dev' in my env) - Facilitate adding new tests to the `guardrails_test.py` file No backport, `dtest/guardrails_test.py` is only on master Closes scylladb/scylladb#28737 * github.com:scylladb/scylladb: test: move dtest/guardrails_test.py to test_guardrails.py test: prepare guardrails_test.py to be moved to test/cluster/	2026-02-23 12:34:43 +01:00
Marcin Maliszkiewicz	1293b94039	auth: cache: fix permissions iterator invalidation in reload_all_permissions The inner loops in reload_all_permissions iterate role's permissions and _anonymous_permissions maps across yield points. Concurrent load_permissions calls (which don't hold _loading_sem) can emplace into those same maps during a yield, potentially triggering a rehash that invalidates the active iterator. We want to avoid adding semaphore acquire in load_permissions because it's on a common path (get_permissions). Fixing by snapshotting the keys into a vector before iterating with yields, so no long-lived map iterator is held across suspension points.	2026-02-23 12:14:22 +01:00
Calle Wilund	fec7df7cbb	topology::snapshot: Add expiry (ttl) to RPC/topo op Not set yet, but includes it in messages so it can be properly set in calling code. Will add entry to manifest.	2026-02-23 11:37:17 +01:00
Calle Wilund	cc60d014ed	test_snapshot_with_tablets: Extend test to check manifest content Verifies we have the expected tablet info in manifest.	2026-02-23 11:37:17 +01:00
Calle Wilund	f7aa2aacfc	table::manifest: Add tablet info to manifest.json If using tablets, will add info for each tablet present in snapshot, with repair time and token ranges + map each sstable to its owning tablet	2026-02-23 11:37:17 +01:00
Calle Wilund	ae10b5a897	test::test_snapshot_with_tablets: Add small test for topo coordinated snapshot	2026-02-23 11:37:16 +01:00
Calle Wilund	bac81df20f	scylla-nodetool: Add "cluster snapshot" command Similar to "normal" snapshot, but will use the cluster-wide, topology coordinated snapshot API and path.	2026-02-23 11:37:16 +01:00
Calle Wilund	b0604d9840	api::storage_service: Add tablets/snapshots command for cluster level snapshot Calls the topology_coordinator path for snapshots.	2026-02-23 11:37:16 +01:00
Piotr Dulikowski	a4c389413c	Merge 'Hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks' from Alex Dathskovsky This series hardens MV shutdown behavior by fixing lifecycle tracking for detached view-builder callbacks and aligning update handling with the same async dispatch style used by create/drop. Patch 1 refactors on_update_view to use a dedicated coroutine dispatcher (dispatch_update_view), keeping update logic serialized under the existing view-builder lock and consistent with the callback architecture already used for create/drop paths. Patch 2 adds explicit callback lifetime coordination in view_builder: - introduce a seastar::gate member - acquire _ops_gate.hold() when launching detached create/update/drop dispatch futures - keep the hold alive until each detached future resolves - close the gate during view_builder::drain() so shutdown waits for in-flight callback work before final teardown Together, these changes reduce shutdown race exposure in MV event handling while preserving existing behavior for normal operation. Testing: - pytest --test-py-init test/cluster/mv (47 passed, 7 skipped) backport: not required started happening in master fixes: SCYLLADB-687 Closes scylladb/scylladb#28648 * github.com:scylladb/scylladb: db/view: gate detached view-builder callbacks during shutdown db:view: refactor on_update_view to use coroutine dispatcher	2026-02-23 11:28:37 +01:00
Calle Wilund	9680541144	db::snapshot-ctl: Add method to do snapshot using topo coordinator Separated from "local" snapshot.	2026-02-23 11:27:15 +01:00
Calle Wilund	425d6b4441	storage_proxy: Add snapshot_keyspace method Takes set of ks->tables tuples and issues snapshot for each. If feature is enabled, keyspace is non-local, and uses tablets, will issue topo coordinator call across cluster. Keyspaces not fitting the above will just go to "normal" (node local) snapshot.	2026-02-23 11:27:15 +01:00
Calle Wilund	012a065364	topology_coordinator: Add handler for snapshot_tables Similar to truncate, translates the topo op into RPC call to relevant replicas and waits.	2026-02-23 11:27:15 +01:00
Calle Wilund	2bc633c3bd	storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS	2026-02-23 10:44:42 +01:00
Calle Wilund	d1b06482f0	messaging_service: Add SNAPSHOT_WITH_TABLETS verb	2026-02-23 10:44:42 +01:00
Calle Wilund	3075311f21	feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature To detect if cluster can do coordinated snapshot	2026-02-23 10:44:41 +01:00
Calle Wilund	988c5238cf	topology_mutation: Add setter for snapshot part of row	2026-02-23 10:44:41 +01:00
Calle Wilund	8bb81f00f8	system_keyspace::topology_requests_entry: Add snapshot info to table Adds required info to communicate snapshot requests via topology coordinator.	2026-02-23 10:44:38 +01:00
Calle Wilund	642aa44937	topology_state_machine: Add snapshot_tables operation Note: placed after "noop" op, to not confuse ops in a mixed (upgrading) cluster	2026-02-23 10:43:28 +01:00
Calle Wilund	e3d4493bf6	topology_coordinator: Break out logic from handle_truncate_table Makes handle_truncate_table use shared logic, because we would like to reuse some of it for other, coming ops. I.e. snapshot.	2026-02-23 10:43:28 +01:00
Calle Wilund	6e39c3bb83	storage_proxy: Break out logic from request_truncate_with_tablets Makes request_truncate_with_tablets use a parameterized helper, because eventually we will want to use almost identical logic for other ops, like snapshot.	2026-02-23 10:43:28 +01:00
Pavel Emelyanov	ad0c2de0d1	test/object_store: Remove create_ks_and_cf() helper Now all test cases use standard facilities to create data they test Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-23 10:43:28 +01:00
Pavel Emelyanov	6711afd73b	test/object_store: Replace create_ks_and_cf() usage with standard methods To create a keyspace there's the new_test_keyspace helper. The table is created with a single cql.run_async with an explicit schema. The dataset is populated with a single parallel INSERT as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-23 10:43:28 +01:00
Pavel Emelyanov	ed3a326637	test/object_store: Shift indentation right for test cases This is a preparatory patch. The next one will need to replace `foo() bar()` with `with something() as s: foo() bar()`. Effectively -- only add the `with something()` line. Not to shift the whole file right together with that future change, do it here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-23 10:43:28 +01:00
Marcin Maliszkiewicz	75d4bc26d3	auth/cache: acquire _loading_sem in cross-shard callbacks distribute_role() modifies _roles on non-zero shards via invoke_on_others() without holding _loading_sem. Similarly, load_all()'s invoke_on_others() callback calls prune_all() without the semaphore. When these run concurrently with reload_all_permissions(), which iterates _roles across yield points, an insertion can trigger absl::flat_hash_map::resize(), freeing the backing storage while an iterator still references it. Fix by acquiring _loading_sem on the target shard in both distribute_role()'s and load_all()'s invoke_on_others callbacks, serializing all _roles mutations with coroutines that iterate the map.	2026-02-23 10:30:03 +01:00
Pavel Emelyanov	3d07633300	test/object_store: Use itertools.product() for deeply nested loops The test_restore_with_streaming_scopes wants to run some loop body for (almost) all combinations of scope, primary-replica-only and min tablet count. For that three nested loops are used. Using itertools.product() makes the code shorter, less indented and more explicit. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-23 12:28:53 +03:00
Pavel Emelyanov	a9a82f89ac	test/object_store: Replace dataset creation usage with standard methods Two places are fixed 1. The call to create_dataset() is replaced with three "library" methods. This makes it explicit which options and schema are used for that. Eventually, the large and bulky create_dataset will be removed 2. The part that restores data into a fresh new table calls some CQLs by hand, and partially re-uses variables obtained from previous call to create_dataset(). Using the same "library" methods to re-create an empty table makes this part much simpler Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-23 12:27:41 +03:00
Pavel Emelyanov	988606ac7f	test/object_store: Shift indentation right for test_restore_with_streaming_scopes This is a preparatory patch. The next one will need to replace `foo() bar()` with `with something() as s: foo() bar()`. Effectively -- only add the `with something()` line. Not to shift the whole file right together with that future change, do it here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-23 12:27:09 +03:00
Pavel Emelyanov	5161aeee95	test/backup: Run keyspace flush and snapshot taking API in parallel The take_snapshot() helper runs these API sequentially for every server. Running them with asyncio.gather() slightly reduces the wait-time thus improving the total runtime. Before: CPU utilization: 2.1% real 0m33,871s user 0m22,500s sys 0m13,207s After: CPU utilization: 2.4% real 0m29,532s user 0m22,351s sys 0m12,890s Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-23 12:20:36 +03:00
Pavel Emelyanov	21752a43fe	test/backup: Re-use take_snapshot() helper in do_abort_restore() The test in question does _exactly_ what this helper does, but in a longer way. The only difference is that it uses server_id as key to dict with sstable components, but it's easy to tune. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-23 12:20:35 +03:00
Pavel Emelyanov	818a99810c	test/backup: Move take_snapshot() helper up So that it's not in the middle of tests themselves, but near other "helper" functions in the .py file Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-23 12:20:35 +03:00
Ernest Zaslavsky	321d4caf0c	object_storage: add retryable machinery to object storage remove hand rolled error handling from object storage client and replace with common machinery that supports exception handling and retrying when appropriate	2026-02-22 14:00:44 +02:00
Ernest Zaslavsky	24972da26d	rest_client: add `simple_send` overload add an overload to rest client `simple_send` to accept a retry_strategy for http's make_request	2026-02-22 14:00:44 +02:00
Marcin Maliszkiewicz	aba5a8c37f	generic_server: fix waiters count in shed log Capture semaphore waiters count when blocking starts, not after the wait completes.	2026-02-20 17:04:23 +01:00
Marcin Maliszkiewicz	23bed55170	generic_server: scale connection concurrency semaphore by listener count The concurrency semaphore gates uninitialized connections across all do_accepts loops, but was initialized to a fixed value regardless of how many listeners exist. With multiple listeners competing for the same units, each effectively gets less than the configured concurrency. Initialize the semaphore to concurrency - 1 and signal 1 per listen() call, so total capacity is concurrency - 1 + nr_listeners. This guarantees each listener's accept loop can have at least one unit available.	2026-02-20 16:59:19 +01:00
Patryk Jędrzejczak	e8efcae991	Merge 'Use standard ks/cf/data creation methods in object_store/test_basic.py test' from Pavel Emelyanov The test uses create_ks_and_cf helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities. Also it prepares the ground for testing keyspace storage options with rf=3 Cleaning tests, not backporting Closes scylladb/scylladb#28600 * https://github.com/scylladb/scylladb: test/object_store: Remove create_ks_and_cf() helper test/object_store: Replace create_ks_and_cf() usage with standard methods test/object_store: Shift indentation right for test cases	2026-02-20 15:53:38 +01:00
Nadav Har'El	d01915131a	test/cqlpy: make test_indexing_paging_and_aggregation much faster Currently, test_secondary_index.py::test_indexing_paging_and_aggregation is very slow, and the slowest test in the test/cqlpy framework: It takes around 13 seconds on dev build, and because it is CPU-bound (doesn't sleep), it is much slower on debug builds. The reason for this slowness is that it needs to set up and read over 10,000 rows which is the default select_internal_page_size. But after the patches in pull request (#25368), we can configure select_internal_page_size, so in this patch we change the test to temporarily reduce this option to just 50, and then the test can reach the same code paths with just 142 rows instead of 20120 rows before this patch. As a result, the test should now be 140 times faster than it was before. In practice, because of some fixed overheads (the test creates several tables and indexes), in dev build mode the test run speedup is "only" 26-fold (to around half a second). I verified that removing the code added in `bb08af7` indeed makes the new shorter test fail - and this is the only test in test_secondary_index.py that starts to fail besides test_index_paging_group_by which is also related (so my revert didn't just break secondary indexing completely). So the shorter test is still a good regression test. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28268	2026-02-20 15:44:53 +02:00
Avi Kivity	92bc5568c5	tools: toolchain: build sanitizers for future toolchain The future toolchain did not build the sanitizers, so debug executables did not link. Fix by not disabling the sanitizers. Closes scylladb/scylladb#28733	2026-02-20 15:44:24 +02:00
Botond Dénes	6c04e02f66	Merge 'Fix restoration test's validation of streaming directions' from Pavel Emelyanov The test_restore_with_streaming_scopes among other things checks how data streams flow while restoring. Whether or not to check the streams is decided based on the min tablet count value, which is compared with a hardcoded 512. This value of 512 matched the tablet count used by this test until it was "optimized" by #27839, where this number changed to 5 and streaming checks became off. Good news is that the very same checks are still performed by test_refresh_with_streaming_scopes. But it's better to have a working restoration test anyway. Minor test fix, not backporting Closes scylladb/scylladb#28607 * github.com:scylladb/scylladb: test: Fix the condition for streaming directions validation test: Split test_backup.py::check_data_is_back() into two	2026-02-20 15:42:10 +02:00
Botond Dénes	6f88c0dbd3	Merge ' test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance' from Tomasz Grabiec Currently, the test assumes that when 'topology_coordinator_pause_before_processing_backlog: waiting' is logged, the task for decommission must be there. This was based on the assumption that topology coordinator is idle and decommission request wakes it up. But if the server is slow enough, it may still be running the load balancer in reaction to table creation, and block on that injection point before decommission request was added. Fix by waiting for the task to appear rather than the injection. Fixes SCYLLADB-715 Only 2026.1 vulnerable. Closes scylladb/scylladb#28688 * github.com:scylladb/scylladb: test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance test: cluster: task_manager_client: Introduce wait_task_appears() tests: pylib: util: Add exponential backoff to wait_for	2026-02-20 15:05:36 +02:00
Pavel Emelyanov	c96420c015	tests: Re-use manager.get_server_exe() There's a bunch of incremental repair tests that want to call scylla sstable command. For that they try to find where the scylla binary is by scanning the /proc directory (see local_process_id and get_scylla_path helpers). There's a shorter way -- just call manager.get_server_exe(). Same for backup-restore test. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28676	2026-02-20 14:59:30 +02:00
Pavel Emelyanov	a4a0d75eee	test/object_store: Parametrize test_simple_backup_and_restore() There are three tests and a function with a pair of boolean parameters called by those. It's less code if the function becomes a test with parameters. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28677	2026-02-20 14:57:30 +02:00
Pavel Emelyanov	a2e1293f86	test/object_store: Squash two simple-backup tests together The test_backup_simple creates a ks/cf, takes a snapshot, backs it up, then checks that the files were uploaded. The test_backup_move does the same, but also plays with 'move_files' parameter to be true/false. In fact, the "move" test was the copy of "simple" one that dropped check for scheduling group being "streaming" (backup with --move-files can check the same, it's not bad), and check for destination bucket to contain needed files (same here -- checking that files arrived to bucket after --move-files is good). At the end of the day, after the change backup test is run two times, instead of three, and performs extra checks for --move-files case. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28606	2026-02-20 14:49:30 +02:00
Botond Dénes	7e90ed657c	Merge 'Fix `client_options` docs' from Karol Baryła https://github.com/scylladb/scylladb/pull/25746 added a new column to `system.clients`: `client_options frozen<map<text, text>>`. This column stores all options sent by the client in the `STARTUP` message. This PR also added `CLIENT_OPTIONS` to the list of values sent in `SUPPORTED` message, and documented that drivers can send their configuration (as JSON) in `STARTUP` under this key. Documentation for the new column was not added to the description of `system.clients` table, and documentation about the new `STARTUP` key was added in `protocol-extensions.md`, but in the section about shard awareness extension. This PR adds missing `system.clients` column description, moves the documentation of `CLIENT_OPTIONS` into its own section, and expands it a bit. Backport: none, because this fixes internal documentation. Closes scylladb/scylladb#28126 * github.com:scylladb/scylladb: protocol-extensions.md: Fix client_options docs system_keyspace.md: Add client_options column system_keyspace.md: Fix order in system.clients	2026-02-20 14:23:34 +02:00
Pavel Emelyanov	525cb5b3eb	table: Use fmt::to_string() to stringify compaction group ID Doing it with format("{}", foo) is correct, but to_string is a bit more lightweight. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28630	2026-02-20 14:13:15 +02:00
Patryk Jędrzejczak	d399a197f5	Merge 'raft: Await instead of returning future in wait_for_state_change' from Dawid Mędrek The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling. As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem. Fixes SCYLLADB-665 Backport: This is a small improvement and may help when debugging, so let's backport it to all supported versions. Closes scylladb/scylladb#28624 * https://github.com/scylladb/scylladb: test: raft: Add test_aborting_wait_for_state_change raft: Describe exception types for wait_for_state_change and wait_for_leader raft: Await instead of returning future in wait_for_state_change	2026-02-20 12:17:22 +01:00
Andrzej Jackowski	eb5a564df2	test: move dtest/guardrails_test.py to test_guardrails.py This commit moves `guardrails_test.py`, prepared in the previous commit of this patch series, to `test/cluster/test_guardrails.py`. It also cleans up `suite.yaml`.	2026-02-20 11:39:52 +01:00
Andrzej Jackowski	9df426d2ae	test: prepare guardrails_test.py to be moved to test/cluster/ Disable `test/cluster/dtest/guardrails_test.py` in `suite.yaml` and make it compatible with the `test/cluster/` framework. This will allow moving this file from `test/cluster/dtest/` to `test/cluster/` in the next commit of this patch series. There are two motivations for moving the test: - Execution time reduction (from 12s to 9s in 'dev' in my env) - Facilitate adding new tests to the `guardrails_test.py` file	2026-02-20 11:39:43 +01:00
Raphael S. Carvalho	f33f324f77	mutation_compactor: Fix tombstone GC metrics to account for only expired There are 3 metrics (that go in every compaction_history entry): total_tombstone_purge_attempt total_tombstone_purge_failure_due_to_overlapping_with_memtable total_tombstone_purge_failure_due_to_overlapping_with_uncompacting_sstable When a tombstone is not expired (e.g. doesn't satisfy "gc_before" or grace period), it can currently be accounted as a failure due to overlapping with either memtable or uncompacting sstable. So those last 2 metrics carry noise from unexpired tombstones. What we should do is to only account for expired tombstones in all those 3 metrics. We lose the info of knowing the amount of tombstones processed by compaction, now we'll only know about the expired ones. But those metrics were primarily added for explaining why expired tombstones cannot be removed. We could have alternatively added a new field purge_failure_due_to_being_unexpired or something, but it requires adding a new field to compaction_history. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-737. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#28669	2026-02-20 10:43:58 +02:00
Botond Dénes	0bf4c68af5	Merge 'docs: fix link to docker build readme in the README.MD' from Marcin Szopa Links were pointing to the `debian` subdirectory. However, the docker build was refactored to use `redhat` in `1abf981a73`, see https://github.com/scylladb/scylladb/pull/22910 No backport, just README link fixes. Closes scylladb/scylladb#28699 * github.com:scylladb/scylladb: docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory docs: fix link to docker build README.MD	2026-02-20 08:21:46 +02:00
Botond Dénes	51a25c8af3	test/boost/batchlog_manager_test: add tests for v1 batchlog The v1 table is used while upgrading from a pre-v2 version. We need tests to ensure it still works.	2026-02-20 07:03:46 +02:00
Botond Dénes	83344dacbd	test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2 Make the actual table name a parameter and add logic to adapt to the variant used. Also add dump_to_log::yes to is_rows() invocation to help debugging tests.	2026-02-20 07:03:46 +02:00
Botond Dénes	2956714e19	test/boost/batchlog_manager_test: fix indentation	2026-02-20 07:03:46 +02:00
Botond Dénes	23732227fe	test/boost/batchlog_manager_test: extract prepare_batches() method To be shared between multiple tests in future commits. Indentation is left broken.	2026-02-20 07:03:46 +02:00
Botond Dénes	af26956bb4	test/lib/cql_assertions: is_rows(): add dump parameter When set to true, the query results will be logged by the testlog logger with debug level. A huge help when debugging failures around cql assertions: seeing the actual query result is often enough to immediately understand why the test failed.	2026-02-20 07:03:46 +02:00
Botond Dénes	48e9b3d668	tools/scylla-sstable: extract query result printers To cql3/query_result_printer.hh. Allowing for other users, outside of tools.	2026-02-20 07:03:46 +02:00
Botond Dénes	978627c4e1	tools/scylla-sstable: add std::ostream& arg to query result printers Make them more general-purpose, in preparation to extracting them to their own header.	2026-02-20 07:03:46 +02:00
Botond Dénes	0549b61d55	repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log Provides visibility into whether batchlog replay was successful or not.	2026-02-20 07:03:46 +02:00
Botond Dénes	dd50bd9bd4	db/batchlog_manager: re-add v1 support system.batchlog will still have to be used while the cluster is upgrading from an older version, which doesn't know v2 yet. Re-add support for replaying v1 batchlogs. The switch to v2 will happen after the BATCHLOG_V2 cluster feature is enabled. The only external user -- storage_proxy -- only needs a minor adjustment: switch between the table names. The rest is handled transparently by the db/batchlog.hh interface and the batchlog_manager.	2026-02-20 07:03:46 +02:00
Botond Dénes	8ffa3d32c0	db/batchlog_manager: return all_replayed from process_batch() process_batch() currently returns stop_iteration::no from all control paths. This is not useful. Return the all_replayed output param instead. This requires making the batch() lambda a coroutine, but considering the amount of work process_batch() does (send multiple writes), this should be inconsequential.	2026-02-20 07:03:46 +02:00
Botond Dénes	091b43f54b	db/batchlog_manager: process_batch() fix indentation	2026-02-20 07:03:46 +02:00
Botond Dénes	ef2b8b4819	db/batchlog_manager: make batch() a standalone function Currently it is a huge lambda. Deserves to be a standalone function, to make the replay_all_failed_batches() easier to read and modify.	2026-02-20 07:03:46 +02:00
Botond Dénes	ca2bbbad97	db/batchlog_manager: make structs stats public Need to rename stats() -> get_stats() because it shadows the now exported type name.	2026-02-20 07:03:46 +02:00
Botond Dénes	f8bfaedb6e	db/batchlog_manager: allocate limiter on the stack Now that replay_all_failed_batches() is a coroutine, there is no need to make it a shared pointer anymore.	2026-02-20 07:03:46 +02:00
Botond Dénes	ac059dadc6	db/batchlog_manager: add feature_service dependency Will be needed to check for batchlog_v2 feature.	2026-02-20 07:03:46 +02:00
Botond Dénes	c901ab53d2	gms/feature_service: add batchlog_v2 feature	2026-02-20 07:03:45 +02:00
Avi Kivity	66bef0ed36	lua, tools: adjust for lua 5.5 lua_newstate seed parameter Lua 5.5 adds a seed parameter to lua_newstate(), provide it with a strong random seed. Closes scylladb/scylladb#28734	2026-02-20 06:52:37 +02:00
Avi Kivity	27a5502f14	Merge 'Reapply "main: test: add future and abort_source to after_init_func"' from Marcin Maliszkiewicz The patchset fixes abort_source implementation for perf-alternator and perf-cql-raw. It moves run_standalone function to common code in perf.hh with necessary templating. We also add extensive testing so that it's more difficult to break the tooling in the future. Fixes SCYLLADB-560 Backport: no, internal tooling improvement Closes scylladb/scylladb#28541 * github.com:scylladb/scylladb: test: cluster: add tests for perf tools test: perf: fix port race condition on startup in connect workload test: perf: prepare benchmarks to bind to custom host test: perf: make perf-alternator remote port configurable test: perf: fix ASAN leak warnings in perf-alternator Reapply "main: test: add future and abort_source to after_init_func"	2026-02-19 19:12:46 +02:00
Dawid Mędrek	c9d192c684	Merge 'raft topology: prevent crashes of multiple nodes' from Patryk Jędrzejczak Some assertions in the Raft-based topology are likely to cause crashes of multiple nodes due to the consistent nature of the Raft-based code. If the failing assertion is executed in the code run by each follower (e.g., the code reloading the in-memory topology state machine), then all nodes can crash. If the failing assertion is executed only by the leader (e.g., the topology coordinator fiber), then multiple consecutive group0 leaders will chain-crash until there is no group0 majority. Crashing multiple nodes is much more severe than necessary. It's enough to prevent the topology state machine from making more progress. This will naturally happen after throwing a runtime error. The problematic fiber will be killed or will keep failing in a loop. Note that it should be safe to block the topology state machine, but not the whole group0, as the topology state machine is mostly isolated from the rest of group0. We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT` with `on_internal_error`. These are not all occurrences, as some fatal assertions make sense, for example, in the bootstrap procedure. We also raise an internal error to prevent a segmentation fault in a few places. Fixes #27987 Backporting this PR is not required, but we can consider it at least for 2026.1 because: - it is LTS, - the changes are low-risk, - there shouldn't be many conflicts. Closes scylladb/scylladb#28558 * github.com:scylladb/scylladb: raft topology: prevent accessing nullptr returned by topology::find raft topology: make some assertions non-crashing	2026-02-19 16:50:03 +01:00
Marcin Maliszkiewicz	22c3d8d609	Merge 'db/config: enable table audit by default' from Piotr Smaron In https://github.com/scylladb/scylladb/pull/27262 table audit has been re-enabled by default in `scylla.yaml`, logging certain categories to a table, which should make new Scylla deployments have audit enabled. Now, in the next release, we also want to enable audit in `db/config.cc`, which should enable audit for all deployments, which don't explicitly configure audit otherwise in `scylla.yaml` (or via cmd line). BTW. Because this commit aligns audit's default config values in `db/config.cc` to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which is based on `db/config.cc` will start showing that table audit is the default. Refs: https://github.com/scylladb/scylladb/issues/28355 Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222 No backport: table audit has been enabled in 2026.1 in `scylla.yaml`, and should be always on starting from the next release, which is the release we're currently merging to (2026.2). Closes scylladb/scylladb#28376 * github.com:scylladb/scylladb: docs: decommission: note audit ks may require ALTERing docs: mention table audit enabled by default audit: disable DDL by default db/config: enable table audit by default test/cluster: fix `test_table_desc_read_barrier` assertion test/cluster: adjust audit in tests involving decommissioning its ks audit_test: fix incorrect config in `test_audit_type_none`	2026-02-19 16:30:11 +01:00
Pavel Emelyanov	b4b9b547ce	replica: Remove unused sched groups from keyspace and table configs Compaction and statement groups are carried over on those configs, but are in fact unused. Drop both. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28540	2026-02-19 15:47:31 +01:00
Patryk Jędrzejczak	45115415fb	Merge 'Parametrize and merge several restoration test cases' from Pavel Emelyanov There are four tests that check how restore with primary-replica-only option works in various scopes and topologies. Cases that check same-racks and same-datacenters are very very similar, so are those that check different-racks and different-datacenters. Parametrizing them and merging saves lots of code (+30 lines, -116 lines) It's probably worth merging the resulting same-domain with different-domain tests, because the similarity is still large in both, but the result becomes too if-y, so not done here. Maybe later. Improving tests, not backporting Closes scylladb/scylladb#28569 * https://github.com/scylladb/scylladb: test: Merge test_restore_primary_replica_different_... tests test: Merge test_restore_primary_replica_same_... tests test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all	2026-02-19 15:42:55 +01:00
Pavel Emelyanov	26372e65df	Merge 's3_perf: Fix the s3 perf test' from Ernest Zaslavsky Fix the build of the test and the upload operation flow No need to backport since it is only a test we barely use Closes scylladb/scylladb#28595 * github.com:scylladb/scylladb: s3_perf: fix upload operation flow s3_perf: fix the CMake build	2026-02-19 15:31:43 +02:00
Avi Kivity	7ec710c250	Merge 'tablets: Reduce per-shard migration concurrency to 2' from Tomasz Grabiec Tablet migration keeps sstable snapshot during streaming, which may cause temporary increase in disk utilization if compaction is running concurrently. SSTables compacted away are kept on disk until streaming is done with them. The more tablets we allow to migrate concurrently, the higher disk space usage can rise. When the target tablet size is configured correctly, every tablet should own about 1% of disk space. So concurrency of 4 shouldn't put us at risk. But target tablet size is not chosen dynamically yet, and it may not be aligned with disk capacity. Also, tablet sizes can temporarily grow above the target, up to 2x before the split starts, and some more because splits take a while to complete. To reduce the impact from this, reduce concurrency of migration. Concurrency of 2 should still be enough to saturate resources on the leaving shard. Also, reducing concurrency means that load balancing is more responsive to preemption. There will be less bandwidth sharing, so scheduled migrations complete faster. This is important for scale-out, where we bootstrap a node and want to start migrations to that new node as soon as possible. Refs scylladb/siren#15317 Closes scylladb/scylladb#28563 * github.com:scylladb/scylladb: tablets, config: Reduce migration concurrency to 2 tablets: load_balancer: Always accept migration if the load is 0 config, tablets: Make tablet migration concurrency configurable	2026-02-19 15:31:43 +02:00
Dawid Mędrek	fae71f79c2	test: raft: Add test_aborting_wait_for_state_change	2026-02-19 14:21:01 +01:00
Karol Nowacki	ca7f9a8baf	vector_search: fix TLS server name with IP SNI works only with DNS hostnames. Adding an IP address causes warnings on the server side. This change adds SNI only if it is not an IP address. This change has no unit tests, as this behavior is not critical, since it causes a warning on the server side. The critical part, that the server name is verified, is already covered. Fixes: VECTOR-528	2026-02-19 13:00:03 +01:00
Karol Nowacki	6205aad601	vector_search: add warn log for failed ann requests In order to simplify troubleshooting connection problems, this patch adds an extra warn log that prints the error for the vector search request whenever it fails.	2026-02-19 13:00:03 +01:00
Dawid Mędrek	e4f2b62019	raft: Describe exception types for wait_for_state_change and wait_for_leader The methods of `raft::server` are abortable and if the passed `abort_source` is triggered, they throw `raft::request_aborted`. We document that. Although `raft::server` is an interface, this is consistent with the descriptions of its other methods.	2026-02-19 12:47:14 +01:00
Dawid Mędrek	c36623baad	raft: Await instead of returning future in wait_for_state_change The `try-catch` expression is pretty much useless in its current form. If we return the future, the awaiting will only be performed by the caller, completely circumventing the exception handling. As a result, instead of handling `raft::request_aborted` with a proper error message, the user will face `seastar::abort_requested_exception` whose message is cryptic at best. It doesn't even point to the root of the problem. Fixes SCYLLADB-665	2026-02-19 12:47:14 +01:00
Marcin Maliszkiewicz	de4e5e10af	test: perf: fix prepared statements logic in perf-simple-query Due to lack of checks present in process_execute_internal from transport/server.cc needs_authorization bool was always set to true doing some extra work (check_access()) for each request. We mirror the logic in this patch in test env which perf-simple-query uses. This can also potentially improve runtime of unittests (marginally). Note that bug is only in perf tool not scylla itself, the fix decreases insns/op by around 10%: Before: 41065 insns/op After: 37452 insns/op Command: ./build/release/scylla perf-simple-query --duration 5 --smp 1 Fixes https://github.com/scylladb/scylladb/issues/27941 Closes scylladb/scylladb#28704	2026-02-19 12:42:07 +02:00
Avi Kivity	58a662b9db	dist: refresh container base image (ubi9-minimal) Using an outdated image can cause problems when `microdnf update` runs, if the distribution doesn't maintain good update hygiene. Although, I suspect that when update failures happen they're really caused by propagation delay of packages to mirrors. Fix by using --pull=always to get a fresh image. Ref https://scylladb.atlassian.net/browse/SCYLLADB-714 Closes scylladb/scylladb#28680	2026-02-19 12:42:43 +03:00
Ferenc Szili	f1bc17bd4c	load_stats: fix race condition when computing sum_tablet_sizes In storage_service::load_stats_for_tablet_based_tables(), we are passing a reference to sum_tablet_sizes to the lambda which increments this value on each shard via map_reduce0(). This means we could have a race condition because this is executed on separate threads/CPUs. This patch fixed the problem by collecting the sums by shard into a vector, then summing those up. Refs: SCYLLADB-678 Closes scylladb/scylladb#28703	2026-02-19 12:29:25 +03:00
Avi Kivity	dee868b71a	interval: avoid clang 23 warning on throw statement in potentially noexcept function interval_data's move constructor is conditionally noexcept. It contains a throw statement for the case that the underlying type's move constructor can throw; that throw statement is never executed if we're in the noexcept branch. Clang 23 however doesn't understand that, and warns about throwing in a noexcept function. Fix that by rewriting the logic using seastar::defer(). In the noexcept case, the optimizer should eliminate it as dead code. Closes scylladb/scylladb#28710	2026-02-19 12:24:20 +03:00
Ernest Zaslavsky	45d824e0fe	s3_perf: fix upload operation flow Correct the upload operation logic. The previous flow incorrectly checked for the test file on S3 even when performing operations that do not download the file, such as uploads.	2026-02-19 11:14:59 +02:00
Botond Dénes	b637e17b19	db/config: don't use RBNO for scaling Remove bootstrap and decommission from allowed_repair_based_node_ops. Using RBNO over streaming for these operations has no benefits, as they are not exposed to the out-of-date replica problem that replace, removenode and rebuild are. On top of that, RBNO is known to have problems with empty user tables. Using streaming for bootstrap and decommission is safe and faster than RBNO in all conditions, especially when the table is small. One test needs adjustment as it relies on RBNO being used for all node ops. Fixes: SCYLLADB-105 Closes scylladb/scylladb#28080	2026-02-19 09:51:09 +01:00
Calle Wilund	8e71a6f52a	gcp: Add handling of 429 (too many requests) to exponential backoff Fixes: SCYLLADB-611 Adds http error code 429 to codes handled by exponential backoff. Closes scylladb/scylladb#28588	2026-02-19 09:42:39 +01:00
Marcin Maliszkiewicz	3417d50add	test: cluster: add tests for perf tools It checks if all workloads can be properly executed with successful startup and teardown. Especially testing alternator in remote mode is important because it's invoked like this during pgo training in pgo.py. Test runtime: Release - 24s Debug - 1m 15s Test time consists mostly of Scylla startup in various modes.	2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz	c69534504c	test: perf: fix port race condition on startup in connect workload Other workloads at startup call prepopulate() which connects with retry loop therefore it waits until cql port is open. This commit adds a single place where we will wait for port for all workloads. Timeout is set to 5 minutes so that even slowest machines are able to start.	2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz	828f2fbdb1	test: perf: prepare benchmarks to bind to custom host This is useful for tests where we use local networks like 127.5.5.5 to avoid port and host collisions.	2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz	9f2b97bef4	test: perf: make perf-alternator remote port configurable It could be a useful option to have.	2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz	f5a212e91e	test: perf: fix ASAN leak warnings in perf-alternator Those were intentional as test process is short lived but when we add automated tests in the following commits we expect clean exit, with 0 exit code.	2026-02-19 09:33:10 +01:00
Marcin Maliszkiewicz	0c76c73e34	Reapply "main: test: add future and abort_source to after_init_func" This reverts commit `ceec703bb7`. The commit was fixed with abort source handling for alternator standalone path so it's safe to reapply.	2026-02-19 09:33:10 +01:00
Piotr Dulikowski	7d6f734a51	dictionary compression: add missing co_awaits on get_units There is a handful of places in the code related to dictionary compression that call get_units to acquire semaphore units but the returned future is not awaited, seemingly by mistake. The result of get_units is assigned to a variable - which is reasonable at a glance because the semaphore units need to be assigned to a variable in order to control their scope - but at the same time if co_await is mistakenly omitted, like here, doing so will silence the nodiscard check of seastar::future and, effectively, the get_units call will be nearly useless. Unfortunately, this is an easy mistake to make. Fix the places in the code that acquire semaphore units via get_units but never await the future returned by it. I found them by manual code inspection, so I hope that I didn't miss any. Closes scylladb/scylladb#28581	2026-02-18 16:40:40 +01:00
Ernest Zaslavsky	4026b54a5e	s3_perf: fix the CMake build Fix the CMake build of the perf_s3_client by adding the necessary linkage with the jsoncpp library.	2026-02-18 17:12:08 +02:00
Piotr Smaron	797c5cd401	docs: decommission: note audit ks may require ALTERing With audit feature enabled, it's not immediately obvious that its pseudo-system keyspace `audit` may require adjusting its RF across DCs before decommissioning a node, and this should be documented.	2026-02-18 15:14:57 +01:00
Piotr Smaron	65eec6d8e7	docs: mention table audit enabled by default Also align the documentation with the current audit settings.	2026-02-18 15:14:57 +01:00
Piotr Smaron	c30607d80b	audit: disable DDL by default DDL audit category doesn't make sense if it's enabled by default on its own, as no DDL statements are going to be audited if audit_keyspaces/audit_tables setting is empty. This may be counter-intuitive to our users, who may expect to actually see these statements logged if we're enabling this by default. Also, it doesn't make sense to enable a setting by default if it has no effect. Additionally, listed all possible audit categories for user's convenience.	2026-02-18 15:14:57 +01:00
Piotr Smaron	08dc1008ba	db/config: enable table audit by default In https://github.com/scylladb/scylladb/pull/27262 table audit has been re-enabled by default in `scylla.yaml`, logging certain categories to a table, which should make new Scylla deployments have audit enabled. Now, in the next release, we also want to enable audit in `db/config.cc`, which should enable audit for all deployments, which don't explicitly configure audit otherwise in `scylla.yaml` (or via cmd line). BTW. Because this commit aligns audit's default config values in `db/config.cc` to those of `scylla.yaml`, `docs/reference/configuration-parameters.rst`, which is based on `db/config.cc` will start showing that table audit is the default. Refs: https://github.com/scylladb/scylladb/issues/28355 Refs: https://scylladb.atlassian.net/browse/SCYLLADB-222	2026-02-18 15:14:57 +01:00
Piotr Smaron	95ee4a562c	test/cluster: fix `test_table_desc_read_barrier` assertion The test assertion `desc_schema[0] == desc_schema[1]` does a direct list comparison, which is order-sensitive. Before enabling audit by default, both nodes would return only the test keyspace/table, so the order didn't matter. With audit enabled, there will be multiple keyspaces, and they can be returned in different order by different nodes.	2026-02-18 15:14:57 +01:00
Piotr Smaron	2e12b83366	test/cluster: adjust audit in tests involving decommissioning its ks When table audit is enabled, Scylla creates the "audit" ks with NetworkTopologyStrategy and RF=3. During node decommission, streaming can fail for the audit ks with "zero replica after the removal" when all nodes from a DC are removed, and so we have to ALTER audit ks to either zero the number of its replicas, to allow for a clear decommission, or have them in the 2nd DC. BTW. https://github.com/scylladb/scylladb/issues/27395 is the same change, but in dtests repository.	2026-02-18 15:14:55 +01:00
Piotr Smaron	0cf20fa15a	audit_test: fix incorrect config in `test_audit_type_none` Passing Python `None` to setup is incorrect, because config updates are sent as a dict and `None` is treated as "unset" - meaning: use Scylla's default. Using the explicit string "none" to guarantee that audit is disabled.	2026-02-18 15:12:26 +01:00
Asias He	1be80c9e86	repair: Skip auto repair for tables using RF one There is no point running repair for tables using RF one. Row level repair will skip it but the auto repair scheduler will keep scheduling such repairs since repair_time could not be updated. Skip such repairs at the scheduler level for auto repair. If the request is issued by user, we will have to schedule such repair otherwise the user request will never be finished. Fixes SCYLLADB-561 Closes scylladb/scylladb#28640	2026-02-18 14:32:50 +02:00
Andrzej Jackowski	4221d9bbfd	docs: improve examples in `Handling Audit Failures` section This commit introduces four changes: - In the `table` example, singular forms (node, partition) are changed to plural forms (nodes, partitions). Currently, the default `table` audit configuration is RF=3 and writes use CL=ONE. Therefore, a `table` audit log write failure should not be caused by a single node unavailability, and plural forms are more adequate. - In the `table` example, unreachability due to network issues is mentioned because with RF=3, audit failure due to network problems is more likely to happen than a simultaneous failure of three nodes (such network failures happened in SCYLLADB-706). - In the `syslog` example, a slash `/` is changed to `or`, so `table` and `syslog` examples have similar structure. - As the `syslog` line is already being changed, I also change `unix` to `Unix`, as the capitalized form is the correct one. Refs SCYLLADB-706 Closes scylladb/scylladb#28702	2026-02-18 13:10:01 +01:00
Botond Dénes	3bfd47da4b	Merge 'transport: fix connection code to consume only initially taken semaphore units' from Marcin Maliszkiewicz The connection's `cpu_concurrency_t` struct tracks the state of a connection to manage the admission of new requests and prevent CPU overload during connection storms. When a connection holds units (allowed only 0 or 1), it is considered to be in the "CPU state" and contributes to the concurrency limits used when accepting new connections. The bug stems from the fact that `counted_data_source_impl::get` and `counted_data_sink_impl::put` calls can interleave during execution. This occurs because of `should_parallelize` and `_ready_to_respond`, the latter being a future chain that can run in the background while requests are being read. Consequently, while reading request (N), the system may concurrently be writing the response for request (N-1) on the same connection. This interleaving allows `return_all()` to be called twice before the subsequent `consume_units()` is invoked. While the second `return_all()` call correctly returns 0 units, the matching `consume_units()` call would mistakenly take an extra unit from the semaphore. Over time, a connection blocked on a read operation could end up holding an unreturned semaphore unit. If this pattern repeats across multiple connections, the semaphore units are eventually depleted, preventing the server from accepting any new connections. The fix ensures that we always consume the exact number of units that were previously returned. With this change, interleaved operations behave as follows: get() return_all — returns 1 unit put() return_all — returns 0 units get() consume_units — takes back 1 unit put() consume_units — takes back 0 units Logically, the networking phase ends when the first network operation concludes. But more importantly, when a network operation starts, we no longer hold any units. 
Other solutions are possible but the chosen one seems to be the simplest and safest to backport. Fixes SCYLLADB-485 Backport: all supported affected versions, bug introduced with initial feature implementation in: `ed3e4f33fd` Closes scylladb/scylladb#28530 * github.com:scylladb/scylladb: test: auth_cluster: add test for hanged AUTHENTICATING connections transport: fix connection code to consume only initially taken semaphore units	2026-02-18 13:48:49 +02:00
Marcin Szopa	9217f85e99	docs: fix path to the build_docker.sh which was moved from debian to redhat subdirectory	2026-02-18 12:19:27 +01:00
Marcin Szopa	e66bf4a6f5	docs: fix link to docker build README.MD Link was pointing to the old place of the README. It was moved in the `1abf981a73`	2026-02-18 12:12:46 +01:00
Piotr Dulikowski	b9db3c9c75	Merge 'Add consistent permissions cache' from Marcin Maliszkiewicz This patchset replaces permissions cache based on loading_cache with a new unified (permissions and roles), full, coherent auth cache. Reason for the change is that we want to improve scenarios under stress and simplify operation manuals. New cache doesn't require any tweaking. And it behaves particularly better in scenarios with lots of schema entities (e.g. tables) combined with unprepared queries. Old cache can generate few thousands of extra internal tps due to cache refresh. Benchmark of unprepared statements (just to populate the cache) with 1000 tables shows 3k tps of internal reads reduction and 9.1% reduction of median instructions per op. So many tables were used to show resource impact, cache could be filled with other resource types to show the same improvement. Backport: no, it's a new feature. Fixes https://github.com/scylladb/scylladb/issues/7397 Fixes https://github.com/scylladb/scylladb/issues/3693 Fixes https://github.com/scylladb/scylladb/issues/2589 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-147 Closes scylladb/scylladb#28078 * github.com:scylladb/scylladb: test: boost: add auth cache tests auth: add cache size metrics docs: conf: update permissions cache documentation auth: remove old permissions cache auth: use unified cache for permissions auth: ldap: add permissions reload to unified cache auth: add permissions cache to auth/cache auth: add service::revoke_all as main entry point auth: explicitly life-extend resource in auth_migration_listener	2026-02-18 12:03:20 +01:00
Tomasz Grabiec	af0b5d0894	Merge 'tablets global barrier: acknowledge barrier_and_drain from all nodes' from Petr Gusev Before this series, the `global_barrier` used during tablet migration did not guarantee that `barrier_and_drain` was acknowledged by tablet replicas. As a result, if a request coordinator was fenced out, stale requests from previous topology versions could still execute on replicas in parallel with new requests from incompatible topology versions. For example, stale requests from `tablet_transition_stage::streaming` could run concurrently with new requests from `tablet_transition_stage::use_new`. This caused several issues, including [#26864](https://github.com/scylladb/scylladb/issues/26864) and [#26375](https://github.com/scylladb/scylladb/issues/26375). This PR fixes the problem in two steps: * Replicas now hold an erm strong pointer while handling RPCs from coordinators. * The tablet barrier is updated to require `barrier_and_drain` acknowledgments from all nodes. A description of alternative solutions and various tradeoffs can be found in [this document](https://docs.google.com/document/d/1tpDtPOsrGaZGBYkdwOKApQv4eMzrBydMM1GaYYmaPgg/edit?pli=1&tab=t.0#heading=h.vidfy0hrz5j7). [A previous attempt on this changes](https://github.com/scylladb/scylladb/pull/27185). Fixes [scylladb/scylladb#26864](https://github.com/scylladb/scylladb/issues/26864) Fixes [scylladb/scylladb#26375](https://github.com/scylladb/scylladb/issues/26375) backport: needs backport to 2025.4 (fixes #26864 for tablets LWT) Closes scylladb/scylladb#27492 * github.com:scylladb/scylladb: tests: extract get_topology_version helper global tablets barrier: require all nodes to ack barrier_and_drain topology_coordinator: pass raft_topology_cmd by value storage_proxy: hold erms in replica handlers token_metadata: improve stale versions diagnostics	2026-02-18 11:45:56 +01:00
Pavel Emelyanov	0c443d5764	gms: Use newer seastar get_host_by_name API The hostent::addr_list is deprecated in favor of address_entry::addr field that contains the very same addresses. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28566	2026-02-18 12:24:35 +02:00
Pavel Emelyanov	5b740afe9a	database: Remove streaming sched group getter All users of it had been updated to get the streaming group elsewhere, so this getter is no longer needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28527	2026-02-18 12:23:35 +02:00
Avi Kivity	c5a1f44731	tools: toolchain: switch from ccache to sccache sccache combines the functions of ccache and distcc, and promises to support C++20 modules in the future. Switch to sccache in anticipation of modules support. The documentation is adjusted since cache will be persistent for sccache without further work. Closes scylladb/scylladb#28524	2026-02-18 12:23:12 +02:00
Botond Dénes	36167a155e	Merge 'Remove map_to_key_value() helpers from API' from Pavel Emelyanov There are some places that get `map<foo, bar>` and return it to the caller as `"key": string(foo), "value": string(bar)` json. For that there's `map_to_key_value()` helper in api.hh that re-formats the map into a vector of json elements and returns it, letting seastar json-ize that vector. Recently in seastar there appeared stream_range_as_array() helper that helps streaming any range without converting it into intermediate collection. Some of the hottest users of `map_to_key_value()` had been converted, this PR converts few remainders and removes the helper in question to encourage further usage of the stream_range_as_array(). Code cleanup, not backporting Closes scylladb/scylladb#28491 * github.com:scylladb/scylladb: api: Remove map_to_key_value() helpers api: Streamify view_build_statuses handler api: Streamify few more storage_service/ handlers api: Add map_to_json() helper api: Coroutinize view_build_statuses handler	2026-02-18 12:22:00 +02:00
Ernest Zaslavsky	196f7cad93	nodetool: fix handling of "--primary-replica-only" argument The "--primary-replica-only" ("-pro") flag was previously ignored by the `restore` operation. This patch ensures the argument is parsed and applied correctly. Closes scylladb/scylladb#28490	2026-02-18 12:21:27 +02:00
Pavel Emelyanov	bce43c6b20	api: Remove unused (lost) local variable Lost when the get_range_to_endpoint_map handler was implemented for real (`48c3c94aa6`) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28489	2026-02-18 12:20:30 +02:00
Ernest Zaslavsky	afac984632	s3_client: reorganize tests in part_size_calculation_test just group all BOOST_REQUIRE_EXCEPTION tests in one block and remove artificial scopes	2026-02-18 12:12:04 +02:00
Ernest Zaslavsky	1a20877afe	s3_client: switch using s3 limits constants in tests instead of using magic numbers, switch using s3 limit constants to make it clearer what and why is tested	2026-02-18 12:12:04 +02:00
Ernest Zaslavsky	d763bdabc2	s3_client: fix the s3::range max object size in s3::Range class start using s3 global constant for two reasons: 1) uniformity, no need to introduce semantically same constant in each class 2) the value was wrong	2026-02-18 12:12:04 +02:00
Ernest Zaslavsky	24e70b30c8	s3_client: remove "aws" prefix from object limits constants remove "aws" prefix from object limits constants since it is irrelevant and unnecessary when sitting under s3 namespace	2026-02-18 12:12:04 +02:00
Ernest Zaslavsky	329c156600	s3_client: make s3 object limits accessible make s3 limits constants publicly accessible to reuse it later	2026-02-18 12:12:04 +02:00
Alex	c44ad31d44	db/view: gate detached view-builder callbacks during shutdown Detached migration callbacks (on_create_view, on_update_view, on_drop_view) can race with view_builder::drain() teardown. Add a lifetime gate to view_builder and wire callback launches through _ops_gate.hold() so each detached dispatch future is tracked until it completes (finally keeps the hold alive). During shutdown, drain() now waits for all tracked callback work with _ops_gate.close(). This ensures drain does not proceed past callback lifetime while shutdown is in progress, and ignores only gate_closed_exception at callback entry as the expected shutdown path.	2026-02-18 11:56:41 +02:00
Pavel Emelyanov	b01adf643c	Merge 'init: fix infinite loop on npos wrap with updated Seastar' from Emil Maskovsky Fixes parsing of comma-separated seed lists in "init.cc" and "cql_test_env.cc" to use the standard `split_comma_separated_list` utility, avoiding manual `npos` arithmetic. The previous code relied on `npos` being `uint32_t(-1)`, which would not overflow in `uint64_t` target and exit the loop as expected. With Seastar's upcoming change to make `npos` `size_t(-1)`, this would wrap around to zero and cause an infinite loop. Switch to `split_comma_separated_list` standardized way of tokenization that is also used in other places in the code. Empty tokens are handled as before. This prevents startup hangs and test failures when Seastar is updated. The other commit also removes the unnecessary creation of temporary `gms::inet_address()` objects when calling `std::set<gms::inet_address>::emplace()`. Refs: https://github.com/scylladb/seastar/pull/3236 No backport: The problem will only appear in master after the Seastar will be upgraded. The old code works with the Seastar before https://github.com/scylladb/seastar/pull/3236 (although by accident because of different integer bitsizes). Closes scylladb/scylladb#28573 * github.com:scylladb/scylladb: init: fix infinite loop on npos wrap with updated Seastar init: remove unnecessary object creation in emplace calls	2026-02-18 11:46:26 +03:00
Aleksandra Martyniuk	100ccd61f8	tasks: increase tasks_vt_get_children timeout test_node_ops_tasks.py::test_get_children fails due to timeout of tasks_vt_get_children injection in debug mode. Compared to a successful run, no clear root cause stands out. Extend the message timeout of tasks_vt_get_children from 10s to 60s. Fixes: #28295. Closes scylladb/scylladb#28599	2026-02-18 11:39:19 +03:00
Dani Tweig	aac0f57836	.github/workflows: add SMI to milestone sync Jira project keys What changed Updated .github/workflows/call_sync_milestone_to_jira.yml to include SMI in jira_project_keys Why (Requirements Summary) Adding SMI to create releases in the SMI Jira project based on new milestones from scylladb.git. This will create a new release in the SMI Jira project when a milestone is added to scylladb.git. Fixes:PM-190 Closes scylladb/scylladb#28585	2026-02-18 09:35:37 +02:00
Nadav Har'El	a1475dbeb9	test/cqlpy: make test testMapWithLargePartition faster Right now the slowest test in the test/cqlpy directory is cassandra_tests/validation/entities/collections_test.py:: testMapWithLargePartition This test (translated from Cassandra's unit test) just wants to verify that we can write and flush a partition with a single large map - with 200 items totalling around 2MB in size. 200 items totalling 2MB is large, but not huge, and is not the reason why this test was so slow (around 9 seconds). It turns out that most of the test time was spent in Python code, preparing a 2MB random string the slowest possible way. But there is no need for this string to be random at all - we only care about the large size of the value, not the specific characters in it! Making the characters written in this text constant instead of random made it 20 times faster - it now takes less than half a second. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28271	2026-02-18 10:12:16 +03:00
Raphael S. Carvalho	5b550e94a6	streaming: Release space incrementally during file streaming File streaming only releases the file descriptors of a tablet being streamed at the very end of streaming. This means that if a compaction on the largest tier of the streaming tablet finishes after streaming started, there will always be ~2x space amplification for that single tablet. Since there can be up to 4 tablets being migrated away, this can add up to a significant amount, because nodes are pushed to a substantial usage of available space (~90%). We want to optimize this by dropping the reference to an sstable after it has been fully streamed. This way, we reduce the chances of hitting 2x space amplification for a given tablet. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#28505	2026-02-18 10:10:40 +03:00
Avi Kivity	f3cbd76d93	build: install cassandra-stress RPM with no signature check Fedora 45 tightened the default installation checks [1]. As a result the cassandra-stress rpm we provide no longer installs. Install it with --no-gpgchecks as a workaround. It's our own package so we trust it. Later we'll sign it properly. We install its dependencies via the normal methods so they're still checked. [1] https://fedoraproject.org/wiki/Changes/Enforcing_signature_checking_by_default Closes scylladb/scylladb#28687	2026-02-18 10:08:13 +03:00
Pavel Emelyanov	89d8ae5cb6	Merge 'http: prepare http clients retry machinery refactoring' from Ernest Zaslavsky Today the S3 client has a well-established and (hopefully) well-tested http request retry strategy, while in the rest of the clients it looks like we are trying to achieve the same by writing the same code over and over again and, of course, missing corner cases that have already been addressed in the S3 client. This PR aims to extract the code that could help other clients detect the retryability of an error originating from the http client, reuse the built-in seastar http client retryability, and minimize the boilerplate of http client exception handling. No backport needed since it is only a refactoring of the existing code Closes scylladb/scylladb#28250 * github.com:scylladb/scylladb: exceptions: add helper to build a chain of error handlers http: extract error classification code aws_error: extract `retryable` from aws_error	2026-02-18 10:06:37 +03:00
Pavel Emelyanov	2f10fd93be	Merge 's3_client: Fix s3 part size and number of parts calculation' from Ernest Zaslavsky - Correct `calc_part_size` function since it could return more than 10k parts - Add tests - Add more checks in `calc_part_size` to comply with S3 limits Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640 Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters Closes scylladb/scylladb#28592 * github.com:scylladb/scylladb: s3_client: add more constrains to the calc_part_size s3_client: add tests for calc_part_size s3_client: correct multipart part-size logic to respect 10k limit	2026-02-18 10:04:53 +03:00
Tomasz Grabiec	d33d38139f	test_tablets_parallel_decommission: Fix flakiness due to delayed task appearance Currently, the test assumes that when 'topology_coordinator_pause_before_processing_backlog: waiting' is logged, the task for decommission must be there. This was based on the assumption that topology coordinator is idle and decommission request wakes it up. But if the server is slow enough, it may still be running the load balancer in reaction to table creation, and block on that injection point before decommission request was added. Fix by waiting for the task to appear rather than the injection. Fixes SCYLLADB-715	2026-02-18 01:02:50 +01:00
Tomasz Grabiec	2454de4f8f	test: cluster: task_manager_client: Introduce wait_task_appears()	2026-02-18 01:02:44 +01:00
Tomasz Grabiec	e14eca46af	tests: pylib: util: Add exponential backoff to wait_for Allows balancing the trade-off between fast execution in case the condition is satisfied quickly and not adding load when it's not.	2026-02-18 01:02:19 +01:00
Szymon Malewski	668d6fe019	vector: Improve similarity functions performance Improves performance of deserialization of vector data for calculating similarity functions. Instead of deserializing vector data into a std::vector<data_value>, we deserialize directly into a std::vector<float> and then pass it to similarity functions as a std::span<const float>. This avoids overhead of data_value allocations and conversions. Example QPS of `SELECT id, similarity_cosine({vector<float, 1536>}, {vector<float, 1536>}) ...`: client concurrency 1: before: ~135 QPS, after: ~1005 QPS client concurrency 20: before: ~280 QPS, after: ~2097 QPS Measured using https://github.com/zilliztech/VectorDBBench (modified to call above query without ANN search) Fixes https://scylladb.atlassian.net/browse/SCYLLADB-471 Closes scylladb/scylladb#28615	2026-02-18 00:33:34 +02:00
Calle Wilund	ab4e4a8ac7	commitlog: Always abort replenish queue on loop exit Fixes #28678 If replenish loop exits the sleep condition, with an empty queue, when "_shutdown" is already set, a waiter might get stuck, unsignalled waiting for segments, even though we are exiting. Simply move queue abort to always be done on loop exit. Closes scylladb/scylladb#28679	2026-02-17 23:46:47 +02:00
Emil Maskovsky	6b98f44485	init: fix infinite loop on npos wrap with updated Seastar Fixes parsing of comma-separated seed lists in "init.cc" and "cql_test_env.cc" to use the standard `split_comma_separated_list` utility, avoiding manual `npos` arithmetic. The previous code relied on `npos` being `uint32_t(-1)`, which would not overflow in `uint64_t` target and exit the loop as expected. With Seastar's upcoming change to make `npos` `size_t(-1)`, this would wrap around to zero and cause an infinite loop. Switch to `split_comma_separated_list` standardized way of tokenization that is also used in other places in the code. Empty tokens are handled as before. This prevents startup hangs and test failures when Seastar is updated. Refs: scylladb/seastar#3236	2026-02-17 17:57:13 +00:00
Emil Maskovsky	bda0fc9d93	init: remove unnecessary object creation in emplace calls Simplifies code by directly passing constructor arguments to emplace, avoiding redundant temporary gms::inet_address() object creation. Improves clarity and potentially performance in affected areas.	2026-02-17 17:57:12 +00:00
Marcin Maliszkiewicz	741969cf4c	test: boost: add auth cache tests The cache is covered already with general auth dtests but some cases are more tricky and easier to express directly as calls to cache class. For such tests boost test file was added.	2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz	c11eb73a59	auth: add cache size metrics	2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz	a059798de9	docs: conf: update permissions cache documentation	2026-02-17 18:18:40 +01:00
Marcin Maliszkiewicz	a23e503e7b	auth: remove old permissions cache	2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz	9d9184e5b7	auth: use unified cache for permissions	2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz	7eedf50c12	auth: ldap: add permissions reload to unified cache The LDAP server may change role-chain assignments without notifying Scylla. As a result, effective permissions can change, so some form of polling is required. Currently, this is handled via cache expiration. However, the unified cache is designed to be consistent and does not support expiration. To provide an equivalent mechanism for LDAP, we will periodically reload the permissions portion of the new cache at intervals matching the previously configured expiration time.	2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz	10996bd0fb	auth: add permissions cache to auth/cache We want to get rid of the loading cache because its periodic refresh logic generates a lot of internal load when there are many entries. Also, our operation procedures involve tweaking the config, while the new unified cache is supposed to work out of the box.	2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz	03c4e4bb10	auth: add service::revoke_all as main entry point In the following commit we'll need to add some cache related logic (removing resource permissions). This logic doesn't depend on authorizer so it should be managed by the service itself.	2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz	070d0bfc4c	auth: explicitly life-extend resource in auth_migration_listener Otherwise it's easy to trigger use-after-free when code slightly changes.	2026-02-17 17:56:27 +01:00
Marcin Maliszkiewicz	3b98451776	test: auth_cluster: add test for hanged AUTHENTICATING connections Test runtime: Release - 2s Debug - 5s	2026-02-17 17:55:48 +01:00
Marcin Maliszkiewicz	0376d16ad3	transport: fix connection code to consume only initially taken semaphore units The connection's cpu_concurrency_t struct tracks the state of a connection to manage the admission of new requests and prevent CPU overload during connection storms. When a connection holds units (allowed only 0 or 1), it is considered to be in the "CPU state" and contributes to the concurrency limits used when accepting new connections. The bug stems from the fact that `counted_data_source_impl::get` and `counted_data_sink_impl::put` calls can interleave during execution. This occurs because of `should_parallelize` and `_ready_to_respond`, the latter being a future chain that can run in the background while requests are being read. Consequently, while reading request (N), the system may concurrently be writing the response for request (N-1) on the same connection. This interleaving allows `return_all()` to be called twice before the subsequent `consume_units()` is invoked. While the second `return_all()` call correctly returns 0 units, the matching `consume_units()` call would mistakenly take an extra unit from the semaphore. Over time, a connection blocked on a read operation could end up holding an unreturned semaphore unit. If this pattern repeats across multiple connections, the semaphore units are eventually depleted, preventing the server from accepting any new connections. The fix ensures that we always consume the exact number of units that were previously returned. With this change, interleaved operations behave as follows: get() return_all — returns 1 unit put() return_all — returns 0 units get() consume_units — takes back 1 unit put() consume_units — takes back 0 units Logically, the networking phase ends when the first network operation concludes. But more importantly, when a network operation starts, we no longer hold any units. Other solutions are possible but the chosen one seems to be the simplest and safest to backport. 
Fixes SCYLLADB-485	2026-02-17 17:55:48 +01:00
Dani Tweig	5dc06647e9	.github: add workflow to auto-close issues from ScyllaDB associates Added .github/workflows/close_issue_for_scylla_employee.yml workflow file to automatically close issues opened by ScyllaDB associates We want to allow external users to open issues in the scylladb repo, but for ScyllaDB associates, we would like them to open issues in Jira instead. If a ScyllaDB associate opens an issue in the scylladb.git repo by mistake, the issue will be closed automatically with an appropriate comment explaining that the issue should be opened in Jira. This is a new github action, and does not require any code backport. Fixes: PM-64 Closes scylladb/scylladb#28212	2026-02-17 17:18:32 +02:00
Dani Tweig	bb8a2c3a26	.github/workflow/:Add milestone sync to Jira based on GitHub Action What changed Added new workflow file .github/workflows/call_jira_sync_pr_milestone.yml Why (Requirements Summary) Adds a GitHub Action that will be triggered when a milestone is set or removed from a PR When milestone is added (milestoned event), calls main_jira_sync_pr_milestone_set.yml from github-automation.git, which will add the version to the 'Fix Versions' field in the relevant linked Jira issue When milestone is removed (demilestoned event), calls main_jira_sync_pr_milestone_removed.yml from github-automation.git, which will remove the version from the 'Fix Versions' field in the relevant linked Jira issue Testing was performed in staging.git and the STAG Jira project. Fixes:PM-177 Closes scylladb/scylladb#28575	2026-02-17 16:41:03 +02:00
Botond Dénes	2e087882fa	Merge 'GCS object storage. Fix incompatibilty issues with "real" GCS' from Calle Wilund Fixes #28398 Fixes #28399 When used as path elements in google storage paths, the object names need to be URL encoded. Due to a.) tests not really using prefixes including non-url valid chars (i.e. / etc) and b.) the mock server used for most testing not enforcing this particular aspect, this was missed. Modified unit tests to use prefixing for all names, so when running real GS, any errors like this will show. "Real" GCS also behaves a bit different when listing with pager, compared to mock; The former will not give a pager token for last page, only penultimate. Adds handling for this. Needs backport to the releases that have (though might not really use) the feature, as it is technically possible to use google storage for backup and whatnot there, and it should work as expected. Closes scylladb/scylladb#28400 * github.com:scylladb/scylladb: utils/gcp/object_storage: URL-encode object names in URL:s utils::gcp::object_storage: Fix list object pager end condition detection	2026-02-17 16:40:02 +02:00
Andrei Chekun	1b5789cd63	test.py: refactor manager fixture The current manager flow has a flaw. It triggers pytest.fail when it finds errors on teardown, regardless of whether the test has already failed. This creates an additional record in the JUnit report with the same name, and Jenkins is not able to show the logs correctly. To avoid this, this PR changes the logic slightly. Now the manager checks whether the test failed, to avoid two failures for the same test in the report. If the test passed, the manager checks the cluster status and fails the test if something is wrong with it. There is no need to check the cluster status if the test failed. If the test passed and the cluster status is OK, but there are unexpected errors in the logs, the test fails as well. This check gathers all information about the errors and potential stacktraces, and only fails the test if it has not failed yet, to avoid a double entry in the report. Closes scylladb/scylladb#28633	2026-02-17 14:35:18 +01:00
Dawid Mędrek	5b5222d72f	Merge 'test: make test_different_group0_ids work with the Raft-based topology' from Patryk Jędrzejczak The test was marked with xfail in #28383, as it needed to be updated to work with the Raft-based topology. We are doing that in this patch. With the Raft-based topology, there is no reason to check that nodes with different group0 IDs cannot merge their topology/token_metadata. That is clearly impossible, as doing any topology change requires being in the same group0. So, the original regression test doesn't make sense. We can still test that nodes with different group0 IDs cannot gossip with each other, so we keep the test. It's very fast anyway. No backport, test update. Closes scylladb/scylladb#28571 * github.com:scylladb/scylladb: test: run test_different_group0_ids in all modes test: make test_different_group0_ids work with the Raft-based topology	2026-02-17 13:56:41 +01:00
Dawid Mędrek	1b80f6982b	Merge 'test: make the load balancer simulator tablet size aware' from Ferenc Szili Currently, the load balancing simulator computes node, shard and tablet load based on tablet count. This patch changes the load balancing simulator to be tablet size aware. It generates random tablet sizes with a normal distribution, and a mean value of `default_target_tablet_size`, and reports the computed load for nodes and tables based on tablet size sum, instead of tablet count. This is the last patch in the size based load balancing series. It is the last PR in the Size Based Load Balancing series: - First part for tablet size collection via load_stats: scylladb/scylladb#26035 - Second part reconcile load_stats: scylladb/scylladb#26152 - The third part for load_sketch changes: scylladb/scylladb#26153 - The fourth part which performs tablet load balancing based on tablet size: scylladb/scylladb#26254 - The fifth part changes the load balancing simulator: scylladb/scylladb#26438 This is a new feature and backport is not needed. Closes scylladb/scylladb#26438 * github.com:scylladb/scylladb: test, simulator: compute load based on tablet size instead of count test, simulator: generate tablet sizes and update load_stats test, simulator: postpone creation of load_stats_ptr	2026-02-17 13:29:37 +01:00
Avi Kivity	ffde2414e8	cql3: grammar: remove special case for vector similarity functions in selectors In `b03d520aff` ("cql3: introduce similarity functions syntax") we added vector similarity functions to the grammar. The grammar had to be modified because we wanted to support literals as vector similarity function arguments, and the general function syntax in selectors did not allow that. In `cc03f5c89d` ("cql3: support literals and bind variables in selectors") we extended the selector function call grammar to allow literals as function arguments. Here, we remove the special case for vector similarity functions as the general case in function calls covers all the possibilities the special case does. As a side effect, the vector similarity function names are no longer reserved. Note: the grammar change fixes an inconsistency with how the vector similarity functions were evaluated: typically, when a USE statement is in effect, an unqualified function is first matched against functions in the keyspace, and only if there is no match is the system keyspace checked. But with the previous implementation vector similarity functions ignored the USE keyspace and always matched only the system keyspace. This small inconsistency doesn't matter in practice because user defined functions are still experimental, and no one would name a UDF to conflict with a system function, but it is still good to fix it. Closes scylladb/scylladb#28481	2026-02-17 12:40:21 +01:00
Ernest Zaslavsky	30699ed84b	api: report restore params report restore params once the API's call for restore is invoked Closes scylladb/scylladb#28431	2026-02-17 14:27:21 +03:00
Andrei Chekun	767789304e	test.py: improve C++ fail summary in pytest Currently, if a test fails, pytest outputs only some basic information about the failure. With this change, it outputs the last 300 lines of the boost/seastar test output. Also capture the output of failed tests in the JUnit report, so it is present in the report on Jenkins. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-449 Closes scylladb/scylladb#28535	2026-02-17 14:25:28 +03:00
Pavel Emelyanov	6d4af84846	Merge 'test: increase open file limit for sstable tests' from Avi Kivity In `ebda2fd4db` ("test: cql_test_env: increase file descriptor limit"), we raised the open file limit for cql_test_env. Here, we raise it for sstables::test_env as well, to fix a couple of twcs resharding tests failing outside dbuild. These tests open 256 sstables, and with 2 files/sstable + resharding work it is understandable that they overflow the 1024 limit. No backport: this is a quality of life improvement for developers running outside dbuild, but they can use dbuild for branches. Closes scylladb/scylladb#28646 * github.com:scylladb/scylladb: test: sstables::test_env: adjust file open limit test: extract cql_test_env's adjust_rlimit() for reuse	2026-02-17 14:19:43 +03:00
Avi Kivity	41925083dc	test: minio: tune sync setting Disable O_DSYNC in minio to avoid unnecessary slowdown in S3 tests. Closes scylladb/scylladb#28579	2026-02-17 14:19:27 +03:00
Avi Kivity	f03491b589	Update seastar submodule * seastar f55dc7eb...d2953d2a (13): > io_tester: Revive IO bandwidth configuration > Merge 'io_tester: add vectorized I/O support' from Travis Downs doc: add vectorized I/O options to io-tester.md io_tester: add vectorized I/O support > Merge 'Remove global scheduling group ID bitmap' from Pavel Emelyanov reactor: Drop sched group IDs bitmap reactor: Allocate scheduling group on shard-0 first reactor: Detach init_scheduling_group_specific_data() reactor: Coroutinize create_scheduling_group() > set_iterator: increase compatibility with C++ ranges > test: fix race condition in test_connection_statistics > Add Claude Code project instructions > reactor: Unfriend pollable_fd via pollable_fd_state::make() > Merge 'rpc_tester: introduce rpc_streaming job based on streaming API' from Jakub Czyszczoń apps: rpc_tester: Add STREAM_UNIDIRECTIONAL job We introduce an unidirectional streaming to the rpc_streaming job. apps: rpc_tester: Add STREAM_BIDIRECTIONAL job This commit extends the rpc_tester with rpc_streaming job that uses rpc::sink<> and rpc::source<> to stream data between the client and the server. > treewide: remove remnants of SEASTAR_MODULE > test: Tune abort-accept test to use more readable async() > build: support sccache as a compiler cache (#3205) > posix-stack: Reuse parent class _reuseport from child > Merge 'reactor_backend: Fix another busy spin bug in the epoll backend' from Stephan Dollberg tests: Add unit test for epoll busy spin bug reactor_backend: Fix another busy spin bug in epoll Closes scylladb/scylladb#28513	2026-02-17 13:13:22 +02:00
Jakub Smolar	189b056605	scylla_gdb: use run_ctx to handle Scylla exe and remove pexpect Previous implementation of Scylla lifecycle brought flakiness to the test. This change leaves lifecycle management up to PythonTest.run_ctx, which implements more stability logic for setup/teardown. Replace pexpect-driven GDB interaction with GDB batch mode: - Avoids DeprecationWarning: "This process is multi-threaded, use of forkpty() may lead to deadlocks in the child.", which ultimately caused CI deadlocks. - Removes timeout-driven flakiness on slow systems - no interactive waits/timeouts. - Produces cleaner, more direct assertions around command execution and output. - Trade-off: batch mode adds ~10s per command per test, but with --dist=worksteal this is ~10% overall runtime increase across the suite. Closes scylladb/scylladb#28484	2026-02-17 11:36:20 +01:00
Łukasz Paszkowski	f45465b9f6	test_out_of_space_prevention.py: Lower the critical disk utilization threshold After PR https://github.com/scylladb/scylladb/pull/28396 reduced the test volumes to 20MiB to speed up test_out_of_space_prevention.py, keeping the original 0.8 critical disk utilization threshold can make the tests flaky: transient disk usage (e.g. commitlog segment churn) can push the node into ENOSPC during the run. These tests do not write much data, so reduce the critical disk utilization threshold to 0.5. With 20MiB volumes this leaves ~10MiB of headroom for temporary growth during the test. Fixes: https://github.com/scylladb/scylladb/issues/28463 Closes scylladb/scylladb#28593	2026-02-16 15:10:18 +02:00
Andrei Chekun	e26cf0b2d6	test/cluster: fix two flaky tests test_maintenance_socket is flaky with the new way of running. It looks like the driver tries to reconnect with an old maintenance socket from the previous driver and fails. This PR adds a whitelist for connections that stabilizes the test. test_no_removed_node_event_on_ip_change was flaky on CI, while the issue never reproduced locally. The assumption is that under load there is a race condition: the test checks the logs before the message has arrived. A small retry loop is added to avoid this situation. Closes scylladb/scylladb#28635	2026-02-16 14:50:54 +02:00
Patryk Jędrzejczak	0693091aff	test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart The test can currently fail like this: ``` > await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}") E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">}) ``` The following happens: - node A is restarted and becomes the group0 leader, - the driver sends the ALTER TABLE request to node B, - the request hits group 0 concurrent modification error 10 times and fails because node A performs tablet migrations at the same time. What is unexpected is that even though the driver session uses the default retry policy, the driver doesn't retry the request on node A. The request is guaranteed to succeed on node A because it's the only node adding group0 entries. The driver doesn't retry the request on node A because of a missing `wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect the driver just in case to prevent hitting scylladb/python-driver#295. Moreover, we can revert the workaround from `4c9efc08d8`, as the fix from this commit also prevents DROP KEYSPACE failures. The commit has been tested in byo with `_concurrent_ddl_retries{0}` to verify that node A really can't hit group 0 concurrent modification error and always receives the ALTER TABLE request from the driver. All 300 runs in each build mode passed. Fixes #25938 Closes scylladb/scylladb#28632	2026-02-16 12:56:18 +01:00
Marcin Maliszkiewicz	6a4aef28ae	Merge 'test: explicitly set compression algorithm in test_autoretrain_dict' from Andrzej Jackowski When `test_autoretrain_dict` was originally written, the default `sstable_compression_user_table_options` was `LZ4Compressor`. The test assumed (correctly) that initially the compression doesn't use a trained dictionary, and later in the test scenario, it changed the algorithm to one with a dictionary. However, the default `sstable_compression_user_table_options` is now `LZ4WithDictsCompressor`, so the old assumption is no longer correct. As a result, the assertion that data is initially not compressed well may or may not fail depending on dictionary training timing. To fix this, this commit explicitly sets `ZstdCompressor` as the initial `sstable_compression_user_table_options`, ensuring that the assumption that initial compression is without a dictionary is always met. Note: `ZstdCompressor` differs from the former default `LZ4Compressor`. However, it's a better choice — the test aims to show the benefit of using a dictionary, not the benefit of Zstd over LZ4 (and the test uses ZstdWithDictsCompressor as the algorithm with the dictionary). Fixes: https://github.com/scylladb/scylladb/issues/28204 Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent). Closes scylladb/scylladb#28625 * github.com:scylladb/scylladb: test: explicitly set compression algorithm in test_autoretrain_dict test: remove unneeded semicolons from python test	2026-02-16 11:38:24 +01:00
Ernest Zaslavsky	034c6fbd87	s3_client: limit multipart upload concurrency Prevent launching hundreds or thousands of fibers during multipart uploads by capping concurrent part submissions to 16. Closes scylladb/scylladb#28554	2026-02-16 13:32:58 +03:00
Botond Dénes	9f57d6285b	Merge 'test: improve error reporting and retries in get_scylla_2025_1_executable' from Marcin Maliszkiewicz Harden get_scylla_2025_1_executable() by improving error reporting when subprocesses fail, increasing curl's retry count for more resilient downloads, and enabling --retry-all-errors to retry on all failures. Fixes https://github.com/scylladb/scylladb/issues/27745 Backport: no, it's not a bug fix Closes scylladb/scylladb#28628 * github.com:scylladb/scylladb: test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call test: pylib: increase curl's number of retries when downloading scylla test: pylib: improve error reporting in get_scylla_2025_1_executable	2026-02-16 10:09:17 +02:00
Petr Gusev	c785d242a7	tests: extract get_topology_version helper This is a refactoring commit. We need to load the cluster version for a host in several places, so extract a helper for this.	2026-02-16 08:57:42 +01:00
Petr Gusev	ffe3262e8d	global tablets barrier: require all nodes to ack barrier_and_drain Previously, global_tablet_token_metadata_barrier() could proceed with fencing even if some nodes did not acknowledge the barrier_and_drain. This could cause problems: * In scylladb/scylladb#26864, replica locks did not provide mutual exclusion, because “fenced out” requests from old topology versions could run in parallel with requests using newer versions. * In scylladb/scylladb#26375, the barrier could succeed even though we did not wait for closed sessions to become unused. This could leave aborted repair or streaming tasks running concurrently after a tablet transition was aborted, and thus running concurrently with the next transition. In this commit we add a parameter drain_all_nodes: bool to the global_token_metadata_barrier function. If this parameter is set, the barrier waits for all nodes to acknowledge the barrier_and_drain round of RPCs. If any of the nodes are not accessible or throw an error, such errors are rethrown to the caller. We set this parameter only in global_tablet_token_metadata_barrier since for topology migrations the old behavior should be preserved. In case of errors, the tablet migration is blocked until the problem goes away by itself or the problematic node is added to the ignore_nodes list. The test_fenced_out_on_tablet_migration_while_handling_paxos_verb is removed: with tablets, we now drain all nodes, so after a successful barrier_and_drain round there can be no coordinators with an old topology version. The fence_token check after executing a request on a replica is therefore unnecessary for tablets, but still required for vnodes, where topology changes do not wait for all nodes. Topology fencing is covered by test_fence_lwt_during_bootstrap. Fixes scylladb/scylladb#26864 Fixes scylladb/scylladb#26375	2026-02-16 08:57:42 +01:00
Petr Gusev	06f88b43e5	topology_coordinator: pass raft_topology_cmd by value It's just a single enum. Passing by reference risks use-after-free if a temporary command is created on the stack in a non-coroutine function.	2026-02-16 08:57:42 +01:00
Petr Gusev	df73f723a6	storage_proxy: hold erms in replica handlers Add explicit erm-holding variables in all replica-side RPC handlers. This is required to ensure that tablet migration waits for in-flight replica requests even if a non-replica coordinator has been fenced out. Holding erms on the replica side may increase the global-barrier wait time, since the barrier must drain these requests. We believe this is acceptable because: * We already hold erms during replica-side request execution, but in an ad-hoc, non-systemic way in lower layers of storage_proxy (e.g. in sp::mutate_locally and do_query_tablets). * Replica requests are bounded by replica-side timeouts, so the global-barrier wait time cannot exceed the maximum of these timeouts. For Paxos verbs, we use token_metadata_guard, which wraps the ERM and automatically refreshes it when tablet migration does not affect the current token; see the token_metadata_guard comments for details. We use this guard only for Paxos verbs because regular reads and writes already hold raw erms in storage_proxy and on the coordinators. The erms must be held in all RPC handlers that support fencing — that is, those with a fencing_token parameter in storage_proxy.idl. Counter updates already hold erms in mutate_counter_on_leader_and_replicate. Fix test_tablets2::test_timed_out_reader_after_cleanup: the tablets barrier now waits for all nodes. As a result, the replica read is expected to finish, rather than fail due to the tablet having moved as it did previously. The test is renamed to test_tablets_barrier_waits_for_replica_erms to better reflect its purpose. Refs scylladb/scylladb#26864	2026-02-16 08:57:42 +01:00
Petr Gusev	e39f4b399c	token_metadata: improve stale versions diagnostics Before waiting on stale_versions_in_use(), we log the stale versions the barrier_and_drain handler will wait for, along with the number of token_metadata references representing each version. To achieve this, we store a pointer to token_metadata in version_tracker, traverse the _trackers list, and output all items with a version smaller than the latest. Since token_metadata contains the version_tracker instance, it is guaranteed to remain alive during traversal. To count references, token_metadata now inherits from enable_lw_shared_from_this. This helps diagnose tablet migration stalls and allows more deterministic tests: when a barrier is expected to block, we can verify that the log contains the expected stale versions rather than checking that the barrier_and_drain is blocked on stale_versions_in_use() for a fixed amount of time.	2026-02-16 08:57:42 +01:00
Andrei Chekun	8c5c1096c2	test: ensure that the table used in cqlpy/test_tools has at least 3 pk One of the tests checks that the number of PKs is more than 2, but the method that creates the table can return a table with fewer keys. This leads to flakiness; to avoid it, this PR ensures that the table will have at least 3 PKs. Closes scylladb/scylladb#28636	2026-02-16 09:50:58 +02:00
Anna Mikhlin	33cf97d688	.github/workflows: ignore quoted comments for trigger CI prevent CI from being triggered when trigger-ci command appears inside quoted (>) comment text Fixes: https://scylladb.atlassian.net/browse/RELENG-271 Closes scylladb/scylladb#28604	2026-02-16 09:33:16 +02:00
Andrei Chekun	e144d5b0bb	test.py: fix JUnit double test case records Move the hook for overwriting the XML reporter to be the first, to avoid double records. Closes scylladb/scylladb#28627	2026-02-15 19:02:24 +02:00
Alex	75e25493c1	db:view: refactor on_update_view to use coroutine dispatcher on_update_view() currently runs its serialized logic inline via with_semaphore() from a detached callback path, while create/drop already use dedicated async dispatchers. Refactor update handling to follow the same pattern: - add dispatch_update_view(sstring ks_name, sstring view_name) - move update logic into that coroutine - acquire the existing view-builder lock via get_or_adopt_view_builder_lock() - keep existing behavior for missing base/view state - keep background invocation semantics from on_update_view() This aligns the update/create/drop flow, keeps async lifecycle handling, and is a first step toward fixing the shutdown issue.	2026-02-15 18:50:32 +02:00
Avi Kivity	a365e2deaa	test: sstables::test_env: adjust file open limit The twcs compaction tests open more than 1024 files (not so good), and will fail in a user session with the default soft limit (1024). Attempt to raise the limit so the tests pass. On a modern systemd installation the hard limit is >500,000, so this will work. There's no problem in dbuild since it raises the file limit globally.	2026-02-15 14:27:37 +02:00
Avi Kivity	bab3afab88	test: extract cql_test_env's adjust_rlimit() for reuse The sstable-oriented sstable::test_env would also like to use it, so extract it into a neutral place.	2026-02-15 14:26:46 +02:00
Jenkins Promoter	69249671a7	Update pgo profiles - aarch64	2026-02-15 05:22:17 +02:00
Jenkins Promoter	27aaafb8aa	Update pgo profiles - x86_64	2026-02-15 04:26:36 +02:00
Piotr Dulikowski	9c1e310b0d	Merge 'vector_search: Fix flaky vector_store_client_https_rewrite_ca_cert' from Karol Nowacki Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused the test case to fail because the ANN request duration exceeded the test case timeout. The PR introduces two changes: * Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites simultaneously with ANN requests that utilize those certificates. * Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout. Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write operation, potentially bypassing connect timeout. Fixes: #28012 Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test. Closes scylladb/scylladb#28617 * github.com:scylladb/scylladb: vector_search: Fix missing timeout on TLS handshake vector_search: test: Fix flaky cert rewrite test	2026-02-13 19:03:50 +01:00
Taras Veretilnyk	f140ab0332	sstables: extract default write open flags into a constant Extract the commonly used `open_flags::wo \| open_flags::create \| open_flags::exclusive` into a reusable constant `sstable_write_open_flags` to reduce duplication.	2026-02-13 14:27:01 +01:00
Taras Veretilnyk	c8281b7b8b	sstables: Add write_simple_with_digest for component checksumming Introduce new methods to write SSTable components while calculating and returning their CRC32 checksums. This adds: - make_digests_component_file_writer(): creates a crc32_digest_file_writer for component writing with checksum tracking - write_simple_with_digest() and do_write_simple_with_digest(): write components and return the full checksum value	2026-02-13 14:27:01 +01:00
Taras Veretilnyk	1bf934c77c	sstables: Extract file writer closing logic into separate methods Refactor the consume_end_of_stream() method by extracting the inline file writer closing logic into dedicated methods: - close_index_writer() - close_partitions_writer() - close_rows_writer()	2026-02-13 14:27:01 +01:00
Taras Veretilnyk	dec5e48666	sstables: Implement CRC32 digest-only writer Introduce a template parameter to the checksummed file writer to support digest-only calculation without storing chunk checksums. This will be needed in the future to calculate the digest of other components.	2026-02-13 14:27:00 +01:00
Patryk Jędrzejczak	aebc108b1b	test: run test_different_group0_ids in all modes CI currently fails in release and debug modes if the PR only changes a test run only in dev mode. There is no reason to wait for the CI fix, as there is no reason to run this test only in dev mode in the first place. The test is very fast.	2026-02-13 13:30:29 +01:00
Patryk Jędrzejczak	59746ea035	test: make test_different_group0_ids work with the Raft-based topology The test was marked with xfail in #28383, as it needed to be updated to work with the Raft-based topology. We are doing that in this patch. With the Raft-based topology, there is no reason to check that nodes with different group0 IDs cannot merge their topology/token_metadata. That is clearly impossible, as doing any topology change requires being in the same group0. So, the original regression test doesn't make sense. We can still test that nodes with different group0 IDs cannot gossip with each other, so we keep the test. It's very fast anyway.	2026-02-13 13:30:28 +01:00
Marcin Maliszkiewicz	1b0a68d1de	test: pylib: retry on all errors in get_scylla_2025_1_executable curl's call It's difficult to say if our download backend would always return transient error correctly so that the curl could retry. Instead it's more robust to always retry on error.	2026-02-12 16:18:52 +01:00
Marcin Maliszkiewicz	8ca834d4a4	test: pylib: increase curl's number of retries when downloading scylla By default curl does exponential backoff, and we want to keep that, but there is a time cap of 10 minutes, so with 40 retries we'd wait a long time; instead we set the cap to 60 seconds. Total waiting time (excluding receiving request time): before - 17m after - 35m	2026-02-12 16:18:52 +01:00
Marcin Maliszkiewicz	70366168aa	test: pylib: improve error reporting in get_scylla_2025_1_executable Curl or other tools this function calls will now log error in the place they fail instead of doing plain assert.	2026-02-12 16:18:52 +01:00
Andrzej Jackowski	9ffa62a986	test: explicitly set compression algorithm in test_autoretrain_dict When `test_autoretrain_dict` was originally written, the default `sstable_compression_user_table_options` was `LZ4Compressor`. The test assumed (correctly) that initially the compression doesn't use a trained dictionary, and later in the test scenario, it changed the algorithm to one with a dictionary. However, the default `sstable_compression_user_table_options` is now `LZ4WithDictsCompressor`, so the old assumption is no longer correct. As a result, the assertion that data is initially not compressed well may or may not fail depending on dictionary training timing. To fix this, this commit explicitly sets `ZstdCompressor` as the initial `sstable_compression_user_table_options`, ensuring that the assumption that initial compression is without a dictionary is always met. Note: `ZstdCompressor` differs from the former default `LZ4Compressor`. However, it's a better choice — the test aims to show the benefit of using a dictionary, not the benefit of Zstd over LZ4 (and the test uses ZstdWithDictsCompressor as the algorithm with the dictionary). Fixes: scylladb/scylladb#28204	2026-02-12 14:58:39 +01:00
Andrzej Jackowski	e63cfc38b3	test: remove unneeded semicolons from python test	2026-02-12 14:49:17 +01:00
Patryk Jędrzejczak	8e9c7397c5	raft topology: prevent accessing nullptr returned by topology::find It's better to raise an internal error than cause a segmentation fault on possibly multiple nodes.	2026-02-12 13:10:04 +01:00
Patryk Jędrzejczak	e21ecf69de	raft topology: make some assertions non-crashing Some assertions in the Raft-based topology are likely to cause crashes of multiple nodes due to the consistent nature of the Raft-based code. If the failing assertion is executed in the code run by each follower (e.g., the code reloading the in-memory topology state machine), then all nodes can crash. If the failing assertion is executed only by the leader (e.g., the topology coordinator fiber), then multiple consecutive group0 leaders will chain-crash until there is no group0 majority. Crashing multiple nodes is much more severe than necessary. It's enough to prevent the topology state machine from making more progress. This will naturally happen after throwing a runtime error. The problematic fiber will be killed or will keep failing in a loop. Note that it should be safe to block the topology state machine, but not the whole group0, as the topology state machine is mostly isolated from the rest of group0. We replace some occurrences of `on_fatal_internal_error` and `SCYLLA_ASSERT` with `on_internal_error`. These are not all occurrences, as some fatal assertions make sense, for example, in the bootstrap procedure.	2026-02-12 13:10:03 +01:00
Ferenc Szili	d7cfaf3f84	test, simulator: compute load based on tablet size instead of count This patch changes the load balancing simulator so that it computes table load based on tablet sizes instead of tablet count. best_shard_overcommit measured minimal allowed overcommit in cases where the number of tablets can not be evenly distributed across all the available shards. This is still the case, but instead of computing it as an integer div_ceil() of the average shard load, it is now computed by allocating the tablet sizes using the largest-tablet-first method. From these, we can get the lowest overcommit for the given set of nodes, shards and tablet sizes.	2026-02-12 12:54:55 +01:00
Ferenc Szili	216443c050	test, simulator: generate tablet sizes and update load_stats This change adds a random tablet size generator. The tablet sizes are created in load_stats. Further changes to the load balance simulator: - apply_plan() updates the load_stats after a migration plan is issued by the load balancer, - adds the option to set a command line option which controls the tablet size deviation factor.	2026-02-12 12:54:55 +01:00
Ferenc Szili	e31870a02d	test, simulator: postpone creation of load_stats_ptr With size based load balancing, we will have to move the tablet size in load_stats after each internode migration issued by balance_tablets(). This will be done in a subsequent commit in apply_plan() which is called from rebalance_tablets(). Currently, rebalance_tablets() is passed a load_stats_ptr which is defined as: using load_stats_ptr = lw_shared_ptr<const load_stats>; Because this is a pointer to const, apply_plan() can't modify it. So, we pass a reference to load_stats to rebalance_tablets() and create a load_stats_ptr from it for each call to balance_tablets().	2026-02-12 12:54:55 +01:00
Aleksandra Martyniuk	f955a90309	test: fix test_remove_node_violating_rf_rack_with_rack_list test_remove_node_violating_rf_rack_with_rack_list creates a cluster with four nodes. One of the nodes is excluded, then another one is stopped, excluded, and removed. If the two stopped nodes were both voters, the majority is lost and the cluster loses its raft leader. As a result, the node cannot be removed and the operation times out. Add the 5th node to the cluster. This way the majority is always up. Fixes: https://github.com/scylladb/scylladb/issues/28596. Closes scylladb/scylladb#28610	2026-02-12 12:58:48 +02:00
Ferenc Szili	4ca40929ef	test: add read barrier to test_balance_empty_tablets The test creates a single node cluster, then creates 3 tables which remain empty. Then it adds another node with half the disk capacity of the first one, and then it waits for the balancer to migrate tablets to the newly added node by calling the quiesce topology API. The number of tablets on the smaller node should be exactly half the number of tablets on the larger node. After waiting for quiesce topology, we could have a situation where we query the number of tablets from the node which still hasn't processed the last tablet migrations and updated system.tablets. This patch adds a read barrier so that both nodes see the same tablets metadata before we query the number of tablets. Fixes: SCYLLADB-603 Closes scylladb/scylladb#28598	2026-02-12 11:16:34 +02:00
Karol Nowacki	079fe17e8b	vector_search: Fix missing timeout on TLS handshake Currently the TLS handshake in the vector search client does not have a timeout. This is because tls::connect does not perform handshake itself; the handshake is deferred until the first read/write operation is performed. This can lead to long hangs on ANN requests. This commit calls tls::check_session_is_resumed() after tls::connect to force the handshake to happen immediately and to run under with_timeout.	2026-02-12 10:08:37 +01:00
Karol Nowacki	aef5ff7491	vector_search: test: Fix flaky cert rewrite test The test is flaky most likely because when TLS certificate rewrite happens simultaneously with an ANN request, the handshake can hang for a long time (~60s). This leads to a timeout in the test case. This change introduces a checkpoint in the test so that it will wait for the certificate rewrite to happen before sending an ANN request, which should prevent the handshake from hanging and make the test more reliable. Fixes: #28012	2026-02-12 09:58:54 +01:00
Piotr Dulikowski	38c4a14a5b	Merge 'test: cluster: Fix test_sync_point' from Dawid Mędrek The test `test_sync_point` had a few shortcomings that made it flaky or simply wrong: 1. We were verifying that hints were written by checking the size of in-flight hints. However, that could potentially lead to problems in rare situations. For instance, if all of the hints failed to be written to disk, the size of in-flight hints would drop to zero, but creating a sync point would correspond to the empty state. In such a situation, we should fail immediately and indicate what the cause was. 2. A sync point corresponds to the hints that have already been written to disk. The number of those is tracked by the metric `written`. It's a much more reliable way to make sure that hints have been written to the commitlog. That ensures that the sync point we'll create will really correspond to those hints. 3. The auxiliary function `wait_for` used in the test works like this: it executes the passed callback and looks at the result. If it's `None`, it retries it. Otherwise, the callback is deemed to have finished its execution and no further retries will be attempted. Before this commit, we simply returned a bool, and so the code was wrong. We improve it. --- Note that this fixes scylladb/scylladb#28203, which was a manifestation of scylladb/scylladb#25879. We created a sync point that corresponded to the empty state, and so it immediately resolved, even when node 3 was still dead. As a bonus, we rewrite the auxiliary code responsible for fetching metrics and manipulating sync points. Now it's asynchronous and uses the existing standard mechanisms available to developers. Furthermore, we reduce the time needed for executing `test_sync_point` by 27 seconds. 
--- The total difference in time needed to execute the whole test file (on my local machine, in dev mode): Before: CPU utilization: 0.9% real 2m7.811s user 0m25.446s sys 0m16.733s After: CPU utilization: 1.1% real 1m40.288s user 0m25.218s sys 0m16.566s --- Refs scylladb/scylladb#25879 Fixes scylladb/scylladb#28203 Backport: This improves the stability of our CI, so let's backport it to all supported versions. Closes scylladb/scylladb#28602 * github.com:scylladb/scylladb: test: cluster: Reduce wait time in test_sync_point test: cluster: Fix test_sync_point test: cluster: Await sync points asynchronously test: cluster: Create sync points asynchronously test: cluster: Fetch hint metrics asynchronously	2026-02-12 09:34:09 +01:00
Dawid Pawlik	4e32502bb3	test/vector_search: add reproducer for rescoring with zero vectors Add reproducer for the SCYLLADB-456 issue following exception on ANN vector queries with rescoring with similarity cosine.	2026-02-11 13:41:09 +01:00
Dawid Pawlik	af0889d194	vector_search: return NaN for similarity_cosine with all-zero vectors The ANN vector queries with all-zero vectors are allowed even on vector indexes with similarity function set to cosine. When enabling the rescoring option, those queries would fail as the rescoring calls `similarity_cosine` function underneath, causing an `InvalidRequest` exception as all-zero vectors were not allowed matching Cassandra's behaviour. To eliminate the discrepancy we want the all-zero vector `similarity_cosine` calls to pass, but return the NaN as the cosine similarity for zero vectors is mathematically incorrect. We decided not to use arbitrary values contrary to USearch, for which the distance (not to be confused with similarity) is defined as cos(0, 0) = 0, cos(0, x) = 1 while supporting the range of values [0, 2]. If we wanted to convert that to similarity, that would mean sim_cos(0, x) = 0.5, which does not support mathematical reasoning why that would be more similar than for example vectors marking obtuse angles. It's safe to assume that all-zero vectors for cosine similarity shouldn't make any impact, therefore we return NaN and eliminate them from best results. Adjusted the tests accordingly to check both proper Cassandra and Scylla's behaviour. Fixes: SCYLLADB-456	2026-02-11 12:31:47 +01:00
Pavel Emelyanov	2a3a56850c	test: Fix the condition for streaming directions validation Commit `ea8a661119` tried to reduce the dataset for restoration tests. While doing so, it effectively disabled part of itself -- the checks for streaming directions were never run after this change. The thing is that this check only runs if the restored tablet count matches a hardcoded value of 512. This was the real dataset size of the test before the aforementioned commit, but after it the size changed to other values, and the comparison with 512 became always False. Fix it with a local variable to prevent such mistakes in the future. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-11 12:55:27 +03:00
Pavel Emelyanov	f187dceb1a	test: Split test_backup.py::check_data_is_back() into two This method does two things -- checks that the data is indeed back, and validates streaming directions. The latter is not quite about "data is back", so better to have it as explicit dedicated method. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-11 12:54:20 +03:00
Dawid Mędrek	f83f911bae	test: cluster: Reduce wait time in test_sync_point If everything is OK, the sync point will not resolve with node 3 dead. As a result, the waiting will use all of the time we allocate for it, i.e. 30 seconds. That's a lot of time. There's no easy way to verify that the sync point will NOT resolve, but let's at least reduce the waiting to 3 seconds. If there's a bug, it should be enough to trigger it at some point, while reducing the average time needed for CI.	2026-02-10 17:05:02 +01:00
Dawid Mędrek	a256ba7de0	test: cluster: Fix test_sync_point The test had a few shortcomings that made it flaky or simply wrong: 1. We were verifying that hints were written by checking the size of in-flight hints. However, that could potentially lead to problems in rare situations. For instance, if all of the hints failed to be written to disk, the size of in-flight hints would drop to zero, but creating a sync point would correspond to the empty state. In such a situation, we should fail immediately and indicate what the cause was. 2. A sync point corresponds to the hints that have already been written to disk. The number of those is tracked by the metric `written`. It's a much more reliable way to make sure that hints have been written to the commitlog. That ensures that the sync point we'll create will really correspond to those hints. 3. The auxiliary function `wait_for` used in the test works like this: it executes the passed callback and looks at the result. If it's `None`, it retries it. Otherwise, the callback is deemed to have finished its execution and no further retries will be attempted. Before this commit, we simply returned a bool, and so the code was wrong. We improve it. Note that this fixes scylladb/scylladb#28203, which was a manifestation of scylladb/scylladb#25879. We created a sync point that corresponded to the empty state, and so it immediately resolved, even when node 3 was still dead. Refs scylladb/scylladb#25879 Fixes scylladb/scylladb#28203	2026-02-10 17:05:02 +01:00
Dawid Mędrek	c5239edf2a	test: cluster: Await sync points asynchronously There's a dedicated HTTP API for communicating with the cluster, so let's use it instead of yet another custom solution.	2026-02-10 17:05:02 +01:00
Dawid Mędrek	ac4af5f461	test: cluster: Create sync points asynchronously There's a dedicated HTTP API for communicating with the nodes, so let's use it instead of yet another custom solution.	2026-02-10 17:05:01 +01:00
Dawid Mędrek	628e74f157	test: cluster: Fetch hint metrics asynchronously There's a dedicated API for fetching metrics now. Let's use it instead of developing yet another solution that's also worse.	2026-02-10 17:04:59 +01:00
Pavel Emelyanov	875fd03882	test/object_store: Remove create_ks_and_cf() helper Now all test cases use standard facilities to create data they test Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-10 15:59:05 +03:00
Pavel Emelyanov	94176f7477	test/object_store: Replace create_ks_and_cf() usage with standard methods To create a keyspace, there's the new_test_keyspace helper. The table is created with a single cql.run_async with an explicit schema. The dataset is populated with a single parallel INSERT as well. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-10 15:58:05 +03:00
Pavel Emelyanov	6665cda23f	test/object_store: Shift indentation right for test cases This is preparational patch. Next will need to replace foo() bar() with with something() as s: foo() bar() Effectively -- only add the `with something()` line. Not to shift the whole file right together with that future change, do it here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-10 15:56:27 +03:00
Ernest Zaslavsky	960adbb439	s3_client: add more constrains to the calc_part_size Enforce more checks on part size and object size as defined in "Amazon S3 multipart upload limits", see https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html and https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html	2026-02-10 13:15:07 +02:00
Ernest Zaslavsky	6280cb91ca	s3_client: add tests for calc_part_size Introduce tests that validate the corrected multipart part-size calculation, including boundary conditions and error cases.	2026-02-10 13:13:26 +02:00
Ernest Zaslavsky	289e910cec	s3_client: correct multipart part-size logic to respect 10k limit The previous calculation could produce more than 10,000 parts for large uploads because we mixed values in bytes and MiB when determining the part size. This could result in selecting a part size that still exceeded the AWS multipart upload limit. The updated logic now ensures the number of parts never exceeds the allowed maximum. This change also aligns the implementation with the code comment: we prefer a 50 MiB part size because it provides the best performance, and we use it whenever it fits within the 10,000-part limit. If it does not, we increase the part size (in bytes, aligned to MiB) to stay within the limit.	2026-02-10 13:13:25 +02:00
Wojciech Mitros	c5a44b0f88	schema: add with_sharder overload accepting static_sharder reference Add a schema_builder::with_sharder() overload that accepts a const reference to dht::static_sharder. This allows schemas to use custom sharder instances instead of only static sharder configurations. This is needed to support tables that use custom partitioning and sharding strategies, such as the incoming raft metadata tables for strongly consistent tables.	2026-02-10 10:52:00 +01:00
Ernest Zaslavsky	7142b1a08d	exceptions: add helper to build a chain of error handlers Generalize error handling by creating an exception dispatcher, which allows writing error handlers by sequentially applying handlers the same way one would write `catch ()` blocks	2026-02-09 08:48:41 +02:00
Ernest Zaslavsky	7fd62f042e	http: extract error classification code move http client related error classification code to a common location for future reuse	2026-02-09 08:48:41 +02:00
Ernest Zaslavsky	5beb7a2814	aws_error: extract `retryable` from aws_error Move aws::retryable to common location to reuse it later in other http based clients	2026-02-09 08:48:41 +02:00
Pavel Emelyanov	83e64b516a	hint: Don't switch group in database::apply_hint() The method is called from storage_proxy::mutate_hint() which is in turn called from hint_mutation::apply_locally(). The latter is called either directly by the hint sender, which already runs in the streaming group, or via the RPC HINT_MUTATION handler, which uses index 1 that negotiates the streaming group as well. To be sure, add a debugging check for current group being the expected one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-09 08:54:51 +03:00
Pavel Emelyanov	727f1be11c	hint_sender: Switch to sender group on stop either Currently sender only switches group for hints sending on start. It's worth doing the same on stop too for consistency. There's nothing to compete with at this point. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-09 08:54:51 +03:00
Pavel Emelyanov	44715a2d45	api: Remove map_to_key_value() helpers All the callers had already been patched to stream their results directly as json. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-09 08:52:50 +03:00
Pavel Emelyanov	dcbb5cb45b	api: Streamify view_build_statuses handler Similarly to previous patch, the handler can stream the map of build statuses. Unlike previous patch, it doesn't need to fmt::format() key and value, as these are strings already. It could be a map_to_json<string, string> partial specialization, but there's so far only one caller, so probably not worth it yet. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-09 08:52:50 +03:00
Pavel Emelyanov	73512a59ff	api: Streamify few more storage_service/ handlers Like get_token_endpoint one streams the map that it got from storage service, the get_ownership and get_effective_ownership can do the same. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-09 08:52:49 +03:00
Pavel Emelyanov	a4bd9037b3	api: Add map_to_json() helper The get_token_endpoint handler converts iterator of std::map into generated maplist_mapper type. Next patch will do the same for more handlers, so it's good to have a helper converter for it. As a nice side effect, it's possible to avoid multiline lambda argument to stream_range_as_array(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-09 08:52:49 +03:00
Pavel Emelyanov	63cafab56c	api: Coroutinize view_build_statuses handler Further patching will be nicer if this handler is a coroutine Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-09 08:52:49 +03:00
Pawel Pery	81d11a23ce	Revert "Merge 'vector_search: add validator tests' from Pawel Pery" This reverts commit `bcd1758911`, reversing changes made to `b2c2a99741`. There is a design decision to not introduce an additional test orchestration tool for scylladb.git (see comments for #27499). One commit has already been reverted in `55c7bc7`. The latest CI runs showed the validator tests to be flaky, so it is time to remove all remaining validator tests. It needs a backport to 2026.1 to remove the remaining validator tests from there. Fixes: VECTOR-497 Closes scylladb/scylladb#28568	2026-02-08 16:29:58 +02:00
Avi Kivity	bb99bfe815	test: scylla_gdb: tighten check for Error output from gdb When running a gdb command, we check that the string 'Error' does not appear within the output. However, if the command output includes the string 'Error' as part of its normal operation, this generates a false positive. In fact the task_histogram can include the string 'error::Error' from the Rust core::error module. Allow for that and only match 'Error' that isn't 'error::Error'. Fixes #28516. Closes scylladb/scylladb#28574	2026-02-08 09:48:23 +02:00
Pavel Emelyanov	1b1aae8a0d	test: Merge test_restore_primary_replica_different_... tests The difference is very small: @@ -1,18 +1,18 @@ @pytest.mark.asyncio async def test_restore_primary_replica_different_...(manager: ManagerClient, object_storage): ''' Comment ''' - topology = topo(rf = 2, nodes = 2, racks = 2, dcs = 1) - scope = "dc" + topology = topo(rf = 1, nodes = 2, racks = 1, dcs = 2) + scope = "all" ks = 'ks' cf = 'cf' - servers, host_ids = await create_cluster(topology, True, manager, logger, object_storage) + servers, host_ids = await create_cluster(topology, False, manager, logger, object_storage) await manager.disable_tablet_balancing() cql = manager.get_cql() @@ -41,7 +41,6 @@ async def test_restore_primary_replica_d log = await manager.server_open_log(s.server_id) res = await log.grep(r'INFO.sstables_loader - load_and_stream:.target_node=([0-9a-z-]+),.*num_bytes_sent=([0-9]+)') streamed_to = set([ r[1].group(1) for r in res ]) - logger.info(f'{s.ip_addr} {host_ids[s.server_id]} streamed to {streamed_to}') + logger.info(f'{s.ip_addr} {host_ids[s.server_id]} streamed to {streamed_to}, expected {servers}') assert len(streamed_to) == 2 The (removed in the above example) test description comments differ only in their usage of "rack" and "dc" words. Squashing them into one parametrized test makes perfect sense. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-06 14:14:43 +03:00
Pavel Emelyanov	70988f9b61	test: Merge test_restore_primary_replica_same_... tests The difference is very tiny: @@ -1,12 +1,12 @@ @pytest.mark.asyncio async def test_restore_primary_replica_same_...(manager: ManagerClient, object_storage): ''' comment ''' - topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 1) - scope = "rack" + topology = topo(rf = 4, nodes = 8, racks = 2, dcs = 2) + scope = "dc" ks = 'ks' cf = 'cf' @@ -42,7 +42,7 @@ async def test_restore_primary_replica_s for r in res: nodes_by_operation[r[1].group(1)].append(r[1].group(2)) - scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.rack == servers[i].rack ]) + scope_nodes = set([ str(host_ids[s.server_id]) for s in servers if s.datacenter == servers[i].datacenter ]) for op, nodes in nodes_by_operation.items(): logger.info(f'Operation {op} streamed to nodes {nodes}') assert len(nodes) == 1, "Each streaming operation should stream to exactly one primary replica" The (removed in the above example) test description comments differ only in their usage of "rack" and "dc" words. Squashing them into one parametrized test makes perfect sense. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-06 14:12:23 +03:00
Pavel Emelyanov	98b4092153	test: Don't specify expected_replicas in test_restore_primary_replica_different_dc_scope_all If not specified, the call uses the dcs * rf default, which matches this test's parameters perfectly Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-06 14:11:47 +03:00
Pavel Emelyanov	1ac1e90b16	test: Remove local r_servers variable from test_restore_primary_replica_different_dc_scope_all It merely copies the `servers` one, so there is no need for it Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-02-06 14:11:22 +03:00
Anna Stuchlik	dc8f7c9d62	doc: replace the OS Support page with a link to the new location We've moved that page to another place; see https://github.com/scylladb/scylladb/issues/28561. This commit replaces the page with the link to the new location and adds a redirection. Fixes https://github.com/scylladb/scylladb/issues/28561 Closes scylladb/scylladb#28562	2026-02-06 11:38:21 +02:00
Tomasz Grabiec	41930c0176	tablets, config: Reduce migration concurrency to 2 Tablet migration keeps an sstable snapshot during streaming, which may cause a temporary increase in disk utilization if compaction is running concurrently. SSTables compacted away are kept on disk until streaming is done with them. The more tablets we allow to migrate concurrently, the higher disk space usage can rise. When the target tablet size is configured correctly, every tablet should own about 1% of disk space. So concurrency of 4 shouldn't put us at risk. But target tablet size is not chosen dynamically yet, and it may not be aligned with disk capacity. Also, tablet sizes can temporarily grow above the target, up to 2x before the split starts, and some more because splits take a while to complete. To reduce the impact of this, reduce the concurrency of migration. Concurrency of 2 should still be enough to saturate resources on the leaving shard. Also, reducing concurrency means that load balancing is more responsive to preemption. There will be less bandwidth sharing, so scheduled migrations complete faster. This is important for scale-out, where we bootstrap a node and want to start migrations to that new node as soon as possible. Refs scylladb/siren#15317	2026-02-06 00:42:19 +01:00
Tomasz Grabiec	56e40e90c9	tablets: load_balancer: Always accept migration if the load is 0 Different transitions have different weights, and limits are configurable. We don't want a situation where a high-cost migration is cut off by limits and the system can make no progress. For example, repair uses weight 2 for read concurrency. Migrating co-located tablets scales the cost by the number of co-located tablets.	2026-02-06 00:42:18 +01:00
Tomasz Grabiec	39492596c2	config, tablets: Make tablet migration concurrency configurable We're about to reduce it. It's better to not have it hard-coded in case we change our minds again.	2026-02-06 00:42:18 +01:00
Avi Kivity	7a3ce5f91e	test: minio: disable web console minio starts a web console on a random port. This was seen to interfere with the nodetool tests when the web console port clashed with the mock API port. Fix by disabling the web console. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-496 Closes scylladb/scylladb#28492	2026-02-05 20:11:32 +02:00
Nikos Dragazis	5d1e6243af	test/cluster: Remove short_tablet_stats_refresh_interval injection The test `test_size_based_load_balancing.py::test_balance_empty_tablets` waits for tablet load stats to be refreshed and uses the `short_tablet_stats_refresh_interval` injection to speed up the refresh interval. This injection has no effect; it was replaced by the `tablet_load_stats_refresh_interval_in_seconds` config option (patch: `1d6808aec4`), so the test currently waits for 60 seconds (default refresh interval). Use the config option. This reduces the execution time to ~8 seconds. Fixes SCYLLADB-556. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#28536	2026-02-05 20:11:32 +02:00
Pavel Emelyanov	10c278fff7	database: Remove _flush_sg member from replica::database This field is only used to initialize the following _memtable_controller one. It's simpler just to do the initialization with whatever value the field itself is initialized and drop the field itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28539	2026-02-05 13:02:35 +02:00
Petr Hála	a04dbac369	open-coredump: Change to use new backtrace * This is a breaking change, which removes compatibility with the old backtrace - See https://staging.backtrace.scylladb.com/api/docs#/default/search_by_build_id_search_build_id_post for the APIDoc * Add timestamp field to log * Tested locally Closes scylladb/scylladb#28325	2026-02-05 11:50:47 +02:00
Marcin Maliszkiewicz	0753d9fae5	Merge 'test: remove xfail marker from a few passing tests' from Nadav Har'El This patch fixes the few remaining cases of XPASS in test/cqlpy and test/alternator. These are tests which, when written, reproduced a bug and therefore were marked "xfail", but some time later the bug was fixed and we either did not notice it was ever fixed, or just forgot to remove the xfail marker. Removing the no-longer-needed xfail markers is good for test hygiene, but more importantly is needed to avoid regressions in those already-fixed areas (if a test is already marked xfail, it can start to fail in a new way and we wouldn't notice). Backport not needed, xpass doesn't bother anyone. Closes scylladb/scylladb#28441 * github.com:scylladb/scylladb: test/cqlpy: remove xfail from tests for fixed issue 7972 test/cqlpy: remove xfail from tests for fixed issue 10358 test/cqlpy: remove xfail from passing test testInvalidNonFrozenUDTRelation test/alternator: remove xfail from passing test_update_item_increases_metrics_for_new_item_size_only	2026-02-05 10:10:43 +01:00
Marcin Maliszkiewicz	6eca74b7bb	Merge 'More Alternator tests for BatchWriteItem' from Nadav Har'El The goal of this small pull request is to reproduce issue #28439, which found a bug in the Alternator Streams output when BatchWriteItem is called to write multiple items in the same partition, and always_use_lwt write isolation mode is used. * The first patch reproduces this specific bug in Alternator Streams. * The second patch adds missing (Fixes #28171) tests for BatchWriteItem in different write modes, and shows that BatchWriteItem itself works correctly - the bug is just in Alternator Streams' reporting of this write. Closes scylladb/scylladb#28528 * github.com:scylladb/scylladb: test/alternator: add test for BatchWriteItem with different write isolations test/alternator: reproducer for Alternator Streams bug	2026-02-05 10:07:29 +01:00
Yaron Kaikov	b30ecb72d5	ci: fix PR number extraction for unlabeled events When the workflow is triggered by removing the 'conflicts' label (pull_request_target unlabeled event), github.event.issue.number is not available. Use github.event.pull_request.number as fallback. Fixes: https://scylladb.atlassian.net/browse/RELENG-245 Closes scylladb/scylladb#28543	2026-02-05 08:41:43 +02:00
Michał Hudobski	6b9fcc6ca3	auth: add CDC streams and timestamps to vector search permissions It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR. Fixes: SCYLLADB-522 Closes scylladb/scylladb#28519	2026-02-04 09:10:08 +01:00
Nadav Har'El	47e827262f	test/alternator: add test for BatchWriteItem with different write isolations Alternator's various write operations have different code paths for the different write isolation modes. Because most of the test suite runs in only a single write mode (currently - only_rmw_uses_lwt), we already introduced a test file test/alternator/test_write_isolation.py for checking the different write operations in all four write isolation modes. But we missed testing one write operation - BatchWriteItem. This operation isn't very "interesting" because it doesn't support any read-modify-write option (it doesn't support UpdateExpression, ConditionExpression or ReturnValues), but even without those, the pure write code still has different code paths with and without LWT, and should be tested. So we add the missing test here - and it passes. In issue #28439 we discovered a bug that can be seen in Alternator Streams in the case of BatchWriteItem with multiple writes to the same partition and always_use_lwt mode. The fact that the test added here passes shows that the bug is NOT in BatchWriteItem itself, which works correctly in this case - but only in the Alternator Streams layer. Fixes #28171 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-04 09:24:29 +02:00
Nadav Har'El	c63f43975f	test/alternator: reproducer for Alternator Streams bug This patch adds a reproducer for an Alternator Streams bug described in issue #28439, where the stream returns the wrong events (and fewer of them) in the following specific combination of circumstances: 1. A BatchWriteItem operation writing multiple items to the same partition. 2. The "always_use_lwt" write isolation mode is used (the bug doesn't occur in other write isolation modes). We didn't catch this bug earlier because the Alternator Streams test we had for BatchWriteItem had multiple items in multiple partitions, and we missed the multiple-items-in-one-partition case. Moreover, today we run all the tests in only_rmw_uses_lwt mode (in the past, we did use always_use_lwt, but changed recently in commit `e7257b1393` following commit `76a766c` that changed test.py). As issue #28439 explains, the underlying cause of the bug is that the always_use_lwt mode causes the multiple items to be written with the same timestamp, which confused the Alternator Streams code reading the CDC log. The bug is not in BatchWriteItem itself, or in ScyllaDB CDC, but just in the Alternator Streams layer. The test in this patch is parameterized to run on each of the four write isolation modes, and currently fails (and so is marked xfail) just for the one mode 'always_use_lwt'. The test is scylla_only, as its purpose is to check the different write isolation modes - which don't exist in AWS DynamoDB. Refs #28439 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-04 09:17:48 +02:00
Radosław Cybulski	03ff091bee	alternator: improve events output when test failed Improve event printing when a test in test_streams.py fails. The new code prints both expected and received events (keys, previous image, new image and type) and explicitly marks the output event at which the comparison failed. Fixes #28455 Closes scylladb/scylladb#28476	2026-02-03 21:55:07 +02:00
Anna Stuchlik	a427ad3bf9	doc: remove the link to the Open Source blog post Fixes https://github.com/scylladb/scylladb/issues/28486 Closes scylladb/scylladb#28518	2026-02-03 14:15:16 +01:00
Botond Dénes	3adf8b58c4	Merge 'test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0' from Patryk Jędrzejczak The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from ``` co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms)); ``` as the default value of `shutdown_announce_in_ms` is 2000. This sleep makes every `server_stop_gracefully` call 2s slower. There are ~300 such calls in cluster tests (note that some come from `rolling_restart`). So, it looks like this sleep makes cluster tests 300 * 2s = 10min slower. Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min on the potwor machine (the one in the Warsaw office) without it. We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them faster. The sleep is completely unnecessary in tests. Removing it could introduce flakiness, but if that's the case, then the test for which it happens is incorrect in the first place. Tests shouldn't assume that all nodes receive and handle the shutdown message in 2s. They should use functions like `server_not_sees_other_server` instead, which are faster and more reliable. Improvement of the tests running time, so no backport. The fix of `test_tablets_parallel_decommission` may have to be backported to 2026.1, but it can be done manually. Closes scylladb/scylladb#28464 * github.com:scylladb/scylladb: test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0 test: test_tablets_parallel_decommission: prevent group0 majority loss test: delete test_service_levels_work_during_recovery	2026-02-03 08:19:05 +02:00
Pavel Emelyanov	19ea05692c	view_build_worker: Do not switch scheduling groups inside work_on_view_building_tasks The handler appeared back in `c9e710dca3`. In this commit it performed the "core" part of the task -- the do_build_range() method -- inside the streaming sched group. The setup code was seemingly copied from the view_builder::do_build_step() method and got the explicit switch of the scheduling group. The switch looks both justified and not. On one hand, it makes it explicit that the activity runs in the streaming scheduling group. On the other hand, the verb already uses RPC index 1, which is negotiated to run in the streaming group anyway. On the "third hand", even though it is explicit, the switch happens too late, as there are a lot of other activities performed by the handler that seem to belong to the same scheduling group, but which are not switched into it explicitly. By and large, it seems better to avoid the explicit switch and rely on the RPC-level negotiation-based sched group switching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28397	2026-02-03 07:00:32 +02:00
Anna Stuchlik	77480c9d8f	doc: fix the links on the repair-related pages This is a follow-up to https://github.com/scylladb/scylladb/pull/28199. This commit fixes the syntax of the internal links. Fixes https://github.com/scylladb/scylladb/issues/28486 Closes scylladb/scylladb#28487	2026-02-03 06:54:08 +02:00
Botond Dénes	64b38a2d0a	Merge 'Use gossiper scheduling group where needed' from Pavel Emelyanov This is the continuation of #28363 , this time about getting gossiper scheduling group via database. Several places that do it already have gossiper at hand and should better get the group from it. Eventually, this will allow to get rid of database::get_gossip_scheduling_group(). Refining inter-components API, not backporting Closes scylladb/scylladb#28412 * github.com:scylladb/scylladb: gossiper: Export its scheduling group for those who need it migration_manager: Reorder members	2026-02-03 06:51:31 +02:00
Nadav Har'El	48b01e72fa	test/alternator: add test verifying that keys only allow S/B/N type Recently we had a question whether key columns can have any supported type. I knew that actually - they can't, that key columns can have only the types S(tring), B(inary) or N(umber), and that is all. But it turns out we never had a test that confirms this understanding is true. We did have a test for it for GSI key types already, test_gsi.py::test_gsi_invalid_key_types, but we didn't have one for the base table. So in this patch we add this missing test, and confirm that, indeed, both DynamoDB and Alternator refuse a key attribute with any type other than S, B or N. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28479	2026-02-03 06:49:02 +02:00
Andrei Chekun	ed9a96fdb7	test.py: modify logic for adding function_path in JUnit The current logic only checks for failures during the test phase, so it misses failures that happen in another phase. This PR fixes that: every phase now has a modified node reporter that enriches the JUnit XML report with the custom attribute function_path. Closes scylladb/scylladb#28462	2026-02-03 06:42:18 +02:00
Andrei Chekun	3a422e82b4	test.py: fix the file name in test summary The current logic always assumes that the error happened in the test file, but that is not always true. This PR shows the error from the boost logger, where the error actually happened. Closes scylladb/scylladb#28429	2026-02-03 06:38:21 +02:00
Benny Halevy	84caa94340	gossiper: add_expire_time_for_endpoint: replace fmt::localtime with gmtime in log printout 1. fmt::localtime is deprecated. 2. We should really print times in UTC, especially on the cloud. 3. The current log message does not print the timezone so it'd be unclear to anyone reading the log message if the expiration time is in the local timezone or in GMT/UTC. Fixes the following warning: ``` gms/gossiper.cc:2428:28: warning: 'localtime' is deprecated [-Wdeprecated-declarations] 2428 \| endpoint, fmt::localtime(clk::to_time_t(expire_time)), expire_time.time_since_epoch().count(), \| ^ /usr/include/fmt/chrono.h:538:1: note: 'localtime' has been explicitly marked deprecated here 538 \| FMT_DEPRECATED inline auto localtime(std::time_t time) -> std::tm { \| ^ /usr/include/fmt/base.h:207:28: note: expanded from macro 'FMT_DEPRECATED' 207 \| # define FMT_DEPRECATED [[deprecated]] \| ^ ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#28434	2026-02-03 06:36:53 +02:00
Pavel Emelyanov	8c42704c72	storage_service: Check raft rpc scheduling group from debug namespace Some storage_service rpc verbs may check that a handler is executed inside the gossiper scheduling group. For that, the expected group is grabbed from database. This patch puts the gossiper sched group into debug namespace and makes this check use it from there. It removes one more place that uses database as config provider. Refs #28410 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28427	2026-02-03 06:34:03 +02:00
Asias He	b5c3587588	repair: Add request type in the tablet repair log So we can know if the repair is an auto repair or a user repair. Fixes SCYLLADB-395 Closes scylladb/scylladb#28425	2026-02-03 06:26:58 +02:00
Nadav Har'El	a63ad48b0f	test/cqlpy: remove xfail from tests for fixed issue 7972 The test test_to_json_double used to fail due to #7972, but this issue was already fixed in Scylla 5.1 and we didn't notice. So remove the xfail marker from this test, and also update another test which still xfails but no longer due to this issue. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-02 23:49:32 +02:00
Nadav Har'El	10b81c1e97	test/cqlpy: remove xfail from tests for fixed issue 10358 The tests testWithUnsetValues and testFilteringWithoutIndices used to fail due to #10358, but this issue was already fixed three years ago, when the UNSET-checking code was cleaned up, and the test is now passing. So remove the xfail marker from these tests. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-02 23:49:31 +02:00
Nadav Har'El	508bb97089	test/cqlpy: remove xfail from passing test testInvalidNonFrozenUDTRelation The test testInvalidNonFrozenUDTRelation used to fail due to #10632 (an incorrectly-printed column name in an error message) and was marked "xfail". But this issue has already been fixed two years ago, and the test is now passing. So remove the xfail marker.	2026-02-02 23:49:31 +02:00
Nadav Har'El	3682c06157	test/alternator: remove xfail from passing test_update_item_increases_metrics_for_new_item_size_only The test test_metrics.py::test_update_item_increases_metrics_for_new_item_size_only tests whether the Alternator metrics report the exactly-DynamoDB-compatible WCU number. It is parameterized with two cases - one that uses alternator_force_read_before_write and one which doesn't. The case that uses alternator_force_read_before_write is expected to measure the "accurate" WCU, and currently it doesn't, so the test rightly xfails. But the case that doesn't use alternator_force_read_before_write is not expected to measure the "accurate" WCU and has a different expectation, so this case actually passes. But because the entire test is marked xfail, it is reported as "XPASS" - unexpected pass. Fix this by marking only the "True" case with xfail, while the "False" case is not marked. After this patch, the True case continues to XFAIL and the False case passes normally, instead of XPASS. Also removed a sentence promising that the failing case will be solved "by the next PR". Clearly this didn't happen. Maybe we even have such a PR open (?), but it won't be "the next PR" even if merged today. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-02-02 23:49:31 +02:00
Nadav Har'El	df69dbec2a	Merge ' cql3/statements/describe_statement: hide paxos state tables ' from Michał Jadwiszczak Paxos state tables are internal tables fully managed by Scylla, and they should be neither exposed to the user nor backed up. This commit hides such tables from all listings, and if such a table is directly described with `DESC ks."tbl$paxos"`, the description is generated within a comment and a note for the user is added. Fixes https://github.com/scylladb/scylladb/issues/28183 LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version. Closes scylladb/scylladb#28230 * github.com:scylladb/scylladb: test/cqlpy: add reproducer for hidden Paxos table being shown by DESC cql3/statements/describe_statement: hide paxos state tables	2026-02-02 21:22:59 +02:00
Nadav Har'El	f23e796e76	alternator: fix typos in comments and variable names Copilot found these typos in comments and variable name in alternator/, so might as well fix them. There are no functional changes in this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28447	2026-02-02 19:16:43 +03:00
Marcin Maliszkiewicz	88c4ca3697	Merge 'test: migrate guardrails_test.py from scylla-dtest' from Andrzej Jackowski This patch series copies `guardrails_test.py` from scylla-dtest, fixes it, and enables it. The motivation is to unify the test execution of guardrails tests, as some tests (`cqlpy/test_guardrail_...`) were already in scylladb repo, and some were in `scylla-dtest`. Fixes: SCYLLADB-255 No backport, just test migration Closes scylladb/scylladb#28454 * github.com:scylladb/scylladb: test: refactor test_all_rf_limits in guardrails_test.py test: specify exceptions being caught in guardrails_test.py test: enable guardrails_test.py test: add wait_other_notice to test_default_rf in guardrails_test.py test: copy guardrails_test.py from scylla-dtest	2026-02-02 16:54:13 +01:00
Avi Kivity	acc54cf304	tools: toolchain: adapt future toolchain to loss of toxiproxy in Fedora Next Fedora will likely not have toxiproxy packaged [1]. Adapt by installing it directly. To avoid changing the current toolchain, add a ./install-dependencies --future option. This will allow us to easily go back to the packages if the Fedora bug is fixed. [1] https://bugzilla.redhat.com/show_bug.cgi?id=2426954 Closes scylladb/scylladb#28444	2026-02-02 17:02:19 +02:00
Avi Kivity	419636ca8f	test: ldap: regularize toxiproxy command-line options Modern toxiproxy interprets `-h` as help and requires the subcommand subject (e.g. the proxy name) to be after the subcommand switches. Arrange the command line in the way it likes, and spell out the subcommands to be more comprehensible. Closes scylladb/scylladb#28442	2026-02-02 17:00:58 +02:00
Botond Dénes	2b3f3d9ba7	Merge 'test.py: support boost labels in test.py' from Artsiom Mishuta Related PR: https://github.com/scylladb/scylladb/pull/27527 This PR changes the test.py logic for parsing boost test cases to use -- --list_json_content and to pass boost labels as pytest markers. Using -- --list_json_content is not ideal and currently requires implementing several [workarounds](https://github.com/scylladb/scylladb/pull/27527#issuecomment-3765499812), but the ability to support boost labels in pytest is worth it, because now we can apply the tiering mechanism to the boost tests as well. Fixes SCYLLADB-246 Closes scylladb/scylladb#28232 * github.com:scylladb/scylladb: test: add nightly label test.py: support boost labels in test.py	2026-02-02 16:55:29 +02:00
Dawid Mędrek	68981cc90b	Merge 'raft topology: generate notification about released nodes only once' from Piotr Dulikowski Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. When tablets are present, a node might still technically be a replica of some tablets after it has moved to the left state. When it is no longer a replica of any tablet, it becomes "released" and storage service generates a notification about it. Hinted handoff listens to this notification and kicks off draining hints after getting it. The current implementation of the "released" notification would trigger every time raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after the verbosity of the draining-related log messages was increased in `44de563`. The verbosity increase itself makes sense as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed. Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed. Fixes: scylladb/scylladb#28301 Refs: scylladb/scylladb#25031 The log messages added in `44de563` cause a lot of noise during topology operations and tablet migrations, so the fix should be backported to all affected versions (2025.4 and 2026.1).
Closes scylladb/scylladb#28367 * github.com:scylladb/scylladb: storage_service: fix indentation after previous patch raft topology: generate notification about released nodes only once raft topology: extract "released" nodes calculation to external function	2026-02-02 15:39:15 +01:00
Jenkins Promoter	c907fc6789	Update pgo profiles - aarch64	2026-02-02 14:56:49 +02:00
Dawid Mędrek	b0afd3aa63	Merge 'storage_service: set up topology properly in maintenance mode' from Patryk Jędrzejczak We currently make the local node the only token owner (that owns the whole ring) in maintenance mode, but we don't update the topology properly. The node is present in the topology, but in the `none` state. That's how it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in `scylla_main`. As a result, the node started in maintenance mode crashes in the following way in the presence of a vnodes-based keyspace with the NetworkTopologyStrategy: ``` scylla: locator/network_topology_strategy.cc:207: locator::natural_endpoints_tracker::natural_endpoints_tracker( const token_metadata &, const network_topology_strategy::dc_rep_factor_map &): Assertion `!_token_owners.empty() && !_racks.empty()' failed. ``` Both `_token_owners` and `_racks` are empty. The reason is that `_tm.get_datacenter_token_owners()` and `_tm.get_datacenter_racks_token_owners()` called above filter out nodes in the `none` state. This bug basically made maintenance mode unusable in customer clusters. We fix it by changing the node state to `normal`. We also extend `test_maintenance_mode` to provide a reproducer for #27988. Fixes #27988 This PR must be backported to all branches, as maintenance mode is currently unusable everywhere. Closes scylladb/scylladb#28322 * github.com:scylladb/scylladb: test: test_maintenance_mode: enable maintenance mode properly test: test_maintenance_mode: shutdown cluster connections test: test_maintenance_mode: run with different keyspace options test: test_maintenance_mode: check that group0 is disabled by creating a keyspace test: test_maintenance_mode: get rid of the conditional skip test: test_maintenance_mode: remove the redundant value from the query result storage_proxy: skip validate_read_replica in maintenance mode storage_service: set up topology properly in maintenance mode	2026-02-02 13:28:19 +01:00
Andrzej Jackowski	298aca7da8	test: refactor test_all_rf_limits in guardrails_test.py Before this commit, `test_all_rf_limits` was implemented in a repetitive manner, making it harder to understand how the guardrails were tested. This commit refactors the test to reduce code redundancy and verify the guardrails more explicitly.	2026-02-02 10:49:12 +01:00
Andrzej Jackowski	136db260ca	test: specify exceptions being caught in guardrails_test.py Before this commit, the test caught a broad `Exception`. This change specifies the expected exceptions to avoid a situation where the product or test is broken and it goes undetected.	2026-02-02 10:48:07 +01:00
Patryk Jędrzejczak	ec2f99b3d1	test: pylib: scylla_cluster: set shutdown_announce_in_ms to 0 The usual Scylla shutdown in a cluster test takes ~2.1s. 2s come from ``` co_await sleep(std::chrono::milliseconds(_gcfg.shutdown_announce_ms)); ``` as the default value of `shutdown_announce_in_ms` is 2000. This sleep makes every `server_stop_gracefully` call 2s slower. There are ~300 such calls in cluster tests (note that some come from `rolling_restart`). So, it looks like this sleep makes cluster tests 300 * 2s = 10min slower. Indeed, `./test.py --mode=dev cluster` takes 61min instead of 71min on the potwor machine (the one in the Warsaw office) without it. We set `shutdown_announce_in_ms` to 0 for all cluster tests to make them faster. The sleep is completely unnecessary in tests. Removing it could introduce flakiness, but if that's the case, then the test for which it happens is incorrect in the first place. Tests shouldn't assume that all nodes receive and handle the shutdown message in 2s. They should use functions like `server_not_sees_other_server` instead, which are faster and more reliable.	2026-02-02 10:39:55 +01:00
Patryk Jędrzejczak	1f28a55448	test: test_tablets_parallel_decommission: prevent group0 majority loss Both of the changed test cases stop two out of four nodes when there are three group0 voters in the cluster. If one of the two live nodes is a non-voter (node 1, specifically, as node 0 is the leader), a temporary majority loss occurs, which can cause the following operations to fail. In the case of `test_tablets_are_rebuilt_in_parallel`, the `exclude_node` API can fail. In the case of `test_remove_is_canceled_if_there_is_node_down`, removenode can fail with an unexpected error message: ``` "service::raft_operation_timeout_error (group [46dd9cf1-fe21-11f0-baa0-03429f562ff5] raft operation [read_barrier] timed out)" ``` Somehow, these test cases are currently not flaky, but they become flaky in the following commit. We can consider backporting this commit to 2026.1 to prevent flakiness.	2026-02-02 10:39:55 +01:00
Patryk Jędrzejczak	bcf0114e90	test: delete test_service_levels_work_during_recovery The test becomes flaky in one of the following commits. However, there is no need to fix it, as we should delete it anyway. We are in the process of removing the gossip-based topology from the code base, which includes the recovery mode. We don't have to rewrite the test to use the new Raft-based recovery procedure, as there is nothing interesting to test (no regression to legacy service levels).	2026-02-02 10:39:54 +01:00
Artsiom Mishuta	af2d7a146f	test: add nightly label Add the nightly label to the test test_foreign_reader_as_mutation_source as an example of using boost labels as pytest markers. Command to test: ./tools/toolchain/dbuild pytest --test-py-init --collect-only -q -m=nightly test/boost Output: boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.debug.1 boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.release.1 boost/mutation_reader_test.cc::test_foreign_reader_as_mutation_source.dev.1	2026-02-02 10:30:38 +01:00
Gleb Natapov	08268eee3f	topology: disable force-gossip-topology-changes option The patch marks force-gossip-topology-changes as deprecated and removes tests that use it. There is one test (test_different_group0_ids) which is marked as xfail instead since it looks like gossiper mode was used there as a way to easily achieve a certain state, so more investigation is needed if the tests can be fixed to use raft mode instead. Closes scylladb/scylladb#28383	2026-02-02 09:56:32 +01:00
Avi Kivity	ceec703bb7	Revert "main: test: add future and abort_source to after_init_func" This reverts commit `7bf7ff785a`. The commit tried to add clean shutdown to `scylla perf` paths, but forgot at least `scylla perf-alternator --workload wr` which now crashes on uninitialized `c.as`. Fixes #28473 Closes scylladb/scylladb#28478	2026-02-02 09:22:24 +01:00
Avi Kivity	cc03f5c89d	cql3: support literals and bind variables in selectors Add support for literals in the SELECT clause. This allows SELECT fn(column, 4) or SELECT fn(column, ?). Note, "SELECT 7 FROM tab" becomes valid in the grammar, but is still not accepted because of failed type inference - we cannot infer the type of 7, and don't have a favored type for literals (like C favors int). We might relax this later. In the WHERE clause, and Cassandra in the SELECT clause, type hints can also resolve type ambiguity: (bigint)7 or (text)?. But this is deferred to a later patch. A few changes to the grammar are needed on top of adding a `value` alternative to `unaliasedSelector`: - vectorSimilarityArg gained access to `value` via `unaliasedSelector`, so it loses that alternate to avoid ambiguity. We may drop `vectorSimilarityArg` later. - COUNT(1) became ambiguous via the general function path (since function arguments can now be literals), so we remove this case from the COUNT special cases, remaining with count(*). - SELECT JSON and SELECT DISTINCT became "ambiguous enough" for ANTLR to complain, though as far as I can tell `value` does not add real ambiguity. The solution is to commit early (via "=>") to a parsing path. Due to the loss of count(1) recognition in the parser, we have to special-case it in prepare. We may relax it to count any expression later, like modern Cassandra and SQL. Testing is awkward because of the type inference problem in top-level. We test via the set_intersection() function and via lua functions. Example: ``` cqlsh> CREATE FUNCTION ks.sum(a int, b int) RETURNS NULL ON NULL INPUT RETURNS int LANGUAGE LUA AS 'return a + b'; cqlsh> SELECT ks.sum(1, 2) FROM system.local; ks.sum(1, 2) -------------- 3 (1 rows) cqlsh> ``` (There are no suitable system functions!) Fixes https://scylladb.atlassian.net/browse/SCYLLADB-296 Closes scylladb/scylladb#28256	2026-02-02 00:06:13 +02:00
Patryk Jędrzejczak	68b105b21c	db: virtual tables: add the rack column to cluster_status `system.cluster_status` is missing the rack info compared to `nodetool status` that is supposed to be equivalent. It has probably been an omission. Closes scylladb/scylladb#28457	2026-02-01 20:36:53 +01:00
Pavel Emelyanov	6f3f30ee07	storage_service: Use stream_manager group for streaming The handler of raft_topology_cmd::command::stream_ranges switches to the streaming scheduling group to perform data streaming in it. It grabs the group from the database db_config, which is not great. The streaming manager is at hand in the storage service handlers, and since the handler uses its functionality, it should use _its_ scheduling group. This will help split the streaming scheduling group into more elaborate groups under the maintenance supergroup: SCYLLADB-351 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28363	2026-02-01 20:42:37 +02:00
Marcin Maliszkiewicz	b8c75673c8	main: remove confusing duplicated auth start message Previously, we could observe two identical "starting auth service" messages in the log: one from checkpoint(), the other from notify(). We remove the second one to stay consistent with other services. Closes scylladb/scylladb#28349	2026-02-01 13:57:53 +02:00
Avi Kivity	6676953555	Merge 'test: perf: add option to write results to json in perf-cql-raw and perf-alternator' from Marcin Maliszkiewicz Adds the --json-result option to perf-cql-raw and perf-alternator, the same one that perf-simple-query has. It is useful for automating test runs. Related: https://scylladb.atlassian.net/browse/SCYLLADB-434 Backport: no, the original benchmark is not backported Closes scylladb/scylladb#28451 * github.com:scylladb/scylladb: test: perf: add example commands to perf-alternator and perf-cql-raw test: perf: add option to write results to json in perf-cql-raw test: perf: add option to write results to json in perf-alternator test: perf: move write_json_result to a common file	2026-02-01 13:57:10 +02:00
Artsiom Mishuta	e216504113	test.py: support boost labels in test.py Related PR: https://github.com/scylladb/scylladb/pull/27527 This PR changes the test.py logic for parsing boost test cases to use -- --list_json_content and to pass boost labels as pytest markers. Fixes: https://github.com/scylladb/scylladb/issues/25415	2026-02-01 11:31:26 +01:00
Tomasz Grabiec	b93472d595	Merge 'load_stats: fix problem with load_stats refresh throwing no_such_column_family' from Ferenc Szili When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host. During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail. It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure. This fixes this problem by checking if the table still exists. Fixes: #28359 Closes scylladb/scylladb#28440 * github.com:scylladb/scylladb: test: add test and reproducer for load_stats refresh exception load_stats: handle dropped tables when refreshing load_stats	2026-01-31 21:12:19 +01:00
Ferenc Szili	92dbde54a5	test: add test and reproducer for load_stats refresh exception This patch adds a test and reproducer for the issue where the load_stats refresh procedure throws exceptions if any of the tables have been dropped since load_stats was produced.	2026-01-30 15:11:29 +01:00
Patryk Jędrzejczak	7e7b9977c5	test: test_maintenance_mode: enable maintenance mode properly The same issue as the one fixed in `394207fd69`. This one didn't cause real problems, but it's still cleaner to fix it.	2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak	6c547e1692	test: test_maintenance_mode: shutdown cluster connections Leaked connections are known to cause inter-test issues.	2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak	867a1ca346	test: test_maintenance_mode: run with different keyspace options We extend the test to provide a reproducer for #27988 and to avoid similar bugs in the future. The test slows down from ~14s to ~19s on my local machine in dev mode. It seems reasonable.	2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak	53f58b85b7	test: test_maintenance_mode: check that group0 is disabled by creating a keyspace In the following commit, we make the rest run with multiple keyspaces, and the old check becomes inconvenient. We also move it below to the part of the code that won't be executed for each keyspace. Additionally, we check if the error message is as expected.	2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak	408c6ea3ee	test: test_maintenance_mode: get rid of the conditional skip This skip has already caused trouble. After `0668c642a2`, the skip was always hit, and the test was silently doing nothing. This made us miss #26816 for a long time. The test was fixed in `222eab45f8`, but we should get rid of the skip anyway. We increase the number of writes from 256 to 1000 to make the chance of not finding the key on server A even lower. If that still happens, it must be due to a bug, so we fail the test. We also make the test insert rows until server A is a replica of one row. The expected number of inserted rows is a small constant, so it should, in theory, make the test faster and cleaner (we need one row on server A, so we insert exactly one such row). It's possible to make the test fully deterministic, by e.g., hardcoding the key and tokens of all nodes via `initial_token`, but I'm afraid it would make the test "too deterministic" and could hide a bug.	2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak	c92962ca45	test: test_maintenance_mode: remove the redundant value from the query result	2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak	9d4a5ade08	storage_proxy: skip validate_read_replica in maintenance mode In maintenance mode, the local node adds only itself to the topology. However, the effective replication map of a keyspace with tablets enabled contains all tablet replicas. It gets them from the tablets map, not the topology. Hence, `network_topology_strategy::sanity_check_read_replicas` hits ``` throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace())); ``` for tablet replicas other than the local node. As a result, all requests to a keyspace with tablets enabled and RF > 1 fail in debug mode (`validate_read_replica` does nothing in other modes). We don't want to skip maintenance mode tests in debug mode, so we skip the check in maintenance mode. We move the `is_debug_build()` check because: - `validate_read_replicas` is a static function with no access to the config, - we want the `!_db.local().get_config().maintenance_mode()` check to be dropped by the compiler in non-debug builds. We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.	2026-01-30 12:55:17 +01:00
Patryk Jędrzejczak	a08c53ae4b	storage_service: set up topology properly in maintenance mode We currently make the local node the only token owner (that owns the whole ring) in maintenance mode, but we don't update the topology properly. The node is present in the topology, but in the `none` state. That's how it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in `scylla_main`. As a result, the node started in maintenance mode crashes in the following way in the presence of a vnodes-based keyspace with the NetworkTopologyStrategy: ``` scylla: locator/network_topology_strategy.cc:207: locator::natural_endpoints_tracker::natural_endpoints_tracker( const token_metadata &, const network_topology_strategy::dc_rep_factor_map &): Assertion `!_token_owners.empty() && !_racks.empty()' failed. ``` Both `_token_owners` and `_racks` are empty. The reason is that `_tm.get_datacenter_token_owners()` and `_tm.get_datacenter_racks_token_owners()` called above filter out nodes in the `none` state. This bug basically made maintenance mode unusable in customer clusters. We fix it by changing the node state to `normal`. We also update its rack, datacenter, and shards count. Rack and datacenter are present in the topology somehow, but there is nothing wrong with updating them again. The shard count is also missing, so we better update it to avoid other issues. Fixes #27988	2026-01-30 12:55:16 +01:00
Andrzej Jackowski	625f292417	test: enable guardrails_test.py After guardrails_test.py has been migrated to test.py and fixed in previous commits of this patch series, it can finally be enabled. Fixes: SCYLLADB-255	2026-01-30 11:51:46 +01:00
Andrzej Jackowski	576ad29ddb	test: add wait_other_notice to test_default_rf in guardrails_test.py This commit adds `wait_other_notice=True` to `cluster.populate` in `guardrails_test.py`. Without this, `test_default_rf` sometimes fails because `NetworkTopologyStrategy` setting fails before the node knows about all other DCs. Refs: SCYLLADB-255	2026-01-30 11:51:46 +01:00
Andrzej Jackowski	64c774c23a	test: copy guardrails_test.py from scylla-dtest This commit copies guardrails_test.py from dtest repository and (temporarily) disables it, as it requires improvement in following commits of this patch series before being enabled. Refs: SCYLLADB-255	2026-01-30 11:51:40 +01:00
Marcin Maliszkiewicz	e18b519692	cql3: remove find_schema call from select check_access Schema is already a member of select statement, avoiding the call saves around 400 cpu instructions on a select request hot path. Closes scylladb/scylladb#28328	2026-01-30 11:49:09 +01:00
Ferenc Szili	71be10b8d6	load_stats: handle dropped tables when refreshing load_stats When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host. During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail. It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure. This patch fixes this problem by checking if the table still exists.	2026-01-30 09:48:59 +01:00
Marcin Maliszkiewicz	80e627c64b	test: perf: add example commands to perf-alternator and perf-cql-raw	2026-01-30 08:48:19 +01:00
Pawel Pery	f49c9e896a	vector_search: allow full secondary indexes syntax while creating the vector index Vector Search feature needs to support creating vector indexes with additional filtering column. There will be two types of indexes: global which indexes vectors per table, and local which indexes vectors per partition key. The new syntaxes are based on ScyllaDB's Global Secondary Index and Local Secondary Index. Vector indexes don't use secondary indexes functionalities in any way - all indexing, filtering and processing data will be done on Vector Store side. This patch allows creating vector indexes using this CQL syntax: ``` CREATE TABLE IF NOT EXISTS cycling.comments_vs ( commenter text, comment text, comment_vector VECTOR <FLOAT, 5>, created_at timestamp, discussion_board_id int, country text, lang text, PRIMARY KEY ((commenter, discussion_board_id), created_at) ); CREATE CUSTOM INDEX IF NOT EXISTS global_ann_index ON cycling.comments_vs(comment_vector, country, lang) USING 'vector_index' WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' }; CREATE CUSTOM INDEX IF NOT EXISTS local_ann_index ON cycling.comments_vs((commenter, discussion_board_id), comment_vector, country, lang) USING 'vector_index' WITH OPTIONS = { 'similarity_function': 'DOT_PRODUCT' }; ``` Currently, if we run these queries to create indexes we will receive such errors: ``` InvalidRequest: Error from server: code=2200 [Invalid query] message="Vector index can only be created on a single column" InvalidRequest: Error from server: code=2200 [Invalid query] message="Local index definition must contain full partition key only. Redundant column: XYZ" ``` This commit refactors `vector_index::check_target` to correctly validate columns building the index. Vector-store currently support filtering by native types, so the type of columns is checked. The first column from the list must be a vector (to build index based on these vectors), so it is also checked. 
Allowed types for columns are native types without counter (it is not possible to create a table with counter and vector) and without duration (it is not possible to correctly compare durations; this type is not even allowed in secondary indexes). This commit adds a cqlpy test to check errors while creating indexes. Fixes: SCYLLADB-298 This needs to be backported to version 2026.1 as this is a fix for filtering support. Closes scylladb/scylladb#28366	2026-01-30 01:14:31 +02:00
Avi Kivity	3d1558be7e	test: remove xfail markers from SELECT JSON count(*) tests These were marked xfail due to #8077 (the column name was wrong), but it was fixed long ago for 5.4 (exact commit not known). Remove the xfail markers to prevent regressions. Closes scylladb/scylladb#28432	2026-01-29 21:56:00 +02:00
Piotr Dulikowski	f150629948	Merge 'auth: switch find_record to use cache' from Marcin Maliszkiewicz This series optimizes role lookup by moving find_record into standard_role_manager and switching it to use the auth cache. This allows reverting can_login to its original simpler form, ensuring hot paths are properly cached while maintaining consistency via group0_guard. Backport: no, it's not a bug fix. Closes scylladb/scylladb#28329 * github.com:scylladb/scylladb: auth: bring back previous version of standard_role_manager::can_login auth: switch find_record to use cache auth: make find_record and callers standard_role_manager members	2026-01-29 17:25:42 +01:00
Avi Kivity	7984925059	Merge 'Use coroutine::switch_to() in table::try_flush_memtable_to_sstable' from Pavel Emelyanov The method was coroutinized by `6df07f7ff7`. Back then, coroutine::switch_to() wasn't available, and the code used with_scheduling_group() to call coroutinized lambdas. Those lambdas were implemented as on-stack variables to solve the capture list lifetime problems. As a result, the code looks like ``` auto flush = [] { ... // do the flushing auto post_flush = [] { ... // do the post-flushing } co_return co_await with_scheduling_group(group_b, post_flush); }; co_return co_await with_scheduling_group(group_a, flush); ``` which is a bit clumsy. Now we have switch_to() and can make the code flow of this method more readable, like this ``` co_await switch_to(group_a); ... // do the flushing co_await switch_to(group_b); ... // do the post-flushing ``` Code cleanup, not backporting Closes scylladb/scylladb#28430 * github.com:scylladb/scylladb: table: Fix indentation after previous patch table: Use coroutine::switch_to() in try_flush_memtable_to_sstable()	2026-01-29 18:12:35 +02:00
Nadav Har'El	a6fdda86b5	Merge 'test: test_alternator_proxy_protocol: fix race between node startup and test start' from Avi Kivity test_alternator_proxy_protocol starts a node and connects via the alternator ports. Starting a node, by default, waits until the CQL ports are up. This does not guarantee that the alternator ports are up (they will be up very soon after this), so there is a short window where a connection to the alternator ports will fail. Fix by adding a ServerUpState=SERVING mode, which waits for the node to report to its supervisor (systemd, which we are pretending to be) that its ports are open. The test is then adjusted to request this new ServerUpState. Fixes #28210 Fixes #28211 Flaky tests are only in master and branch-2026.1, so backporting there. Closes scylladb/scylladb#28291 * github.com:scylladb/scylladb: test: test_alternator_proxy_protocol: wait for the node to report itself as serving test: cluster_manager: add ability to wait for supervisor STATUS=serving	2026-01-29 16:18:26 +02:00
Pavel Emelyanov	56e212ea8d	table: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-29 15:02:33 +03:00
Pavel Emelyanov	258a1a03e3	table: Use coroutine::switch_to() in try_flush_memtable_to_sstable() It allows dropping the local lambdas passed into with_scheduling_group() calls. Overall the code flow becomes more readable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-29 15:01:27 +03:00
Marcin Maliszkiewicz	ea29e4963e	test: perf: add option to write results to json in perf-cql-raw	2026-01-29 10:56:03 +01:00
Marcin Maliszkiewicz	d974ee1e21	test: perf: add option to write results to json in perf-alternator	2026-01-29 10:55:52 +01:00
Marcin Maliszkiewicz	a74b442c65	test: perf: move write_json_result to a common file The implementation is going to be shared with perf-alternator and perf-cql-raw.	2026-01-29 10:54:11 +01:00
Botond Dénes	3158e9b017	doc: reorganize properties in config.cc and config.hh This commit moves the "Ungrouped properties" category to the end of the properties list. The properties are now published in the documentation, and it doesn't look good if the list starts with ungrouped properties. This patch was taken over from Anna Stuchlik <anna.stuchlik@scylladb.com>. Closes scylladb/scylladb#28343	2026-01-29 11:27:42 +03:00
Pavel Emelyanov	937d008d3c	Merge 'Clean up partition_snapshot_reader' from Botond Dénes Move to `replica/`, drop `flat` from name and drop unused usages as well as unused includes. Code cleanup, no backport Closes scylladb/scylladb#28353 * github.com:scylladb/scylladb: replica/partition_snapshot_reader: remove unused includes partition_snapshot_reader: remove "flat" from name mv partition_snapshot_reader.hh -> replica/	2026-01-29 11:22:15 +03:00
Botond Dénes	f6d7f606aa	memtable_test: disable flushing_rate_is_reduced_if_compaction_doesnt_keep_up for debug This test case was observed to take over 2 minutes to run on CI machines, contributing to already bloated CI run times. Disable this test in debug mode. This test checks for memtable flush being slowed down when compaction can't keep up. So this test needs to overwhelm the CPU by definition. On the other hand, this is not a correctness test, there are such tests for the memtable and compaction already, so it is not critical to run this in debug mode, it is not expected to catch any use-after-free and such. Closes scylladb/scylladb#28407	2026-01-29 11:13:22 +03:00
Jakub Smolar	e978cc2a80	scylla_gdb: use persistent GDB - decrease test execution time This commit replaces the previous approach of running pytest inside GDB’s Python interpreter. Instead, tests are executed by driving a persistent GDB process externally using pexpect. - pexpect: Python library for controlling interactive programs (used here to send commands to GDB and capture its output) - persistent GDB: keep one GDB session alive across multiple tests instead of starting a new process for each test Tests can now be executed via `./test.py gdb` or with `pytest test/scylla_gdb`. This improves performance and makes failures easier to debug since pytest no longer runs hidden inside GDB subprocesses. Closes scylladb/scylladb#24804	2026-01-29 10:01:39 +02:00
Avi Kivity	347c69b7e2	build: add clang-tools-extra (for clang-include-cleaner) to frozen toolchain clang-include-cleaner is used in the iwyu.yaml github workflow (include-what-you-use). Add it to the frozen toolchain so it can be made part of the regular build process. The corresponding install command is removed from iwyu.yaml. Regenerated frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz Closes scylladb/scylladb#28413	2026-01-29 08:44:49 +02:00
Botond Dénes	482ffe06fd	Merge 'Improve load shedding on the replica side' from Łukasz Paszkowski When reads arrive, they have to wait for admission on the reader concurrency semaphore. If the node is overloaded, the reads will be queued. They can time out while in the queue, but will not time out once admitted. Once the shard is sufficiently loaded, it is possible that most queued reads will time out, because the average time it takes for a queued read to be admitted is around that of the timeout. If a read times out, any work we already did, or are about to do on it, is wasted effort. Therefore, the patch tries to prevent this by checking whether an admitted read has a chance to complete in time and aborting it if not. It uses the following criterion: if the read's remaining time <= the read's timeout when it arrived at the semaphore * the live-updatable preemptive_abort_factor, the read is rejected and the next one from the wait list is considered. Fixes https://github.com/scylladb/scylladb/issues/14909 Fixes: SCYLLADB-353 Backport is not needed. Better to first observe its impact. Closes scylladb/scylladb#21649 * github.com:scylladb/scylladb: reader_concurrency_semaphore: Check during admission if read may timeout permit_reader::impl: Replace break with return after evicting inactive permit on timeout reader_concurrency_semaphore: Add preemptive_abort_factor to constructors config: Add parameters to control reads' preemptive_abort_factor permit_reader: Add a new state: preemptive_aborted reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit reader_concurrency_semaphore: Remove cpu_concurrency's default value	2026-01-29 08:27:22 +02:00
Botond Dénes	a8767f36da	Merge 'Improve load balancer logging and other minor cleanups' from Tomasz Grabiec Contains various improvements to tablet load balancer. Batched together to save on the bill for CI. Most notably: - Make plan summary more concise, and print info only about present elements. - Print rack name in addition to DC name when making a per-rack plan - Print "Not possible to achieve balance" only when this is the final plan with no active migrations - Print per-node stats when "Not possible to achieve balance" is printed - amortize metrics lookup cost - avoid spamming logs with per-node "Node {} does not have complete tablet stats, ignoring" Backport to 2026.1: since the changes enhance debuggability and are relatively low risk Fixes #28423 Fixes #28422 Closes scylladb/scylladb#28337 * github.com:scylladb/scylladb: tablets: tablet_allocator.cc: Convert tabs to spaces tablets: load_balancer: Warn about incomplete stats once for all offending nodes tablets: load_balancer: Improve node stats printout tablets: load_balancer: Warn about imbalance only when there are no more active migrations tablets: load_balancer: Extract print_node_stats() tablet: load_balancer: Use empty() instead of size() where applicable tablets: Fix redundancy in migration_plan::empty() tablets: Cache pointer to stats during plan-making tablets: load_balancer: Print rack in addition to DC when giving context tablets: load_balancer: Make plan summary concise tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan()	2026-01-29 08:25:17 +02:00
Amnon Heiman	f2e142ac6e	test/boost/estimated_histogram_test.cc: Switch to real Sum Now that the sum function in the histogram uses true values instead of an estimate, the test should reflect that. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-01-28 23:19:00 +02:00
Piotr Dulikowski	ec6a2661de	Merge 'Keep view_builder background fiber in maintenance scheduling group' from Pavel Emelyanov In fact, it's partially there already. When view_builder::start() is called, it first calls initialization code (the start_in_background() method), then kicks do_build_step() that runs a background fiber to perform build steps. The starting code inherits scheduling group from main(). And the step fiber code needs to run itself in a maintenance scheduling group, so it explicitly grabs one via database->db_config. This PR mainly gets rid of the call to database::get_streaming_scheduling_group() from do_build_step() as preparation to splitting the streaming scheduling group into parts (see SCYLLADB-351). To make it happen the do_build_step() is patched to inherit its scheduling group from view_builder::start() and the start() itself is called by main from maintenance scheduling group (like for other view building services). New feature (nested scheduling group), not backporting Closes scylladb/scylladb#28386 * github.com:scylladb/scylladb: view_builder: Start background in maintenance group view_builder: Wake-up step fiber with condition variable	2026-01-28 20:49:19 +01:00
Pavel Emelyanov	cb1d05d65a	streaming: Get streaming sched group from debug:: namespace In a lambda returned from make_streaming_consumer() there's a check for current scheduling group being streaming one. It came from #17090 where streaming code was launched in wrong sched group thus affecting user groups in a bad way. The check is nice and useful, but it abuses replica::database by getting unrelated information from it. To preserve the check and to stop using database as provider of configs, keep the streaming scheduling group handle in the debug namespace. This emphasises that this global variable is purely for debugging purposes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28410	2026-01-28 19:14:59 +02:00
Marcin Maliszkiewicz	5d4e2ec522	Merge 'docs: add documentation for automatic repair' from Botond Dénes Explain what automatic repair is and how to configure it. While at it, improve the existing repair documentation a bit. Fixes: SCYLLADB-130 This PR missed the 2026.1 branch date, so it needs backport to 2026.1, where the auto repair feature debuts. Closes scylladb/scylladb#28199 * github.com:scylladb/scylladb: docs: add feature page for automatic repair docs: inter-link incremental-repair and repair documents docs: incremental-repair: fix curl example	2026-01-28 17:46:53 +01:00
Nadav Har'El	1454228a05	test/cqlpy: fix "assertion rewriting" in translated Cassandra tests One of the best features of the pytest framework is "assertion rewriting": If your test does for example "assert a + 1 == b", the assertion is "rewritten" so that if it fails it tells you not only that "a+1" and "b" are not equal, but also what the non-equal values are, how they are not equal (e.g., find different elements of arrays) and how each side of the equality was calculated. But pytest can only "rewrite" assertions that it sees. If you call a utility function checksomething() from another module and that utility function calls assert, it will not be able to rewrite it, and you'll get ugly, hard-to-debug, assertion failures. This problem is especially noticeable in tests we translated from Cassandra, in test/cqlpy/cassandra_tests. Those tests use a bunch of assertion-performing utility functions like assertRows() et al. Those utility functions are defined in a separate source file, porting.py, so by default do not get their assertions rewritten. We had a solution for this: test/cqlpy/cassandra_test/__init__.py had: pytest.register_assert_rewrite("cassandra_tests.porting") This tells pytest to rewrite assertions in porting.py the first time that it is imported. It used to work well, but recently it stopped working. This is because we changed the module paths recently, and it should be written as test.cqlpy.cassandra_tests.porting. I verified by editing one of the cassandra_tests to make a bad check that indeed this statement stopped working, and fixing the module path in this way solves it, and makes assertion rewriting work again. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28411	2026-01-28 18:34:57 +02:00
Pavel Emelyanov	3ebd02513a	view_builder: Start background in maintenance group Currently view_builder::start() is called in default scheduling group. Once it initializes itself, it wakes up the step fiber that explicitly switches to maintenance scheduling group. This explicit switch made sense before previous patch, when the fiber was implemented as a serialized action. Now the fiber starts directly from .start() method and can inherit scheduling group from it. That said, main code calls view_builder::start() in maintenance scheduling group killing two birds with one stone. First, the step fiber no longer needs to borrow its scheduling group indirectly via database. Second, the start_in_background() code itself runs in a more suitable scheduling group. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-28 18:34:59 +03:00
Pavel Emelyanov	2439d27b60	view_builder: Wake-up step fiber with condition variable View builder runs a background fiber that performs build steps. To kick the fiber it uses serialized action, but it's an overkill -- nobody waits for the action to finish, but on stop, when it's joined. This patch uses condition variable to kick the fiber, and starts it instantly, in the place where serialized action was first kicked. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-28 18:34:58 +03:00
Pavel Emelyanov	5ce12f2404	gossiper: Export its scheduling group for those who need it There are several places in the code that need to explicitly switch into gossiper scheduling group. For that they currently call database to provide the group, but it's better to get gossiper sched group from gossiper itself, all the more so all those places have gossiper at hand. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-28 18:29:33 +03:00
Pavel Emelyanov	0da1a222fc	migration_manager: Reorder members This is to initialize dependency references, in particular gossiper&, before _group0_barrier. The latter will need to access this->_gossiper in the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-28 18:29:33 +03:00
Botond Dénes	1713d75c0d	docs: add feature page for automatic repair Explain what the feature is and how to configure it. Inter-link all the repair related pages, so one can discover all about repair, regardless of which page they land on.	2026-01-28 16:45:57 +02:00
Łukasz Paszkowski	7e1bbbd937	reader_concurrency_semaphore: Check during admission if read may timeout When a shard on a replica is overloaded, it breaks down completely, throughput collapses, latencies go through the roof and the node/shard can even become completely unresponsive to new connection attempts. When reads arrive, they have to wait for admission on the reader concurrency semaphore. If the node is overloaded, the reads will be queued and thus they can time out while being in the queue or during the execution. In the latter case, the timeout does not always result in the read being aborted. Once the shard is sufficiently loaded, it is possible that most queued reads will time out, because the average time it takes for a queued read to be admitted is around that of the timeout. If a read times out, any work we already did, or are about to do on it is wasted effort. Therefore, the patch tries to prevent it by checking if an admitted read has a chance to complete in time and abort it if not. It uses the following criteria: if read's remaining time <= read's timeout when arrived to the semaphore * preemptive factor; the read is rejected and the next one from the wait list is considered.	2026-01-28 14:24:45 +01:00
Łukasz Paszkowski	8a613960af	permit_reader::impl: Replace break with return after evicting inactive permit on timeout Evicting an inactive permit destroys the permit object when the reader is closed, making any further member access invalid. Switch from break to an early return to prevent any possible use-after-free after evict() in the state::inactive timeout path.	2026-01-28 14:24:33 +01:00
Łukasz Paszkowski	fde09fd136	reader_concurrency_semaphore: Add preemptive_abort_factor to constructors The new parameter parametrizes the factor used to reject a read during admission. Its value shall be between 0.0 and 1.0 where + 0.0 means a read will never get rejected during admission + 1.0 means a read will immediately get rejected during admission Although passing values outside the interval is possible, they will have the exact same effects as if they were clamped to [0.0, 1.0].	2026-01-28 14:20:01 +01:00
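The admission criterion stated in the commit above can be sketched as follows. This is a minimal Python illustration, not the C++ code in reader_concurrency_semaphore: the function name and its parameters are hypothetical, and clamping the factor to [0.0, 1.0] is an assumption taken from the description of the constructor parameter.

```python
def should_preemptively_abort(remaining: float,
                              timeout_at_arrival: float,
                              factor: float) -> bool:
    """Decide whether a read about to be admitted should be rejected instead.

    remaining          -- time left until the read's deadline, in seconds
    timeout_at_arrival -- the read's timeout when it reached the semaphore
    factor             -- hypothetical stand-in for preemptive_abort_factor
    """
    # Assumption: out-of-range factors behave as if clamped to [0.0, 1.0].
    factor = min(max(factor, 0.0), 1.0)
    # Reject when too little of the original time budget is left.
    return remaining <= timeout_at_arrival * factor


# A read that arrived with a 1 s timeout and has 0.2 s left is rejected
# with factor 0.3, while one with 0.5 s left is admitted.
print(should_preemptively_abort(0.2, 1.0, 0.3))
print(should_preemptively_abort(0.5, 1.0, 0.3))
```

With a factor of 0.0 only reads with no time left satisfy the condition, and with 1.0 every queued read does, which matches the two extremes described for the parameter.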
Łukasz Paszkowski	21348050e8	config: Add parameters to control reads' preemptive_abort_factor	2026-01-28 14:20:01 +01:00
Łukasz Paszkowski	2d3a40e023	permit_reader: Add a new state: preemptive_aborted A permit gets into the preemptive_aborted state when it: - times out; - gets rejected from execution due to high chance its execution would not finalize on time; Being in this state means a permit was removed from the wait list, its internal timer was canceled and semaphore's statistic `total_reads_shed_due_to_overload` was increased.	2026-01-28 14:20:01 +01:00
Łukasz Paszkowski	5a7cea00d0	reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit Add a defensive check in dequeue_permit() to avoid underflowing _stats.waiters and report an internal error if the stats are already inconsistent.	2026-01-28 14:19:53 +01:00
Tomasz Grabiec	df949dc506	Merge 'topology_coordinator: make cleanup reliable on barrier failures' from Łukasz Paszkowski Fix a subtle but damaging failure mode in the tablet migration state machine: when a barrier fails, the follow-up barrier is triggered asynchronously, and cleanup can get skipped for that iteration. On the next loop, the original failure may no longer be visible (because the failing node got excluded), so the tablet can incorrectly move forward instead of entering `cleanup_target`. To make cleanup reliable this PR: Adds an additional “fallback cleanup” stage - `write_both_read_old_fallback_cleanup` that does not modify read/write selectors. This stage is safe to enter immediately after a barrier failure, and it funnels the tablet into cleanup with the required barriers. Avoids changing both read and write selectors in a single step transitioning from `write_both_read_new` to `cleanup_target`. The fallback path updates selectors in a safe order: read first, then write. Allows a direct no-barrier transition from `allow_write_both_read_old` to `cleanup_target` after failure, because in that specific case `cleanup_target` doesn’t change selectors and the hop is safe. No need for backport. It's an improvement. Currently, tablets transition to `cleanup_target` eventually via failed streaming. Closes scylladb/scylladb#28169 * github.com:scylladb/scylladb: topology_coordinator: add write_both_read_old_fallback_cleanup state topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old	2026-01-28 13:33:39 +01:00
Amnon Heiman	3175540e87	transport/server: to bytes_histogram This patch replaces simple counters with bytes_histogram for tracking CQL request and response sizes, enabling better visibility into message size distribution. Changes: - Replace request_size and response_size metrics with bytes_histogram in cql_sg_stats::request_kind_stats - Per-shard metrics continue to be reported as before - QUERY, EXECUTE, and BATCH operations now report per-node, per-scheduling-group histograms of bytes sent and received, providing detailed insight into these operations Other CQL operations (e.g., PREPARE, OPTIONS) are not included in per-node histogram reporting as they are less performance-critical, but can be added in the future if proven useful. Metrics example: ``` # HELP scylla_transport_cql_request_bytes Counts the total number of received bytes in CQL messages of a specific kind. # TYPE scylla_transport_cql_request_bytes counter scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808 scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409 scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631 scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809 scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079 scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98 scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432 # HELP scylla_transport_cql_request_histogram_bytes A histogram of received bytes in CQL messages of a specific kind and specific scheduling group. 
# TYPE scylla_transport_cql_request_histogram_bytes histogram scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079 scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57 
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57 scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57 ```	2026-01-28 13:53:47 +02:00
Amnon Heiman	5875bcca23	approx_exponential_histogram: Add sum() method for accurate value tracking Previously, histogram sums were estimated by multiplying bucket offsets by their counts, which produces inaccurate results - typically too high when using upper limits or too low when using lower limits. This patch adds accurate sum tracking to approx_exponential_histogram: - Adds a _sum member variable to track the actual sum of all values - Implements sum() method to return the accumulated total - Updates add() to increment _sum for each value - Modifies to_metrics_histogram() helper to use the new sum() method This change is important as histograms will be used instead of counters for byte statistics, where accurate totals are essential for metrics reporting. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-01-28 13:39:46 +02:00
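The difference between an estimated and a tracked sum can be illustrated with a toy histogram. The sketch below is written in Python under stated assumptions: `ToyBytesHistogram` and its methods are hypothetical and only mimic power-of-2 buckets from 1 KiB to 1 GiB; it is not the `approx_exponential_histogram` implementation.

```python
class ToyBytesHistogram:
    """Toy power-of-2 histogram (hypothetical, for illustration only)."""

    def __init__(self, min_value=1024, max_value=1 << 30):
        # Bucket upper limits: 1024, 2048, ..., max_value.
        self.limits = []
        limit = min_value
        while limit <= max_value:
            self.limits.append(limit)
            limit *= 2
        self.counts = [0] * len(self.limits)
        self._sum = 0  # exact running total of all added values

    def add(self, value):
        for i, limit in enumerate(self.limits):
            if value <= limit:
                self.counts[i] += 1
                break
        else:
            # Values above the largest limit land in the last bucket.
            self.counts[-1] += 1
        self._sum += value

    def sum(self):
        """Exact sum of the values that were added."""
        return self._sum

    def estimated_sum_upper(self):
        """Old-style estimate: bucket upper limit times bucket count."""
        return sum(l * c for l, c in zip(self.limits, self.counts))


h = ToyBytesHistogram()
h.add(100)
h.add(3000)
print(h.sum(), h.estimated_sum_upper())
```

In this example the two values fall into the 1024 and 4096 buckets, so the upper-limit estimate is noticeably larger than the tracked total.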
Amnon Heiman	2fd453f4ec	utils/estimated_histogram.hh: Add bytes_histogram For various use cases, we need to report byte histograms, such as for request and reply message sizes. This patch introduces bytes_histogram as a type alias for approx_exponential_histogram configured to track byte values from 1KB to 1GB with power-of-2 buckets (Precision=1). This provides a convenient, performance-efficient histogram for measuring message sizes, payload sizes, and other byte-based metrics. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-01-28 13:31:39 +02:00
Botond Dénes	ee631f31a0	Merge 'Do not export system keyspace from raft_group0_client' from Pavel Emelyanov There are a few places that use raft_group0_client as a way to get to system_keyspace. Mostly they can live without it -- either the needed reference is already at hand, or it's (ab)used to get to the database reference. The only place that really needs the system keyspace is the state merger code that needs last state ID. For that, the explicit helper method is added to group0_client. Refining API between components, not backporting Closes scylladb/scylladb#28387 * github.com:scylladb/scylladb: raft_group0_client: Dont export system keyspace raft_group0_client: Add and use get_last_group0_state_id() group0_state_machine: Call ensure_group0_sched() with data_dictionary view_building_worker: Use its own system_keyspace& reference	2026-01-28 13:24:32 +02:00
Yaron Kaikov	7c49711906	test/cqlpy: Remove redundant pytest.register_assert_rewrite call During test.py run, noticed this warning: ``` 10:38:22 test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: 32 warnings 10:38:22 /jenkins/workspace/releng-testing/scylla-ci/scylla/test/cqlpy/cassandra_tests/validation/operations/insert_update_if_condition_test.py:14: PytestAssertRewriteWarning: Module already imported so cannot be rewritten: test.cqlpy.cassandra_tests.porting 10:38:22 pytest.register_assert_rewrite('test.cqlpy.cassandra_tests.porting') ``` The insert_update_if_condition_test.py was calling pytest.register_assert_rewrite() for the porting module, but this registration is already handled by cassandra_tests/__init__.py which is automatically loaded before any test runs. Closes scylladb/scylladb#28409	2026-01-28 13:17:05 +02:00
Avi Kivity	42fdea7410	github: fix iwyu workflow permissions The include-what-you-use workflow fails with ``` Invalid workflow file: .github/workflows/iwyu.yaml#L25 The workflow is not valid. .github/workflows/iwyu.yaml (Line: 25, Col: 3): Error calling workflow 'scylladb/scylladb/.github/workflows/read-toolchain.yaml@257054deffbef0bde95f0428dc01ad10d7b30093'. The nested job 'read-toolchain' is requesting 'contents: read', but is only allowed 'contents: none'. ``` Fix by adding the correct permissions. Closes scylladb/scylladb#28390	2026-01-28 12:38:54 +02:00
Jakub Smolar	e1f623dd69	skip_mode: Allow multiple build modes in pytest skip_mode marker Enhance the skip_mode marker to accept either a single mode string or a list of modes, allowing tests to be skipped across multiple build configurations with a single marker. Before: @pytest.mark.skip_mode("dev", reason="...") @pytest.mark.skip_mode("debug", reason="...") After: @pytest.mark.skip_mode(["dev", "debug"], reason="...") This reduces duplication when the same skip condition applies to multiple build modes. Closes scylladb/scylladb#28406	2026-01-28 12:27:41 +02:00
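Accepting either a single mode or a list of modes usually comes down to normalizing the marker argument before comparing it with the current build mode. The following Python sketch shows one way to do that; `normalize_modes` and `should_skip` are hypothetical helper names and are not taken from the actual test framework code.

```python
def normalize_modes(modes):
    """Turn a single mode string or an iterable of modes into a list."""
    if isinstance(modes, str):
        # A bare string must not be iterated character by character.
        return [modes]
    return list(modes)


def should_skip(current_mode, modes):
    """Return True if the test should be skipped in current_mode."""
    return current_mode in normalize_modes(modes)


print(should_skip("debug", "dev"))
print(should_skip("debug", ["dev", "debug"]))
```

Checking for `str` first matters because a string is itself iterable, so passing `"dev"` straight to `list()` would yield `['d', 'e', 'v']`.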
Patryk Jędrzejczak	a2c1569e04	test: test_gossiper_orphan_remover: get host ID of the bootstrapping node before it crashes The test is currently flaky. It tries to get the host ID of the bootstrapping node via the REST API after the node crashes. This can obviously fail. The test usually doesn't fail, though, as it relies on the host ID being saved in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()` repeatedly called in `ScyllaServer.start()`. However, with a very fast crash and unlucky timings, no such call may succeed. We deflake the test by getting the host ID before the crash. Note that at this point, the bootstrapping node must be serving the REST API requests because `await log.wait_for("finished do_send_ack2_msg")` above guarantees that the node has started the gossip shadow round, which happens after starting the REST API. Fixes #28385 Closes scylladb/scylladb#28388	2026-01-28 10:54:22 +02:00
Avi Kivity	8d2689d1b5	build: avoid sccache by default for Rust targets A bug[1] in sccache prevents correct distributed compilation of wasmtime. Disable it by default for now, but allow users to enable it. [1] https://github.com/mozilla/sccache/issues/2575 Closes scylladb/scylladb#28389	2026-01-28 10:36:49 +02:00
Pavel Emelyanov	2ffe5b7d80	tablet_allocator: Have its own explicit background scheduling group Currently, tablet_allocator switches to streaming scheduling group that it gets from database. It's not nice to use database as provider of configs/scheduling_groups. This patch adds a background scheduling group for tablet allocator configured via its config and sets it to streaming group in main.cc code. This will help splitting the streaming scheduling group into more elaborated groups under the maintenance supergroup: SCYLLADB-351 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28356	2026-01-28 10:34:28 +02:00
Avi Kivity	47315c63dc	treewide: include Seastar headers with angle brackets Seastar is a "system" library from our point of view, so should be included with angle brackets. Closes scylladb/scylladb#28395	2026-01-28 10:33:06 +02:00
Botond Dénes	b7dccdbe93	Merge 'test/storage: speed up out-of-space prevention tests' from Łukasz Paszkowski This PR reduces the runtime of `test_out_of_space_prevention.py` by addressing two main sources of overhead: slow “critical utilization” setup and delayed tablet load stats propagation. Combined, these changes cut the module’s total execution time from 324s to 185s. Improvements. No backup is required. Closes scylladb/scylladb#28396 * github.com:scylladb/scylladb: test/storage: speed up out-of-space prevention tests by using smaller volumes test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests	2026-01-28 10:28:20 +02:00
Marcin Maliszkiewicz	931a38de6e	service: remove unused has_schema_access It became unused after we dropped support for thrift in `ad649be1bf` Closes scylladb/scylladb#28341	2026-01-28 10:18:26 +02:00
Pavel Emelyanov	834921251b	test: Replace memory_data_source with seastar::util::as_input_stream The existing test-only implementation is a simplified version of the generic one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28339	2026-01-28 10:15:03 +02:00
Andrei Chekun	335e81cdf7	test.py: migrate nodetool to run by pytest As a next step of migration to the pytest runner, this PR moves responsibility for nodetool tests execution solely to the pytest. Closes scylladb/scylladb#28348	2026-01-28 09:49:59 +02:00
Tomasz Grabiec	8e831a7b6d	tablets: tablet_allocator.cc: Convert tabs to spaces	2026-01-28 01:32:01 +01:00
Tomasz Grabiec	9715965d0c	tablets: load_balancer: Warn about incomplete stats once for all offending nodes To reduce log spamming when all nodes are missing stats.	2026-01-28 01:32:01 +01:00
Tomasz Grabiec	ef0e9ad34a	tablets: load_balancer: Improve node stats printout Make it more concise: - reduce precision for load to 6 fractional digits - reduce precision for tablets/shard to 3 fractional digits - print "dc1/rack1" instead of "dc=dc1 rack=rack1", like in other places - print "rd=0 wr=0" instead of "stream_read=0 stream_write=0" Example: load_balancer - Node 477569c0-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=170.666667 tablets=1 shards=12 tablets/shard=0.083 state=normal cap=64424509440 stream: rd=0 wr=0 load_balancer - Node 47678711-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0 load_balancer - Node 47832560-f937-11f0-ab6f-541ce4a00601: dc10/rack10c load=0.000000 tablets=0 shards=12 tablets/shard=0.000 state=normal cap=64424509440 stream: rd=0 wr=0	2026-01-28 01:32:01 +01:00
Tomasz Grabiec	4a161bff2d	tablets: load_balancer: Warn about imbalance only when there are no more active migrations Otherwise, it may be only a temporary situation due to lack of candidates, and may be unnecessarily alerting. Also, print node stats to allow assessing how bad the situation is on the spot. Those stats can hint to a cause of imbalance, if balancing is per-DC and racks have different capacity.	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	7228bd1502	tablets: load_balancer: Extract print_node_stats()	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	615b86e88b	tablet: load_balancer: Use empty() instead of size() where applicable	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	12fdd205d6	tablets: Fix redundancy in migration_plan::empty()	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	0d090aa47b	tablets: Cache pointer to stats during plan-making Saves on lookup cost, esp. for candidate evaluation. This showed up in perf profile in the past. Also, lays the ground for splitting stats per rack.	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	f2b0146f0f	tablets: load_balancer: Print rack in addition to DC when giving context Load-balancing can now be per-rack instead of per-DC. So just printing "in DC" is confusing. If we're balancing a rack, we should print which rack it is.	2026-01-28 01:32:00 +01:00
Tomasz Grabiec	df32318f66	tablets: load_balancer: Make plan summary concise Before: load_balancer - Prepared 1 migration plans, out of which there were 1 tablet migration(s) and 0 resize decision(s) and 0 tablet repair(s) and 0 rack-list colocation(s) After: load_balancer - Prepared plan: migrations: 1 We print only stats about elements which are present.	2026-01-28 01:32:00 +01:00
Emil Maskovsky	834961c308	db/view: add missing include for coroutine::all to fix build without precompiled headers When building with `--disable-precompiled-header`, view.cc failed to compile due to missing <seastar/coroutine/all.hh> include, which provides `coroutine::all`. The problem doesn't manifest when precompiled headers are used, which is the default. So that's likely why it was missed by the CI. Adding the explicit include fixes the build. Fixes: scylladb/scylladb#28378 Ref: scylladb/scylladb#28093 No backport: This problem is only present in master. Closes scylladb/scylladb#28379	2026-01-27 18:56:56 +01:00
Calle Wilund	87aa6c8387	utils/gcp/object_storage: URL-encode object names in URL:s Fixes #28398 When used as path elements in google storage paths, the object names need to be URL encoded. Due to a.) tests not really using prefixes including non-url valid chars (i.e. / etc) and the mock server used for most testing not enforcing this particular aspect, this was missed. Modified unit tests to use prefixing for all names, so when run in real GS, any errors like this will show.	2026-01-27 18:01:21 +01:00
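The encoding issue can be illustrated in Python with the standard library. This is a sketch under assumptions: the helper name is hypothetical, and the `/storage/v1/b/<bucket>/o/<object>` layout is the Google Cloud Storage JSON API object path, which may not be exactly what the C++ client builds.

```python
from urllib.parse import quote


def object_url_path(bucket: str, object_name: str) -> str:
    """Build a request path in which the object name is one path element."""
    # safe="" makes quote() escape "/" as well, so a prefix such as
    # "backups/" stays inside the single object-name path element.
    encoded_bucket = quote(bucket, safe="")
    encoded_name = quote(object_name, safe="")
    return f"/storage/v1/b/{encoded_bucket}/o/{encoded_name}"


print(object_url_path("my-bucket", "backups/node 1/data.db"))
# prints /storage/v1/b/my-bucket/o/backups%2Fnode%201%2Fdata.db
```

Without the encoding, the slashes in the object name would be read as extra path segments.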
Calle Wilund	a896d8d5e3	utils::gcp::object_storage: Fix list object pager end condition detection Fixes #28399 When iterating with pager, the mock server and real GCS behave differently. The latter will not give a pager token for last page, only penultimate. Need to handle.	2026-01-27 17:57:17 +01:00
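A pager loop that tolerates the absence of a token on the last page can be sketched as below. This is a Python illustration with a hypothetical `fetch_page` callback and a fake in-memory server; only the `items` and `nextPageToken` field names follow the GCS list response, and the real C++ pager may be structured differently.

```python
def list_all(fetch_page):
    """Collect items from all pages; stop when no next-page token is given."""
    items = []
    token = None
    while True:
        page = fetch_page(token)
        items.extend(page.get("items", []))
        token = page.get("nextPageToken")
        if not token:
            # The last page carries no token, so its absence means "done".
            return items


# Fake server (hypothetical): the last page has no nextPageToken.
PAGES = {
    None: {"items": ["a", "b"], "nextPageToken": "p2"},
    "p2": {"items": ["c"], "nextPageToken": "p3"},
    "p3": {"items": ["d"]},
}

print(list_all(lambda token: PAGES[token]))
```

The items of each page are consumed before the token is checked, so the final page is not dropped when the token is missing.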
Pavel Emelyanov	02af292869	Merge 'Introduce TTL and retries to address resolution' from Ernest Zaslavsky In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop. Introduce RR TTL and connection failure retry to - re-resolve the RR in a timely manner - forcefully reset and re-resolve addresses - add a special case when the TTL is 0 and the record must be resolved for every request Fixes: CUSTOMER-96 Fixes: CUSTOMER-139 Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3 Closes scylladb/scylladb#27891 * github.com:scylladb/scylladb: connection_factory: includes cleanup dns_connection_factory: refine the move constructor connection_factory: retry on failure connection_factory: introduce TTL timer connection_factory: get rid of shared_future in dns_connection_factory connection_factory: extract connection logic into a member connection_factory: remove unnecessary `else` connection_factory: use all resolved DNS addresses s3_test: remove client double-close	2026-01-27 18:45:43 +03:00
Avi Kivity	59f2a3ce72	test: test_alternator_proxy_protocol: wait for the node to report itself as serving Use the new ServerUpState=SERVING mechanism to wait for the alternator ports to be up, rather than relying on the default waiting for CQL, which happens earlier and therefore opens a window where a connection to the alternator ports will fail.	2026-01-27 17:25:59 +02:00
Avi Kivity	ebac810c4e	test: cluster_manager: add ability to wait for supervisor STATUS=serving When running under systemd, ScyllaDB sends a STATUS=serving message to systemd. Co-opt this mechanism by setting up NOTIFY_SOCKET, thus making the cluster manager pretend it is systemd. Users of the cluster manager can now wait for the node to report itself up, rather than having to parse log files or retry connections.	2026-01-27 17:24:55 +02:00
Botond Dénes	7ac32097da	docs/cql/ddl.rst: Tombstones GC: explain propagation delay This parameter was not mentioned at all anywhere in the documentation. Add an explanation of this parameter: why we need it, what is the default and how it can be changed. Closes scylladb/scylladb#28132	2026-01-27 16:05:52 +01:00
Tomasz Grabiec	32b336e062	tablets: load_balancer: Move "tablet_migration_bypass" injection point to make_plan() Just a cleanup. After this, we don't have a new scope in the outmost make_plan() just for injection handling.	2026-01-27 16:01:36 +01:00
Piotr Dulikowski	29da20744a	storage_service: fix indentation after previous patch	2026-01-27 15:49:01 +01:00
Piotr Dulikowski	d28c841fa9	raft topology: generate notification about released nodes only once Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. When tablets are present, a node might still technically be a replica of some tablets after it moved to the left state. When it no longer is a replica of any tablet, it becomes "released" and storage service generates a notification about it. Hinted handoff listens to this notification and kicks off draining hints after getting it. The current implementation of the "released" notification would trigger every time raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after the verbosity of the draining-related log messages was increased in `44de563`. The verbosity increase itself makes sense as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed. Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed. Fixes: #28301 Refs: #25031	2026-01-27 15:48:05 +01:00
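The filtering described in the commit above can be pictured as a set difference between the released nodes seen before and after a topology reload. The following is only a minimal sketch of that idea, not the actual `storage_service` code; the function name `newly_released` and the use of `None` for "first load" are assumptions made for illustration.

```python
def newly_released(previous, current):
    """Return the nodes that should trigger a "released" notification.

    `previous` is the set of released nodes seen on the prior reload, or
    None when the topology state is processed for the first time
    (hypothetical convention). `current` is the set seen on this reload.
    """
    if previous is None:
        # First load: produce no notifications; startup code is expected
        # to detect already-released nodes on its own.
        return set()
    # Only nodes that were not released before are reported.
    return set(current) - set(previous)
```

Reloading the same topology state twice yields an empty set on the second call, which is the "only once" behavior the commit title asks for.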
Łukasz Paszkowski	8829098e90	reader_concurrency_semaphore: Remove cpu_concurrency's default value Commit `59faa6d` introduces a new parameter called cpu_concurrency and sets its default value to 1, which violates commit `fbb83dd`, which removed all default values from constructors except the one used by the unit tests. This patch removes the default value of the cpu_concurrency parameter and alters tests to use the test-dedicated reader_concurrency_semaphore constructor wherever possible.	2026-01-27 15:40:11 +01:00
Łukasz Paszkowski	3ef594f9eb	test/storage: speed up out-of-space prevention tests by using smaller volumes Tests in test_out_of_space_prevention.py spend a large fraction of time creating a random “blob” file to cross the 0.8 critical disk utilization threshold. With 100MB volumes this requires writing ~70–80MB of data, which is slow inside Docker/Podman-backed volumes. Most tests only use ~11MB of data, so large volumes are unnecessary. Reduce the test volume size to 20MB so the critical threshold is reached at ~16MB and the blob file is much smaller. This cuts ~5–6s per test.	2026-01-27 15:28:59 +01:00
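The arithmetic behind the volume-size change above can be checked with a few lines. This is a sketch that uses only the figures quoted in the commit message (0.8 critical utilization, 100MB and 20MB volumes, about 11MB of test data); the function name is hypothetical.

```python
CRITICAL_UTILIZATION_PERCENT = 80  # the 0.8 threshold from the commit message


def blob_size_mb(volume_mb, existing_data_mb):
    """Approximate MB that must still be written to cross the threshold."""
    threshold_mb = volume_mb * CRITICAL_UTILIZATION_PERCENT // 100
    return max(0, threshold_mb - existing_data_mb)
```

With about 11MB of test data, `blob_size_mb(100, 11)` and `blob_size_mb(20, 11)` show how much smaller the filler file becomes on a 20MB volume.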
Łukasz Paszkowski	0f86fc680c	test/storage: reduce tablet load stats refresh interval to speed up OOS prevention tests Set `--tablet-load-stats-refresh-interval-in-seconds=1` for this module’s clusters applicable to all tests. This significantly reduces runtime for the slowest cases: - test_reject_split_compaction: 75.62s -> 23.04s - test_split_compaction_not_triggered: 69.36s -> 22.98s	2026-01-27 15:28:59 +01:00
Piotr Dulikowski	10e9672852	raft topology: extract "released" nodes calculation to external function In the following commits we will need to compare the set of released nodes before and after reload of raft topology state. Moving the logic that calculates such a set to a separate function will make it easier to do.	2026-01-27 14:37:43 +01:00
Pavel Emelyanov	87920d16d8	raft_group0_client: Dont export system keyspace Now the system_keyspace reference is used internally by the client code itself, so there is no need to encourage other services to abuse it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-27 14:51:40 +03:00
Pavel Emelyanov	966119ce30	raft_group0_client: Add and use get_last_group0_state_id() There are several places that want to get last state id and for that they make raft_group0_client() export system_keyspace reference. This patch adds a helper method to provide the needed ID. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-27 14:50:25 +03:00
Pavel Emelyanov	dded1feeb7	group0_state_machine: Call ensure_group0_sched() with data_dictionary There's a validation that tables used by group0 commands are marked with the respective prop. For this, the caller code needs to provide a database reference, and it gets one from the client -> system_keyspace chain. There's a more explicit way -- get the data_dictionary via proxy. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-27 14:48:22 +03:00
Pavel Emelyanov	20a2b944df	view_building_worker: Use its own system_keyspace& reference Some code in the worker needs to mess with system_keyspace&. While there's a reference to it from the worker object, it gets one via group0 -> group0_client, which is a bit of an overkill. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-27 14:46:48 +03:00
Avi Kivity	16b56c2451	Merge 'Audit: avoid dynamic_cast on a hot path' from Marcin Maliszkiewicz This patch set eliminates special audit info guard used before for batch statements and simplifies audit::inspect function by returning quickly if audit is not needed. It saves around 300 instructions on a request's hot path. Related: https://github.com/scylladb/scylladb/issues/27941 Backport: no, not a bug Closes scylladb/scylladb#28326 * github.com:scylladb/scylladb: audit: replace batch dynamic_cast with static_cast audit: eliminate dynamic_cast to batch_statement in inspect audit: cql: remove create_no_audit_info audit: add batch bool to audit_info class	2026-01-27 12:54:16 +02:00
Pavel Emelyanov	c61d855250	hints: Provide explicit scheduling group for hint_sender Currently it grabs one from database, but it's not nice to use database as config/sched-groups provider. This PR passes the scheduling group to use for sending hints via manager which, in turn, gets one from proxy via its config (proxy config already carries configuration for hints manager). The group is initialized in main.cc code and is set to the maintenance one (nowadays it's the same as streaming group). This will help splitting the streaming scheduling group into more elaborated groups under the maintenance supergroup: SCYLLADB-351 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28358	2026-01-27 12:50:11 +02:00
Gleb Natapov	9daa109d2c	test: get rid of consistent_cluster_management usage in test consistent_cluster_management has been deprecated since scylla-5.2 and is no longer used by ScyllaDB, so it should not be used by tests either. Closes scylladb/scylladb#28340	2026-01-27 11:31:30 +01:00
Avi Kivity	fa5ed619e8	Merge 'test: perf: add perf-cql-raw benchmarking tool' from Marcin Maliszkiewicz The tool supports: - auth or no auth modes - simple read and write workloads - connection pool or connection per request modes - in-process or remote modes, remote may be useful to assess the tool's overhead or to use it as a bigger-scale benchmark - multi table mode - non superuser mode It could support in the future: - TLS mode - different workloads - shard awareness Example usage: > build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 2 --cpus 0,1 \ --developer-mode 1 --workload read --duration 5 2> /dev/null > Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0} Pre-populated 10000 partitions 97438.42 tps (269.2 allocs/op, 1.1 logallocs/op, 35.2 tasks/op, 113325 insns/op, 80572 cycles/op, 0 errors) 102460.77 tps (261.1 allocs/op, 0.0 logallocs/op, 31.7 tasks/op, 108222 insns/op, 75447 cycles/op, 0 errors) 95707.93 tps (261.0 allocs/op, 0.0 logallocs/op, 31.7 tasks/op, 108443 insns/op, 75320 cycles/op, 0 errors) 102487.87 tps (261.0 allocs/op, 0.0 logallocs/op, 31.7 tasks/op, 107956 insns/op, 74320 cycles/op, 0 errors) 100409.60 tps (261.0 allocs/op, 0.0 logallocs/op, 31.7 tasks/op, 108337 insns/op, 75262 cycles/op, 0 errors) throughput: mean= 99700.92 standard-deviation=3039.28 median= 100409.60 median-absolute-deviation=2759.85 maximum=102487.87 minimum=95707.93 instructions_per_op: mean= 109256.53 standard-deviation=2281.39 median= 108337.37 median-absolute-deviation=1034.83 maximum=113324.69 minimum=107955.97 cpu_cycles_per_op: mean= 76184.36 standard-deviation=2493.46 median= 75320.20 median-absolute-deviation=922.09 maximum=80572.19 minimum=74320.00 Backports: no, new tool Closes scylladb/scylladb#25990 * github.com:scylladb/scylladb: test: perf: reuse stream id main: test: add future and abort_source to after_init_func test: perf: add option to stress multiple tables in perf-cql-raw test: perf: add perf-cql-raw benchmarking tool test: perf: move cut_arg helper func to common code	2026-01-27 12:23:25 +02:00
Yaron Kaikov	3f10f44232	.github/workflows/backport-pr-fixes-validation: support Atlassian URL format in backport PR fixes validation Add support for matching full Atlassian JIRA URLs in the format https://scylladb.atlassian.net/browse/SCYLLADB-400 in addition to the bare JIRA key format (SCYLLADB-400). This makes the validation more flexible by accepting both formats that developers commonly use when referencing JIRA issues. Fixes: https://github.com/scylladb/scylladb/issues/28373 Closes scylladb/scylladb#28374	2026-01-27 10:59:21 +02:00
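Accepting both reference styles named in the commit above can be done with a single pattern whose URL prefix is optional. The sketch below is an illustration only and is not the pattern used by the actual workflow file; the key shape `[A-Z][A-Z0-9]*-\d+` and the helper name `extract_jira_keys` are assumptions.

```python
import re

# Optional Atlassian URL prefix followed by a JIRA key such as SCYLLADB-400.
# The key shape is an assumed generalization of the example in the commit.
JIRA_REF = re.compile(
    r"(?:https://scylladb\.atlassian\.net/browse/)?\b([A-Z][A-Z0-9]*-\d+)\b"
)


def extract_jira_keys(text):
    """Return every JIRA key referenced in `text`, in order of appearance."""
    return JIRA_REF.findall(text)
```

Both `Fixes: SCYLLADB-400` and `Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-400` reduce to the same bare key, so later validation steps only need to handle one form.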
Avi Kivity	f1c6094150	Merge 'Remove buffer_input_stream and limiting_input_stream from core code' from Pavel Emelyanov These two streams mostly play together. The former provides an input_stream to read from in-memory temporary buffers, the latter wraps it to limit the size of provided temporary buffers. Both are used to test contiguous data consumer, also the buffer_input_stream has a caller in sstables reversing reader. This PR removes the buffer_input_stream in favor of seastar memory_data_source, and moves the limiting_input_stream into test/lib. Enhancing testing code, not backporting Closes scylladb/scylladb#28352 * github.com:scylladb/scylladb: code: Move limiting data source to test/lib util: Simplify limiting_data_source API util: Remove buffer_input_stream test: Use seastar::util::temporary_buffer_data_source in data consumer test sstables: Use seastar::util::as_input_stream() in mx reader	2026-01-26 22:05:59 +02:00
Raphael S. Carvalho	0e07c6556d	test: Remove useless compaction group testing in database_test This compaction group testing is useless because the machinery for it to work was removed. This was useful in the early tablet days, where we wanted to test compaction groups directly. Today groups are stressed and tested on every tablet test. I see a ~40% reduction in run time after this patch, since database_test is one of the most (if not the most) time-consuming tests in the boost suite. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#28324	2026-01-26 19:16:27 +02:00
Marcin Maliszkiewicz	19af46d83a	audit: replace batch dynamic_cast with static_cast Since we already know it's a batch, we can use static_cast now.	2026-01-26 18:14:38 +01:00
Anna Stuchlik	edc291961b	doc: update the GPG keys Update the keys in the installation instructions (linux packages). Fixes https://github.com/scylladb/scylladb/issues/28330 Closes scylladb/scylladb#28357	2026-01-26 19:13:18 +02:00
Piotr Dulikowski	5d5e829107	Merge 'db: view: refactor usage and building of semaphore in create and drop views plus change continuation to co routine style' from Alex Dathskovsky db: view: refactor semaphore usage in create/drop view paths Refactor the construction and usage of semaphore units in the create and drop view flows. The previous semaphore handling was hard to follow (as noted while working on https://github.com/scylladb/scylladb/pull/27929), so this change restructures unit creation and movement to follow a clearer and symmetric pattern across shards. The semaphore usage model is now documented with a detailed in-code comment to make the intended behavior and invariants explicit. As part of the refactor, the control flow is modernized by replacing continuation-based logic with coroutine-style code, improving readability and maintainability. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-250 backport: not required, this is a refactor Closes scylladb/scylladb#28093 * github.com:scylladb/scylladb: db: view: extend try/catch scope in handle_create_view_local The try/catch region is extended to cover step functions and inner helpers, which may throw or abort during view creation. This change is safe because we are just swallowing more parts that may throw due to semaphore abortion or any other abortion request, and it doesn't change the logic db: view: refine create/drop coroutine signatures Refactor the create/drop coroutine interfaces to accept parameters as const references, enabling a clearer workflow and safer data flow. db: view: switch from continuations to coroutines Refactor the flow and style of create and drop view to use coroutines instead of continuations. This simplifies the logic, improves readability, and makes the code easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit. This commit also introduces a new aliasing for optional unit type for simpler and more readable functions that use this type db: view: introduce helper to acquire or reuse semaphore units Introduce a small helper that acquires semaphore units when needed or reuses units provided by the caller. This centralizes semaphore handling, simplifies the current logic, and enables refactoring the view create/drop path to a coroutine-based implementation instead of continuation-style code. db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0	2026-01-26 17:16:01 +01:00
Marcin Maliszkiewicz	55d246ce76	auth: bring back previous version of standard_role_manager::can_login Previously, we wanted to make minimal changes with regards to the new unified auth cache. However, as a result, some calls on the hot path were missed. Now we have switched the underlying find_record call to use the cache. Since caching is now at a lower level, we bring back the original code.	2026-01-26 16:04:11 +01:00
Marcin Maliszkiewicz	3483452d9f	auth: switch find_record to use cache Since every write-type auth statement takes group0_guard at the beginning, we hold read_apply_mutex and cannot have a running raft apply during our operation. Therefore, the auth cache and internal CQL reads return the same, consistent results. This makes it safe to read via cache instead of internal CQL. LDAP is an exception, but it is eventually consistent anyway.	2026-01-26 16:04:11 +01:00
Botond Dénes	755e8319ee	replica/partition_snapshot_reader: remove unused includes	2026-01-26 16:52:46 +02:00
Botond Dénes	756837c5b4	partition_snapshot_reader: remove "flat" from name The "flat" migration is long done, this distinction is no longer meaningful.	2026-01-26 16:52:46 +02:00
Botond Dénes	9d1933492a	mv partition_snapshot_reader.hh -> replica/ The partition snapshot lives in mutation/, however mutation/ is a lower level concept than a mutation reader. The next best place for this reader is the replica/ directory, where the memtable, its main user, also lives. Also move the code to the replica namespace. test/boost/mvcc_test.cc includes this header but doesn't use anything from it. Instead of updating the include path, just drop the unused include.	2026-01-26 16:52:08 +02:00
Avi Kivity	32cc593558	Merge 'tools/scylla-sstable: introduce filter command' from Botond Dénes Filter the content of sstable(s), including or excluding the specified partitions. Partitions can be provided on the command line via `--partition`, or in a file via `--partitions-file`. Produces one output sstable per input sstable -- if the filter selects at least one partition in the respective input sstable. Output sstables are placed in the path provided via `--output-dir`. Use `--merge` to filter all input sstables combined, producing one output sstable. Fixes: #13076 New functionality, no backport. Closes scylladb/scylladb#27836 * github.com:scylladb/scylladb: tools/scylla-sstable: introduce filter command tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir tools/scylla-sstable: make partition_set ordered tools/scylla-stable: remove unused boost/algorithm/string.hpp include	2026-01-26 16:32:38 +02:00
Ernest Zaslavsky	912c48a806	connection_factory: includes cleanup	2026-01-26 15:15:21 +02:00
Ernest Zaslavsky	3a31380b2c	dns_connection_factory: refine the move constructor Clean up the awkward move constructor that was declared in the header but defaulted in a separate compilation unit, improving clarity and consistency.	2026-01-26 15:15:15 +02:00
Ernest Zaslavsky	a05a4593a6	connection_factory: retry on failure If connecting to a provided address throws, renew the address list and retry once (and only once) before giving up.	2026-01-26 15:14:18 +02:00
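The retry policy described above (refresh the address list and retry once, and only once) can be sketched as follows. This is not the `connection_factory` implementation; `resolve` and `connect` are hypothetical callables supplied by the caller, and trying every resolved address within an attempt is an assumption borrowed from the related "use all resolved DNS addresses" commit.

```python
def connect_with_retry(resolve, connect):
    """Try each resolved address; if all fail, re-resolve and retry once.

    `resolve()` returns a list of addresses and `connect(address)` returns
    a connection or raises OSError. Both are assumed interfaces.
    """
    last_error = None
    for _attempt in range(2):  # the initial attempt plus exactly one retry
        for address in resolve():  # a fresh address list on every attempt
            try:
                return connect(address)
            except OSError as exc:
                last_error = exc
    raise last_error if last_error else OSError("no addresses resolved")
```

Bounding the loop at two attempts keeps a persistently broken endpoint from turning into an unbounded reconnect loop, while still recovering from a single stale DNS answer.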
Ernest Zaslavsky	6eb7dba352	connection_factory: introduce TTL timer Add a TTL-based timer to connection_factory to automatically refresh resolved host name addresses when they expire.	2026-01-26 15:11:49 +02:00
Nadav Har'El	25ff4bec2a	Merge 'Refactor streams' from Radosław Cybulski Refactor streams.cc - turn `.then` calls into coroutines. Reduces amount of clutter, lambdas and referenced variables. Fixes #28342 Closes scylladb/scylladb#28185 * github.com:scylladb/scylladb: alternator: refactor streams.cc alternator: refactor streams.cc	2026-01-26 14:31:15 +02:00
Andrei Chekun	3d3fabf5fb	test.py: change the name of the test in failed directory Generally, square brackets are not allowed in a URI, while pytest uses them in the test name to show that there were additional parameters for the same test. When such a test fails, the directory is shown correctly in Jenkins; however, an attempt to download only this directory will fail because of the square brackets in the URI. This change substitutes the square brackets with round brackets. Closes scylladb/scylladb#28226	2026-01-26 13:29:45 +01:00
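The substitution described above amounts to a two-character replacement on the parametrized test name. A minimal sketch, assuming a hypothetical helper name `uri_safe_test_name` (the real change in test.py may be structured differently):

```python
def uri_safe_test_name(name):
    """Replace pytest's parameter brackets with round brackets."""
    return name.replace("[", "(").replace("]", ")")
```

For example, a parametrized name such as `test_foo[release-1]` becomes `test_foo(release-1)`, which can be used in a download URL without percent-encoding the brackets.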
Łukasz Paszkowski	f06094aa95	topology_coordinator: add write_both_read_old_fallback_cleanup state Yet another barrier-failure scenario exists in the `write_both_read_new` state. When the barrier fails, the tablet is expected to transition to `cleanup_target`, but because barrier execution is asynchronous, the cleanup transition can be skipped entirely and the tablet may continue forward instead. Both `write_both_read_new` and `cleanup_target` modify read and write selectors. In this situation, a barrier is required, and transitioning directly between these states without one is unsafe. Introduce an intermediate `write_both_read_old_fallback_cleanup` state that modifies only a read selector and can be entered without a barrier (there is no need to wait for all nodes to start using the "new" read selector). From there, the tablet can proceed to `cleanup_target`, where the required barriers are enforced. This also avoids changing both selectors in a single step. A direct transition from `write_both_read_new` to `cleanup_target` updates both selectors at once, which can leave coordinators using the old selector for writes and the new selector for reads, causing reads to miss preceding writes. By routing through the fallback state, selectors are updated in order—read first, then write—preserving read-after-write correctness.	2026-01-26 13:14:37 +01:00
Łukasz Paszkowski	0a058e53c7	topology_coordinator: allow cleanup_target transition from streaming/rebuild_repair without barrier In both `streaming` and `rebuild_repair` stages, the read/write selectors are unchanged compared to the preceding stage. Because entry into these stages is already fenced by a barrier from `write_both_read_old`, and the `cleanup_target` itself requires barrier, rolling back directly to `cleanup_target` is safe without an additional barrier.	2026-01-26 13:14:36 +01:00
Łukasz Paszkowski	b12f46babd	topology_coordinator: allow cleanup_target transition without barrier after failure in write_both_read_old A similar barrier-failure scenario exists in the `write_both_read_old` state. If the barrier fails, the tablet is expected to transition to `cleanup_target`, but due to the barrier being evaluated asynchronously the cleanup path can be skipped and the tablet may continue forward instead. In `write_both_read_old`, we already switched group0 writes from old to both, while the barrier may not have executed yet. As a result, nodes can be at most one step apart (some still use old, others use both). Transitioning to `cleanup_target` reverts the write selector back to old. Nodes still differ by at most one step (old vs both), so the transition is safe without an additional barrier. This prevents cleanup from being skipped while keeping selector semantics and barrier guarantees intact.	2026-01-26 13:14:36 +01:00
Łukasz Paszkowski	7c331b7319	topology_coordinator: allow cleanup_target transition without barrier after failure in allow_write_both_read_old When a tablet is in `allow_write_both_read_old`, progressing normally requires a barrier. If this first barrier fails, the tablet is supposed to transition to `cleanup_target` on the next iteration: ``` case locator::tablet_transition_stage::allow_write_both_read_old: if (action_failed(tablet_state.barriers[trinfo.stage])) { if (check_excluded_replicas()) { transition_to_with_barrier(locator::tablet_transition_stage::cleanup_target); break; } } if (do_barrier()) { ... } break; ``` That transition itself requires a barrier, which is executed asynchronously. Because the barrier runs in the background, the cleanup logic is skipped in that iteration. On the following iteration, `action_failed(barriers[stage])` no longer returns true, since the node that caused the original barrier failure has been excluded. The barrier is therefore observed as successful, and the tablet incorrectly proceeds to the next stage instead of entering `cleanup_target`. Since `cleanup_target` does not modify read/write selectors, the transition can be done safely without a barrier, simplifying the state machine and ensuring cleanup is not skipped. Without it, the tablet would still eventually reach `cleanup_target` via `write_both_read_old` and `streaming`, but that path is unnecessary.	2026-01-26 13:14:36 +01:00
Marcin Maliszkiewicz	440590f823	auth: make find_record and callers standard_role_manager members It will be useful for the following commit. Those methods are class-specific anyway.	2026-01-26 13:10:11 +01:00
Anna Stuchlik	84281f900f	doc: remove the troubleshooting section on upgrades from OSS This commit removes a document originally created to troubleshoot upgrades from Open Source to Enterprise. Since we no longer support Open Source, this document is now redundant. Fixes https://github.com/scylladb/scylladb/issues/28246 Closes scylladb/scylladb#28248	2026-01-26 14:02:53 +02:00
Anna Stuchlik	c25b770342	doc: add the version name to the Install pages This is a follow-up to https://github.com/scylladb/scylladb/pull/28022 It adds the version name to more install pages. Closes scylladb/scylladb#28289	2026-01-26 13:11:14 +02:00
Alex	954d18903e	db: view: extend try/catch scope in handle_create_view_local The try/catch region is extended to cover step functions and inner helpers, which may throw or abort during view creation. This change is safe because we are just swallowing more parts that may throw due to semaphore abortion or any other abortion request, and it doesn't change the logic	2026-01-26 13:10:37 +02:00
Alex	2c3ab8490c	db: view: refine create/drop coroutine signatures Refactor the create/drop coroutine interfaces to accept parameters as const references, enabling a clearer workflow and safer data flow.	2026-01-26 13:10:34 +02:00
Alex	114f88cb9b	db: view: switch from continuations to coroutines Refactor the flow and style of create and drop view to use coroutines instead of continuations. This simplifies the logic, improves readability, and makes the code easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit. This commit also introduces a new aliasing for optional unit type for simpler and more readable functions that use this type	2026-01-26 13:08:24 +02:00
Alex	87c1c6f40f	db: view: introduce helper to acquire or reuse semaphore units Introduce a small helper that acquires semaphore units when needed or reuses units provided by the caller. This centralizes semaphore handling, simplifies the current logic, and enables refactoring the view create/drop path to a coroutine-based implementation instead of continuation-style code.	2026-01-26 13:03:26 +02:00
Avi Kivity	ec70cea2a1	test/cqlpy: restore LWT tests marked XFAIL for tablets Commit `0156e97560` ("storage_proxy: cas: reject for tablets-enabled tables") marked a bunch of LWT tests as XFAIL with tablets enabled, pending resolution of #18066. But since that event is now in the past, we undo the XFAIL markings (or in some cases, use an any-keyspace fixture instead of a vnodes-only fixture). Ref #18066. Closes scylladb/scylladb#28336	2026-01-26 12:27:19 +02:00
Pavel Emelyanov	77435206b9	code: Move limiting data source to test/lib Only two tests use it now -- the limit-data-source-test itself and a test that validates continuous_data_consumer template. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-26 12:49:42 +03:00
Pavel Emelyanov	111b376d0d	util: Simplify limiting_data_source API The source maintains "limit generator" -- a function that returns the maximum size of bytes to return from the next buffer. Currently all callers just return constant numbers from it. Passing a function that returns non-constant one can, probably, be used for a fuzzy test, but even the limiting-data-source-test itself doesn't do it, so what's the point... Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-26 12:46:37 +03:00
Pavel Emelyanov	e297ed0b88	util: Remove buffer_input_stream It's now unused. All the users had been patched to use seastar memory data source implementation. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-26 12:46:10 +03:00
Pavel Emelyanov	4639681907	test: Use seastar::util::temporary_buffer_data_source in data consumer test The test creates buffer_data_source_impl and wraps it with limiting data source. The former data_source duplicates the functionality of the existing seastar temporary_buffer_data_source. This patch makes the test code use seastar facility. The buffer_data_source_impl will be removed soon. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-26 12:44:33 +03:00
Pavel Emelyanov	2399bb8995	sstables: Use seastar::util::as_input_stream() in mx reader Right now the code uses make_buffer_input_stream() helper that creates an input stream with buffer_data_source_impl inside which, in turn, provides the data_source_impl API over a single temporary_buffer. Seastar has the very same facility, it's better to use it. Eventually the buffer_data_source_impl will be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-26 12:43:17 +03:00
Marcin Maliszkiewicz	6f32290756	audit: eliminate dynamic_cast to batch_statement in inspect This is costly and not needed; we can use a simple bool flag for such a check. It burns around 300 CPU instructions on a request's hot path.	2026-01-26 10:18:38 +01:00
Marcin Maliszkiewicz	a93ad3838f	audit: cql: remove create_no_audit_info We don't need a special guard value, it's only being filled for batch statements for which we can simply ignore the value. Not having special value allows us to return fast when audit is not enabled.	2026-01-26 10:18:38 +01:00
Marcin Maliszkiewicz	02d59a0529	audit: add batch bool to audit_info class In the following commit we'll use this field instead of costly dynamic_cast when emitting audit log.	2026-01-26 10:18:38 +01:00
Botond Dénes	57b2cd2c16	docs: inter-link incremental-repair and repair documents The user can now discover the general explanation of repair when reading about incremental repair, useful if they don't know what repair is. The user can now discover incremental repair while reading the generic repair procedure document.	2026-01-26 09:55:54 +02:00
Botond Dénes	a84b1b8b78	docs: incremental-repair: fix curl example Currently it is regular text, make it a code block so it is easier to read and copy+paste.	2026-01-26 09:55:54 +02:00
Pavel Emelyanov	1796997ace	Merge 'restore: Enable direct download of fully contained SSTables' from Ernest Zaslavsky This PR refactors the streaming subsystem to support direct download of fully contained sstables. Instead of streaming these files, they are downloaded and attached directly to their corresponding tables. This approach reduces overhead, simplifies logic, and improves efficiency. Expected node scope restore performance improvement: ~4 times faster in best case scenario when all sstables are fully contained. 1. Add storage options field to sstable Introduce a data member to store storage options, enabling distinction between local and object storage types. 2. Add method to create component source Extend the storage interface with a public method to create a data_source for any sstable component. 3. Inline streamer instance creation Remove make_sstable_streamer and inline its usage to allow different sets of arguments at call sites. 4. Skip streaming empty sstable sets Avoid unnecessary streaming calls when the sstable set is empty. 5. Enable direct download of contained sstables Replace streaming of fully contained sstables with direct download, attaching them to their corresponding table. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-200 Refs: https://github.com/scylladb/scylladb/issues/23908 No need to backport as this code targets 2026.2 release (for tablet-aware restore) Closes scylladb/scylladb#26834 * github.com:scylladb/scylladb: tests: reuse test_backup_broken_streaming streaming: enable direct download of contained sstables storage: add method to create component source streaming: keep sharded database reference on tablet_sstable_streamer streaming: skip streaming empty sstable sets streaming: inline streamer instance creation tests: fix incorrect backup/restore test flow	2026-01-26 10:22:34 +03:00
Ernest Zaslavsky	cb2aa85cf5	aws_error: handle all restartable nested exception types Previously we only inspected std::system_error inside std::nested_exception to support a specific TLS-related failure mode. However, nested exceptions may contain any type, including other restartable (retryable) errors. This change unwraps one nested exception per iteration and re-applies all known handlers until a match is found or the chain is exhausted. Closes scylladb/scylladb#28240	2026-01-26 10:19:57 +03:00
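The unwrapping loop described above, which peels one nested exception per iteration and re-applies every known handler, has a close analogue in Python's exception chaining. The sketch below uses `__cause__` as a stand-in for `std::nested_exception`; it is not the `aws_error` code, and the handler predicates are hypothetical.

```python
def is_retryable(exc, handlers):
    """Walk the exception chain, applying all handlers at each level.

    `handlers` is a list of predicates; each returns True when it
    recognizes the exception as retryable (assumed interface).
    """
    current = exc
    while current is not None:
        if any(handler(current) for handler in handlers):
            return True
        current = current.__cause__  # unwrap one nesting level per iteration
    return False  # chain exhausted without a match
```

Applying the full handler list at every level is what lets a retryable error be found even when it is wrapped in an unrelated outer exception.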
Avi Kivity	55422593a7	Merge 'test/lib: Fix bugs in `boost_test_tree_lister.cc`' from Dawid Mędrek In this PR, we fix two bugs present in `boost_test_tree_lister` that affected the output of `--list_json_content` added in scylladb/scylladb@afde5f668a: * The labels test units use were duplicated in the output. * If a test suite or a test file didn't contain any tests, it wasn't listed in the output. Refs scylladb/scylladb#25415 Backport: not needed. The code hasn't been used anywhere yet. Closes scylladb/scylladb#28255 * github.com:scylladb/scylladb: test/lib/boost_test_tree_lister.cc: Record empty test suites test/lib/boost_test_tree_lister.cc: Deduplicate labels	2026-01-25 21:34:32 +02:00
Andrei Chekun	cc5ac75d73	test.py: remove deprecated skip_mode decorator Finish the deprecation of the skip_mode function in favor of pytest.mark.skip_mode. This PR only cleans up and migrates the leftover tests that still use the old skip_mode function. Closes scylladb/scylladb#28299	2026-01-25 18:17:27 +02:00
Ernest Zaslavsky	66a33619da	connection_factory: get rid of shared_future in dns_connection_factory Move state management from dns_connection_factory into state class itself to encapsulate its internal state and stop managing it from the `dns_connection_factory`	2026-01-25 16:12:29 +02:00
Ernest Zaslavsky	5b3e513cba	connection_factory: extract connection logic into a member extract connection logic into a private member function to make it reusable	2026-01-25 15:42:48 +02:00
Ernest Zaslavsky	ce0c7b5896	connection_factory: remove unnecessary `else`	2026-01-25 15:42:48 +02:00
Ernest Zaslavsky	359d0b7a3e	connection_factory: use all resolved DNS addresses Improve dns_connection_factory to iterate over all resolved addresses instead of using only the first one.	2026-01-25 15:42:48 +02:00
Ernest Zaslavsky	bd9d5ad75b	s3_test: remove client double-close `test_chunked_download_data_source_with_delays` was calling `close()` on a client twice, remove the unnecessary call	2026-01-25 15:42:48 +02:00
Alex	1aadedc596	db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0	2026-01-25 14:29:09 +02:00
Ernest Zaslavsky	70f5bc1a50	tests: reuse test_backup_broken_streaming reuse the `test_backup_broken_streaming` test to check for direct sstable download	2026-01-25 13:27:44 +02:00
Ernest Zaslavsky	13fb605edb	streaming: enable direct download of contained sstables Instead of streaming fully contained sstables, download them directly and attach them to their corresponding table. This simplifies the process and avoids unnecessary streaming overhead.	2026-01-25 13:27:44 +02:00
Yaron Kaikov	3dea15bc9d	Update ScyllaDB version to: 2026.2.0-dev	2026-01-25 11:09:17 +02:00
Botond Dénes	edda66886e	Merge 'Replace sstable_stream_source_impl's internal sink and source with seastar utilities' from Pavel Emelyanov The class in question has internal implementations of in-memory data_sink_impl and data_source_impl. In seastar there's a generic implementation of the same facilities. From the "code re-use" perspective it makes sense to use both. TODO-s in Scylla code support that. Using newer seastar facilities, not backporting. Closes scylladb/scylladb#28321 * github.com:scylladb/scylladb: sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl	2026-01-23 16:55:18 +02:00
Pavel Emelyanov	3e09d3cc97	test: Keep test_gossiper_live_endpoints checks together There are two checks for live endpoints performed in test_gossiper.py, but one of those sits in test_gossiper_unreachable_endpoints somehow. This patch moves live endpoints check into live endpoints test. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28224	2026-01-23 16:53:48 +02:00
Piotr Dulikowski	3ec4f67407	Merge 'vector_index: Implement rescoring' from Szymon Malewski This series implements the rescoring algorithm. Index options allowing to enable this functionality were introduced in earlier PR https://github.com/scylladb/scylladb/pull/28165. When Vector Index has enabled quantization, Vector Store uses reduced vector representation to save memory, but it may degrade correctness of ANN queries. For quantized index we can enable rescoring algorithm, which recalculates similarity score from full vector representation stored in Scylla and reorders the returned result set. It works also with oversampling - we fetch more candidates from Vector Store, rescore them at Scylla and return only requested number of results. Example: Creating a Vector Index with Rescoring ```sql -- Create a table with a vector column CREATE TABLE ks.products ( id int PRIMARY KEY, embedding vector<float, 128> ); -- Create a vector index with rescoring enabled CREATE INDEX products_embedding_idx ON ks.products (embedding) USING 'vector_index' WITH OPTIONS = { 'similarity_function': 'cosine', 'quantization': 'i8', 'oversampling': '2.0', 'rescoring': 'true' }; ``` 1. Quantization (`i8`) compresses vectors in the index, reducing memory usage but introducing precision loss in distance calculations 2. Oversampling (`2.0`) retrieves 2× more candidates than requested from the vector store (e.g., `LIMIT 10` fetches 20 candidates) 3. Rescoring (`true`) recalculates similarity scores using full-precision (`f32`) vectors from the base table and re-ranks results Query example: ```sql -- Find 10 most similar products SELECT id, similarity_cosine(embedding, [0.1, 0.2, ...]) AS score FROM ks.products ORDER BY embedding ANN OF [0.1, 0.2, ...] LIMIT 10; ``` With rescoring enabled, the query: 1. Fetches 20 candidates from the quantized index (due to oversampling=2.0) 2. Reads full-precision embeddings from the base table 3. Recalculates similarity scores with full precision 4.
Re-ranks and returns the top 10 results In this implementation we use CQL similarity function implementation to calculate new score values and use them in post query ordering. We add that column manually to selection, but it has to be removed from the final response. Follow-up https://github.com/scylladb/scylladb/pull/28165 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83 New feature - doesn't need backport. Closes scylladb/scylladb#27769 * github.com:scylladb/scylladb: vector_index: rescoring: Fetch oversampled rows vector_index: rescoring: Sort by similarity column select_statement: Modify `needs_post_query_ordering` condition vector_index: rescoring: Add hidden similarity score column vector_index: Refactor extracting ANN query information	2026-01-23 15:20:10 +01:00
Patryk Jędrzejczak	a41d7a9240	Merge 'test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection' from Petr Gusev The storage_proxy::stop() is not called by main (it is commented out due to #293), so the corresponding message injection is never hit. When the test releases paxos_state_learn_after_mutate, shutdown may already be in progress or even completed by the time we try to trigger the storage_proxy::stop injection, which makes the test flaky. Fix this by completely removing the storage_proxy::stop injection. The injection is not required for test correctness. Shutdown must wait for the background LWT learn to finish, which is released via the paxos_state_learn_after_mutate injection. The shutdown process blocks on in-flight HTTP requests through seastar::httpd::http_server::stop and its _task_gate, so the HTTP request that releases paxos_state_learn_after_mutate is guaranteed to complete before the node is shut down. Fixes scylladb/scylladb#28260 backport: 2025.4, the `test_lwt_shutdown` test was introduced in this version Closes scylladb/scylladb#28315 * https://github.com/scylladb/scylladb: storage_proxy: drop stop() method test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection	2026-01-23 15:18:17 +01:00
Avi Kivity	30d6f3b8e0	test: test_proxy_protocol: bump timeout It was observed twice that the test times out in debug mode. Fix by increasing the timeout. The test never expects a timeout, so increasing it won't increase the test duration. Fixes #28028 Closes scylladb/scylladb#28272	2026-01-23 15:37:00 +02:00
Łukasz Paszkowski	09fde82a33	test/scylla_gdb: fix coro_task request usage, rename duplicate test - Pass pytest request fixture into coro_task (used for scylla_tmp_dir and core dump path) - Rename duplicate `test_sstable_summary` that runs sstable-index-cache to `test_sstable_index_cache` so both tests are collected Refs https://github.com/scylladb/scylladb/issues/22501 Closes scylladb/scylladb#28286	2026-01-23 15:25:58 +02:00
Pavel Emelyanov	4a307d931a	sstable: Replace buffer_data_sink_impl with seastar::util::basic_memory_data_sink The former accumulates sstable writer writes into a vector of temporary buffers. In seastar there's a generic memory data sink that provides a sink to accumulate stream of bytes into any container. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-23 14:22:22 +03:00
Pavel Emelyanov	97b1340a68	sstables: Use seastar::util::as_input_stream() and remove buffer_data_source_impl The latter is used to wrap vector of buffers into an input_stream. Seastar already provides the very same functionality with the convenience as_input_stream() helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-23 14:21:14 +03:00
Piotr Dulikowski	fe9237fdc9	Merge 'alternator: don't require rf_rack flag for indexes, validate instead' from Michael Litvak In `8df61f6d99` we changed the requirements for creating materialized views and MV-based indexes - instead of requiring the rf_rack_valid_keyspaces flag to be set, we now require the keyspace to be RF-rack-valid at the time of creation, and it is enforced to remain RF-rack-valid while the MV exists. This validation is done in the cql create view/index statements. The same should be done also for alternator - when creating a table with GSI or LSI, or when adding a GSI to an existing table, previously we required the flag rf_rack_valid_keyspaces to be set. Now we change it to instead check if the keyspace is RF-rack-valid, and if not the operation fails with an appropriate error. Fixes https://github.com/scylladb/scylladb/issues/28214 backport to 2025.4 to add RF-rack-valid enforcements in alternator Closes scylladb/scylladb#28154 * github.com:scylladb/scylladb: locator: document the exception type of assert_rf_rack_valid_keyspace alternator: don't require rf_rack flag for indexes, validate instead	2026-01-23 11:49:02 +01:00
Botond Dénes	c4c2f87be7	Merge 'db: fail reads and writes with local consistency level to a DC with RF=0' from null When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception. Scylla allowed such read operations and failed write operations with a cryptic "broken promise" error. This occurred because the initial availability check passed (quorum of 0 requires 0 replicas), but execution failed later when no replicas existed to process the mutation. This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that throws before attempting operation execution. The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be upgraded. This testcase was asserting somewhat similar scenario, but wasn't taking into account the whole matrix of combinations: - scenarios: successful vs unsuccessful operation outcome - local consistency levels: LOCAL_QUORUM & LOCAL_ONE - operations: SELECT (read) & INSERT (write) and so it's been extended to cover both the pre-existing and the current issues and the whole matrix of combinations. Fixes: scylladb/scylladb#27893 A minor change, no need to backport. Closes scylladb/scylladb#27894 * github.com:scylladb/scylladb: db: fail reads and writes with local consistency level to a DC with RF=0 db: consistency_level: split `local_quorum_for()` db: consistency_level: fix nrs -> nts abbreviation	2026-01-23 12:36:20 +02:00
Petr Gusev	c45244b235	storage_proxy: drop stop() method It's not called by main.cc and can be confusing.	2026-01-23 11:22:03 +01:00
Petr Gusev	f5ed3e9fea	test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection storage_proxy::stop() is not called by main (it is commented out due to #293), so the corresponding message injection is never hit. When the test releases paxos_state_learn_after_mutate, shutdown may already be in progress or even completed by the time we try to trigger the storage_proxy::stop injection, which makes the test flaky. Fix this by completely removing the storage_proxy::stop injection. The injection is not required for test correctness. Shutdown must wait for the background LWT learn to finish, which is released via the paxos_state_learn_after_mutate injection. The shutdown process blocks on in-flight api HTTP requests through seastar::httpd::http_server::stop and its _task_gate, so the shutdown will not prevent the HTTP request that released the paxos_state_learn_after_mutate from completing successfully. Fixes scylladb/scylladb#28260	2026-01-23 11:20:36 +01:00
Botond Dénes	35c9a00275	Merge 'test.py: pass correctly extra cmd line arguments' from Andrei Chekun During the rewrite, --extra-scylla-cmdline-options was missed and was not passed to the tests that use pytest. As a result, there was no way to pass these parameters to Scylla from the command line; the tests themselves were not affected because they used the parameters from the yaml file. This PR fixes this issue so it will be easier to modify the Scylla start parameters without modifying code. No backport needed, only framework enhancement. Closes scylladb/scylladb#28156 * github.com:scylladb/scylladb: test.py: do not crash when there is no boost log test.py: pass correctly extra cmd line arguments	2026-01-23 11:26:01 +02:00
Andrzej Jackowski	c493a66668	test: check cql_requests_count instead of tasks_processed in SL Before this change, the test function `_verify_tasks_processed_metrics` verified that after service level reconfiguration, a given number of `scylla_scheduler_tasks_processed` were processed by a given scheduling group. Moreover, the check verified that another scheduling group didn't process a high number of requests. The second check was vulnerable to flakiness, because sometimes additional load caused extensive work in the second scheduling group (e.g. password hashing in `sl:driver` due to new connections being created). To avoid test failures, this commit changes which metric is verified: instead of `scylla_scheduler_tasks_processed`, the metric `scylla_transport_cql_requests_count` is checked. This prevents similar problems, because there is no reason for a high number of requests to be processed by the second scheduling group. Moreover, it allows decreasing the number of requests that are sent for verification, and thus speeds up the test. Fixes: scylladb/scylladb#27715 Closes scylladb/scylladb#28318	2026-01-23 10:19:29 +01:00
Patryk Jędrzejczak	4e984139b2	Merge 'strongly consistent tables: basic implementation' from Petr Gusev In this PR we add a basic implementation of the strongly-consistent tables: * generate raft group id when a strongly-consistent table is created * persist it into system.tables table * start raft groups on replicas when a strongly-consistent tablet_map reaches them * add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods * the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()` * strongly-consistent versions of the `select_statement` and `modification_statement` are added * a basic `test_strong_consistency.py/test_basic_write_read` is added to check that we can write and read data in a strongly consistent fashion. Limitations: * for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups. * No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately. * No tablet balancing -- migration/split/merges require separate complicated code. The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set. Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency).
Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290) One can check the strongly consistent writes and reads locally via cqlsh: scylla.yaml: ``` experimental_features: - strongly-consistent-tables ``` cqlsh: ``` CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local'; CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int); INSERT INTO my_ks.test (pk, c) VALUES (10, 20); SELECT * FROM my_ks.test WHERE pk = 10; ``` Fixes SCYLLADB-34 Fixes SCYLLADB-32 Fixes SCYLLADB-31 Fixes SCYLLADB-33 Fixes SCYLLADB-56 backport: no need Closes scylladb/scylladb#27614 * https://github.com/scylladb/scylladb: test_encryption: capture stderr test/cluster: add test_strong_consistency.py raft_group_registry: disable metrics for non-0 groups strong consistency: implement select_statement::do_execute() cql: add select_statement.cc strong consistency: implement coordinator::query() cql: add modification_statement cql: add statement_helpers strong consistency: implement coordinator::mutate() raft.hh: make server::wait_for_leader() public strong_consistency: add coordinator modification_statement: make get_timeout public strong_consistency: add groups_manager strong_consistency: add state_machine and raft_command table: add get_max_timestamp_for_tablet tablets: generate raft group_id-s for new table tablet_replication_strategy: add consistency field tablets: add raft_group_id modification_statement: remove virtual where it's not needed modification_statement: inline prepare_statement() system_keyspace: disable tablet_balancing for strongly_consistent_tables cql: rename strongly_consistent statements to broadcast statements	2026-01-23 09:52:33 +01:00
Anna Stuchlik	c681b3363d	doc: remove an outdated KB This page was created for very outdated versions of ScyllaDB (5.1 for ScyllaDB Open Source and 2022.2 for ScyllaDB Enterprise) to ensure a smooth upgrade to those versions. We no longer need that document. Fixes https://github.com/scylladb/scylladb/issues/28264 Closes scylladb/scylladb#28266	2026-01-22 18:27:14 +01:00
Szymon Wasik	927aebef37	Add vector search documentation links to CQL docs This patch adds links to the Vector Search documentation that is hosted together with Scylla Cloud docs to the CQL documentation. It also makes the note about supported capabilities consistent and removes the experimental label, as the feature is now GA. Fixes: SCYLLADB-371 Closes scylladb/scylladb#28312	2026-01-22 16:46:38 +01:00
Botond Dénes	f375288b58	tools/scylla-sstable: introduce filter command Filter the content of sstable(s), including or excluding the specified partitions. Partitions can be provided on the command line via `--partition`, or in a file via `--partitions-file`. Produces one output sstable per input sstable -- if the filter selects at least one partition in the respective input sstable. Output sstables are placed in the path provided via `--output-dir`. Use `--merge` to filter all input sstables combined, producing one output sstable.	2026-01-22 17:20:07 +02:00
Michael Litvak	d5009882c6	locator: document the exception type of assert_rf_rack_valid_keyspace The function assert_rf_rack_valid_keyspace uses the exception type std::invalid_argument when the RF-rack validation fails. Document it and change all callers to catch this specific exception type when checking for RF-rack validation failures, so that other exception types can be propagated properly.	2026-01-22 16:11:35 +01:00
Michael Litvak	1f7a65904e	alternator: don't require rf_rack flag for indexes, validate instead In `8df61f6d99` we changed the requirements for creating materialized views and MV-based indexes - instead of requiring the rf_rack_valid_keyspaces flag to be set, we now require the keyspace to be RF-rack-valid at the time of creation, and it is enforced to remain RF-rack-valid while the MV exists. This validation is done in the cql create view/index statements. The same should be done also for alternator - when creating a table with GSI or LSI, or when adding a GSI to an existing table, previously we required the flag rf_rack_valid_keyspaces to be set. Now we change it to instead check if the keyspace is RF-rack-valid, and if not the operation fails with an appropriate error.	2026-01-22 16:11:35 +01:00
Szymon Malewski	29d090845a	vector_index: rescoring: Fetch oversampled rows So far with oversampling the extended set of keys was returned from VS, but the query to the base table was still limited by the query `limit`. Now for rescoring we want to fetch rows for all the keys returned from VS. However, later we need to restore the command limit, to trim result_set accordingly. For non-rescoring scenarios we directly trim the key set returned from VS if it happens to exceed the query limit. With this change rescoring validation tests (except `no_nulls_in_rescored_results`) pass fully. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-83	2026-01-22 15:38:44 +01:00
Szymon Malewski	0bc95bcf87	vector_index: rescoring: Sort by similarity column This patch implements the second part of rescoring - ordering results by the similarity column added in an earlier patch. For this purpose in this patch we define `_ordering_comparator`, which enables pre-existing post-query ordering functionality. However, no additional tests pass yet, as they include oversampling, which will be the subject of following patches.	2026-01-22 15:38:44 +01:00
Szymon Malewski	57e7a4fa4f	select_statement: Modify `needs_post_query_ordering` condition Our plan for rescoring is to use the existing post-query ordering mechanism to sort (and trim) result_set by similarity column. For general SELECT case this ordering is permitted only for queries with IN on the partition key and an ORDER BY, which is checked in `needs_post_query_ordering`. Recently this check was overridden for ANN queries in https://github.com/scylladb/scylladb/pull/28109 to enable IN queries handled by VS without excessive post-processing. In this patch we revert that change - ANN case will be handled by general check. However we change the condition - we will enable post processing anytime `_ordering_comparator` is set. In current implementation `_ordering_comparator` is created only in `select_statement::prepare` with `get_ordering_comparator`, only for the same conditions as were checked in `needs_post_query_ordering`, so this change should be transparent for general SELECT. For ANN query it is also not set (yet), so it will not influence ANN filtering, but we confirm that this functionality still works by adding filtering test: `test/vector_search/filter_test.cc::vector_store_client_test_filtering_ann_cql`. Rescoring ordering for ANN queries will be enabled when we add `_ordering_comparator` in following patch.	2026-01-22 15:38:44 +01:00
Szymon Malewski	c89957b725	vector_index: rescoring: Add hidden similarity score column Rescoring consists of recalculating similarity score and reordering results based on it. In this patch we add calculation of similarity score as a hidden (non-serialized) column and following patch will add reordering. Normal ordering uses `add_column_for_post_processing`, however this works only for regular columns, not functions. So we create it together with user requested columns (this also forces the use of `selection_with_processing`) and hide the column later. This also requires special handling for 'SELECT *' case - we need to manually add all columns before adding similarity column. In case user already asks for similarity score in the SELECT clause, this value will be calculated twice - it should be optimized in future patches.	2026-01-22 15:38:40 +01:00
Anna Stuchlik	0aa881f190	doc: add the info about Alternator ports to the Admin Guide Fixes https://github.com/scylladb/scylladb/issues/23706 Closes scylladb/scylladb#27724	2026-01-22 16:10:58 +03:00
Israel Fruchter	6b3ce5fdcc	dist: scylla_coredump_setup: force unmount /var/lib/systemd/coredump before setup When setting up coredump handling, if there are old mounts in a deleted state (e.g. from an older installation), systemd might fail to activate the new `.mount` unit properly because it assumes the path is already mounted. Explicitly unmount `/var/lib/systemd/coredump` before proceeding with the setup to ensure a clean state. Fix: scylladb/scylla-enterprise#5692 Closes scylladb/scylladb#28300	2026-01-22 14:35:26 +02:00
Pavel Emelyanov	cb6ee05391	Merge 'Extend snapshot manifest.json with tablet-aware metadata' from Benny Halevy This series extends the json manifest file we create when taking snapshots. It adds the following metadata: - manifest version and scope - snapshot name - created_at and expires_at timestamps (#24061) - node metadata (host_id, dc, rack) - keyspace and table metadata - tablet_count (#26352) - per-sstable metadata (#26352) Fixes [SCYLLADB-189](https://scylladb.atlassian.net/browse/SCYLLADB-189) Fixes [SCYLLADB-195](https://scylladb.atlassian.net/browse/SCYLLADB-195) Fixes [SCYLLADB-196](https://scylladb.atlassian.net/browse/SCYLLADB-196) * Enhancement, no backport needed [SCYLLADB-189]: https://scylladb.atlassian.net/browse/SCYLLADB-189?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [SCYLLADB-195]: https://scylladb.atlassian.net/browse/SCYLLADB-195?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ [SCYLLADB-196]: https://scylladb.atlassian.net/browse/SCYLLADB-196?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#27945 * github.com:scylladb/scylladb: snapshot: keep per-sstable metadata in manifest.json snapshot: add table info and tablet_count to manifest.json snapshot: add basic support for snapshot ttl in manifest.json table: snapshot_on_all_shards: take snapshot_options db: snapshot_ctl: move skip_flush to struct snapshot_options snapshot: add snapshot name in manifest.json test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces snapshot: add node info to manifest.json snapshot: add manifest info to manifest.json test: database_test: snapshot_works: add validate_manifest	2026-01-22 15:19:11 +03:00
Patryk Jędrzejczak	67045b5f17	Merge 'raft_topology, tablets: Drain tablets in parallel with other topology operations' from Tomasz Grabiec Allows other topology operations to execute while tablets are being drained on decommission. In particular, bootstrap on scale-out. This is important for elasticity. Allows multiple decommission/removenode to happen in parallel, which is important for efficiency. Flow of decommission/removenode request: 1) pending and paused, has tablet replicas on target node. Tablet scheduler will start draining tablets. 2) No tablets on target node, request is pending but not paused 3) Request is scheduled, node is in transition 4) Request is done Nodes are considered draining as soon as there is a leave or remove request on them. If there are tablet replicas present on the target node, the request is in a paused state and will not be picked by topology coordinator. The paused state is computed from topology state automatically on reload. When request is not paused, its execution starts in write_both_read_old state. The old tablet_draining state is not entered (it's deprecated now). Tablet load balancing will yield the state machine as soon as some request is no longer paused and ready to be scheduled, based on standard preemption mechanics. 
Fixes #21452 Closes scylladb/scylladb#24129 * https://github.com/scylladb/scylladb: docs: Document parallel decommission and removenode and relevant task API test: Add tests for parallel decommission/removenode test: util: Introduce ensure_group0_leader_on() test: tablets: Check that there are no migrations scheduled on draining nodes test: lib: topology_builder: Introduce add_draining_request() topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization tablets: topology_coordinator: Refactor to propagate reason for migration rollback tablet_allocator: Skip co-location on draining nodes node_ops: task_manager_module: Populate entity field also for active requests tasks: node_ops: Put node id in the entity field tasks, node_ops: Unify setting of task_stats in get_status() and get_stats() topology: Protect against empty cancelation reason tasks, topology: Make pending node operations abortable doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged raft_topology, tablets: Drain tablets in parallel with other topology operations virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node locator: topology: Add "draining" flag to a node topology_coordinator: Extract generate_cancel_request_update() storage_service: Drop dependency in topology_state_machine.hh in the header locator: Extract common code in assert_rf_rack_valid_keyspace() topology_coordinator, storage_service: Validate node removal/decommission at request submission time	2026-01-22 13:06:53 +01:00
Botond Dénes	21900c55eb	tools/scylla-sstable: remove --unsafe-accept-nonempty-output-dir This flag was added to operations which have an --output-dir command-line argument. These operations write sstables and need a directory where to write them. Back in the numeric-generation world this posed a problem: if the directory contained any sstable, generation clash was almost guaranteed, because each scylla-sstable command invocation would start output generations from 1. To avoid this, empty output directory was a requirement, with the --unsafe-accept-nonempty-output-dir allowing for a force-override. Now in the timeuuid generation days, all this is not necessary anymore: generations are unique, so it is not a problem if the output directory already contains sstables: the probability of generation clash is almost 0. Even if it happens, the tool will just simply fail to write the new sstable with the clashing generation. Remove this historic relic of a flag and the related logic, it is just a pointless nuisance nowadays.	2026-01-22 13:55:59 +02:00
Botond Dénes	a1ed73820f	tools/scylla-sstable: make partition_set ordered Next patch will want partitions to be ordered. Remove unused partition_map type.	2026-01-22 13:55:59 +02:00
Botond Dénes	d228e6eda6	tools/scylla-sstable: remove unused boost/algorithm/string.hpp include	2026-01-22 13:55:59 +02:00
Piotr Smaron	d4c28690e1	db: fail reads and writes with local consistency level to a DC with RF=0 When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception. Scylla allowed such read operations and failed write operations with a cryptic "broken promise" error. This occurred because the initial availability check passed (quorum of 0 requires 0 replicas), but execution failed later when no replicas existed to process the mutation. This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that throws before attempting operation execution. The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be upgraded. This testcase was asserting somewhat similar scenario, but wasn't taking into account the whole matrix of combinations: - scenarios: successful vs unsuccessful operation outcome - local consistency levels: LOCAL_QUORUM & LOCAL_ONE - operations: SELECT (read) & INSERT (write) and so it's been extended to cover both the pre-existing and the current issues and the whole matrix of combinations. Fixes: scylladb/scylladb#27893	2026-01-22 12:49:45 +01:00
Piotr Smaron	9475659ae8	db: consistency_level: split `local_quorum_for()` The core of `local_quorum_for()` has been extracted to `get_replication_factor_for_dc()`, which is going to be used later, while `local_quorum_for()` itself has been recreated using the extracted part.	2026-01-22 12:49:23 +01:00
Piotr Smaron	0b3ee197b6	db: consistency_level: fix nrs -> nts abbreviation `network_topology_strategy` was abbreviated with `nrs`, and not `nts`. I think someone incorrectly assumed it's 'network Replication strategy', hence nrs.	2026-01-22 12:48:37 +01:00
Marcin Maliszkiewicz	32543625fc	test: perf: reuse stream id When one request is very slow and the request rate is high, a collision on the stream id is possible in theory. This patch avoids that by reusing ids and aborting when there is no free one (unlikely).	2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz	7bf7ff785a	main: test: add future and abort_source to after_init_func This commit avoids leaking seastar::async future from two benchmark tools: perf-alternator and perf-cql-raw. Additionally it adds abort_source for fast and clean shutdown.	2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz	0d20300313	test: perf: add option to stress multiple tables in perf-cql-raw	2026-01-22 12:26:50 +01:00
Marcin Maliszkiewicz	a033b70704	test: perf: add perf-cql-raw benchmarking tool The tool supports: - auth or no auth modes - simple read and write workloads - connection pool or connection per request modes - in-process or remote modes, remote may be useful to assess tool's overhead or use it as bigger scale benchmark - uses prepared statements by default - connection only mode, for testing storms It could support in the future: - TLS mode - different workloads - shard awareness Example usage: > build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 2 --cpus 0,1 \ --developer-mode 1 --workload read --duration 5 2> /dev/null Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0} Pre-populated 10000 partitions 97438.42 tps (269.2 allocs/op, 1.1 logallocs/op, 35.2 tasks/op, 113325 insns/op, 80572 cycles/op, 0 errors) 102460.77 tps (261.1 allocs/op, 0.0 logallocs/op, 31.7 tasks/op, 108222 insns/op, 75447 cycles/op, 0 errors) 95707.93 tps (261.0 allocs/op, 0.0 logallocs/op, 31.7 tasks/op, 108443 insns/op, 75320 cycles/op, 0 errors) 102487.87 tps (261.0 allocs/op, 0.0 logallocs/op, 31.7 tasks/op, 107956 insns/op, 74320 cycles/op, 0 errors) 100409.60 tps (261.0 allocs/op, 0.0 logallocs/op, 31.7 tasks/op, 108337 insns/op, 75262 cycles/op, 0 errors) throughput: mean= 99700.92 standard-deviation=3039.28 median= 100409.60 median-absolute-deviation=2759.85 maximum=102487.87 minimum=95707.93 instructions_per_op: mean= 109256.53 standard-deviation=2281.39 median= 108337.37 median-absolute-deviation=1034.83 maximum=113324.69 minimum=107955.97 cpu_cycles_per_op: mean= 76184.36 standard-deviation=2493.46 median= 75320.20 median-absolute-deviation=922.09 maximum=80572.19 minimum=74320.00	2026-01-22 12:26:50 +01:00
Patryk Jędrzejczak	e11450ccca	test: test_raft_recovery_user_data: prevent repeated ALTER KEYSPACE request The test is currently flaky. With `remove_dead_nodes_with == "remove"`, it sends several ALTER KEYSPACE requests. The request performed just after adding 3 new nodes can unexpectedly be sent twice to two different nodes by the driver. The second receiver rejects the request through the new guardrail added in `2e7ba1f8ce`, and the test fails. This has been acknowledged as a bug in the Python driver. It shouldn't retry non-idempotent requests with the default retry policy. There could be one more bug in the driver, as it looks like the driver decides to resend the request after it disconnects from the first receiver. The first receiver has just bootstrapped, so the driver shouldn't disconnect. We deflake the test by reconnecting the driver before performing the problematic ALTER KEYSPACE request. The change has been tested in byo, as the failure reproduces only in CI. Without the change, the test fails once in ~250 runs in dev mode. With the change, more than 1000 runs passed. Fixes #27862 No backport needed as `2e7ba1f8ce` is only in master. Closes scylladb/scylladb#28290	2026-01-22 14:13:42 +03:00
Nadav Har'El	9baaddb613	test/cqlpy: add reproducer for hidden Paxos table being shown by DESC This patch adds a reproducer test showing issue #28183 - that when LWT is used, hidden tables "...$paxos" are created but they are unexpectedly shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE. The new test was failing (in three places) on Scylla, as those internal (and illegally-named) tables are listed, and passes on Cassandra (which doesn't add hidden tables for LWT). The commit also contains another test, which verifies if direct description of paxos state table is wrapped in comment. Refs #28183. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-01-22 10:43:59 +01:00
Szymon Malewski	e0cc6ca7e6	vector_index: Refactor extracting ANN query information For the purpose of rescoring we will need information if the query is an ANN query and the access to index option earlier in the `select_statement::prepare` than it happened before. This patch refactors extracting this information to new helper structure `ann_ordering_info` and is consistently using it.	2026-01-22 10:00:47 +01:00
Botond Dénes	7d2e6c0170	Merge 'config: add enforce_rack_list option' from Aleksandra Martyniuk Add enforce_rack_list option. When the option is set to true, all tablet keyspaces have rack list replication factor. When the option is on: - CREATE STATEMENT always auto-extends rf to rack lists; - ALTER STATEMENT fails when there is numeric rf in any DC. The flag is set to false by default and a node needs to be restarted in order to change its value. Starting a node with enforce_rack_list option will fail, if there are any tablet keyspaces with numeric rf in any DC. enforce_rack_list is a per-node option and a user needs to ensure that no tablet keyspace is altered or created while nodes in the cluster don't have the consistent value. Mark rf_rack_valid_keyspaces as deprecated. Fixes: https://github.com/scylladb/scylladb/issues/26399. New feature; no backport needed Closes scylladb/scylladb#28084 * github.com:scylladb/scylladb: test: add test for enforce_rack_list option db: mark rf_rack_valid_keyspaces as deprecated config: add enforce_rack_list option Revert "alternator: require rf_rack_valid_keyspaces when creating index"	2026-01-22 10:27:35 +02:00
Benny Halevy	d6557764b9	snapshot: keep per-sstable metadata in manifest.json Adds a "sstables" array member to manifest.json. For each sstable, keep the following metadata: id - a uuid for the sstable (the sstable identifier if the use-sstable-identifier option was used, otherwise the sstable uuid generation) toc_name - the name of the TOC.txt file data_size and index_size - in bytes first_token and last_token - of the sstable first and last keys. Fixes: SCYLLADB-196 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:42:52 +02:00
Benny Halevy	dc9093303d	snapshot: add table info and tablet_count to manifest.json Add a table member to manifest.json with the keyspace_name, table_name, table_id, tablets_type, and, for tablets-enabled tables, get tablet_count on each shard and write the minimum to manifest.json. For vnodes-based tables, tablet_count=0. For now, `tablets_type` may be either `none` for vnodes tables, or `powof2` for tablets tables. In the future, when we support arbitrary tablet boundaries, this will be reflected here, and it is likely we would back up the whole tablets map separately to get all tablet boundaries. Fixes SCYLLADB-195 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:36:52 +02:00
Benny Halevy	91df129e21	snapshot: add basic support for snapshot ttl in manifest.json Store the snapshot `created_at` time and an optional `expires_at` time. Fixes SCYLLADB-189 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	5e90fbb9d2	table: snapshot_on_all_shards: take snapshot_options And keep the options for now in the local_snapshot_writer. The options will be used by following patches to pass extra metadata like the snapshot creation time, expiration time, etc. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	49a3e0914d	db: snapshot_ctl: move skip_flush to struct snapshot_options So we can easily extend it and add more options. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	d9fc3b1c11	snapshot: add snapshot name in manifest.json Store the snapshot tag in the manifest file. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	0f014e2916	test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces If tablets are enabled via db::config, add the `tablet = {'enabled': true}` option when creating a keyspace, even if `cql_test_config.initial_tablets` is disengaged. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	0d82e56078	snapshot: add node info to manifest.json Add metadata about the node: host_id, datacenter, and rack. This enables dc- or rack- aware restore. Today this information is "encoded" into the snapshot hierarchy prefixes, but if all manifest files would be stored in a flat directory, we'd need to encode that metadata in the object name, but it'd be better for the manifest contents to be self descriptive. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	24040efc54	snapshot: add manifest info to manifest.json Add metadata about the manifest itself: A version and the manifest scope (currently "node", but in the future, may also be "shard", or "tablet") Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Benny Halevy	9e0f5410ae	test: database_test: snapshot_works: add validate_manifest Validate the manifest.json format by loading it using rjson::parse and then validate its contents to ensure it lists exactly the SSTables present in the snapshot directory. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-22 09:12:56 +02:00
Patryk Jędrzejczak	e503340efc	test: test_raft_recovery_during_join: get host ID of the bootstrapping node before it crashes The test is currently flaky. It tries to get the host ID of the bootstrapping node via the REST API after the node crashes. This can obviously fail. The test usually doesn't fail, though, as it relies on the host ID being saved in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()` repeatedly called in `ScyllaServer.start()`. However, with a very fast crash and unlucky timings, no such call may succeed. We deflake the test by getting the host ID before the crash. Note that at this point, the bootstrapping node must be serving the REST API requests because `await coordinator_log.wait_for("delay_node_bootstrap: waiting for message")` above guarantees that the node has submitted the join topology request, which happens after starting the REST API. Fixes #28227 Closes scylladb/scylladb#28233	2026-01-22 06:55:16 +02:00
Botond Dénes	4281d18c2e	Merge 'schema: Apply `sstable_compression_user_table_options` to CQL aux and Alternator tables' from Nikos Dragazis In PR `5b6570be52` we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams). This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (`adf9c426c2`). This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config. Fixes #26914. Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults. Closes scylladb/scylladb#27204 * github.com:scylladb/scylladb: db/config: Update sstable_compression_user_table_options description schema: Add initializer for compression defaults schema: Generalize static configurators into schema initializers schema: Initialize static properties eagerly db: config: Add accessor for sstable_compression_user_table_options test: Check that CQL and Alternator tables respect compression config	2026-01-22 06:50:48 +02:00
Nadav Har'El	5c1e525618	Merge 'vector_search: cache restrictions JSON at prepare time ' from Dawid Pawlik Add `prepared_filter` class which handles the preparation, construction, and caching of Vector Search filtering compatible JSON object. If no bind markers are found in primary key restrictions, the JSON object will be built once at prepare time and cached for use during execution calls. Additionally, this patch moves the filter functions from `cql3::restrictions` to `vector_search` namespace and does some renaming to make the purpose of those functions clear. Follow-up: https://github.com/scylladb/scylladb/pull/28109 Fixes: [SCYLLADB-299](https://scylladb.atlassian.net/browse/SCYLLADB-299) [SCYLLADB-299]: https://scylladb.atlassian.net/browse/SCYLLADB-299?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#28276 * github.com:scylladb/scylladb: test/vector_search: add filter tests with bind variables vector_search: cache restrictions JSON at prepare time refactor: vector_search: move filter logic to vector_search namespace	2026-01-21 19:03:35 +02:00
Ernest Zaslavsky	51285785fa	storage: add method to create component source Extend the storage interface with a public method to create a `data_source` for any sstable component.	2026-01-21 16:40:12 +02:00
Ernest Zaslavsky	757e9d0f52	streaming: keep sharded database reference on tablet_sstable_streamer	2026-01-21 16:40:12 +02:00
Ernest Zaslavsky	4ffa070715	streaming: skip streaming empty sstable sets Avoid invoking streaming when the sstable set is empty. This prevents unnecessary calls and simplifies the streaming logic.	2026-01-21 16:40:12 +02:00
Ernest Zaslavsky	0fcd369ef2	streaming: inline streamer instance creation Remove the `make_sstable_streamer` function and inline its usage where needed. This change allows passing different sets of arguments directly at the call sites.	2026-01-21 16:40:12 +02:00
Ernest Zaslavsky	32173ccfe1	tests: fix incorrect backup/restore test flow When working directly with sstable components, the provided name should be only the file name without path prefixes. Any prefixing tokens belong in the 'prefix' argument, as the name suggests.	2026-01-21 16:40:12 +02:00
Patryk Jędrzejczak	1f0f694c9e	test: test_zero_token_nodes_multidc: properly handle reads with CL=LOCAL_ONE The test is currently flaky. It incorrectly assumes that a read with CL=LOCAL_ONE will see the data inserted by a preceding write with CL=LOCAL_ONE in the same datacenter with RF=2. The same issue has already been fixed for CL=ONE in `21edec1ace`. The difference is that for CL=LOCAL_ONE, only dc1 is problematic, as dc2 has RF=1. We fix the issue for CL=LOCAL_ONE by skipping the check for dc1. Fixes #28253 The fix addresses CI flakiness and only changes the test, so it should be backported. Closes scylladb/scylladb#28274	2026-01-21 15:17:42 +01:00
Petr Gusev	843da2bbb8	test_encryption: capture stderr The check_output function doesn't capture stderr by default, making it impossible to debug if something goes wrong.	2026-01-21 14:56:01 +01:00
Petr Gusev	925d86fefc	test/cluster: add test_strong_consistency.py Add a basic test that creates a strongly consistent keyspace and table, writes some data, and verifies that the same data can be read back. Since Scylla-side request proxying is not yet implemented, writes are handled only on the leader node. The test uses the existing `/raft/leader_host` REST endpoint to determine the leader of the tablets Raft group.	2026-01-21 14:56:01 +01:00
Petr Gusev	59a876cebb	raft_group_registry: disable metrics for non-0 groups The `raft::server` registers metrics using the `server_id` label. When both a group0 Raft server and the tablets Raft server are created on the same node/shard, duplicate metrics cause conflicts. This commit temporarily disables metrics for non-0 groups. A proper fix will likely require adding a `group_id` label in the future.	2026-01-21 14:56:01 +01:00
Petr Gusev	1f170d2566	strong consistency: implement select_statement::do_execute()	2026-01-21 14:56:01 +01:00
Petr Gusev	a5d611866e	cql: add select_statement.cc	2026-01-21 14:56:01 +01:00
Petr Gusev	493ebe2add	strong consistency: implement coordinator::query()	2026-01-21 14:56:01 +01:00
Petr Gusev	ccf90cfde8	cql: add modification_statement We use decoration instead of inheritance, since inheritance already serves to differentiate statement types (modification_statement has update_statement and delete_statement as descendants). A better solution would likely involve refactoring modification_statement and extracting the mutation-generation logic into a reusable component shared by both eventual and strongly consistent statements.	2026-01-21 14:56:01 +01:00
Petr Gusev	989566e8a3	cql: add statement_helpers Introduce two helper methods that will be used for strongly consistent select_statement and modification_statement. redirect_statement() forwards the request to another shard or node. Currently, only shard forwarding is implemented; node-level proxying will be added in follow-up PRs. is_strongly_consistent() will be used in the prepare() method of raw statements to determine whether a strongly consistent statement should be created for the given CQL statement.	2026-01-21 14:56:01 +01:00
Petr Gusev	b94fd11939	strong consistency: implement coordinator::mutate() To guarantee monotonic mutation timestamps, we compute the maximum timestamp used so far for the current tablet. This is done by calling read_barrier() on the tablet’s Raft group server and extracting the maximum timestamp from the local database via table::get_max_timestamp_for_tablet(). Because read_barrier() may take a while, we perform it proactively in a dedicated fiber, leader_info_updater, rather than during the mutation request. This fiber is started when the Raft group server starts for a tablet. It reacts to wait_for_state_change(), computes the maximum timestamp, and stores it per term. The new groups_manager::begin_mutate() function checks whether the maximum timestamp has already been computed for the current term. If not, it asks the client to wait. This two-step interface (synchronous begin_mutate() + asynchronous wait on the need_wait_for_leader future) is needed because the term can change at any asynchronous point. If begin_mutate() were asynchronous, the client would need to recheck the term after `co_await begin_mutate()`. We currently do not handle raft::commit_status_unknown. We rethrow it to the CQL client, which must check whether the command was applied and retry if necessary. Handling this inside Scylla would require persisting a deduplication key after applying the mutation, which introduces write amplification. Additionally, connection breaks between Scylla and the driver can always occur, so the client must be prepared to verify the command status regardless.	2026-01-21 14:56:01 +01:00
Petr Gusev	801bf82a34	raft.hh: make server::wait_for_leader() public When a strongly consistent request arrives at a node, we need to know which replica is the leader, since such requests are generally executed only on the leader. If a leader has not yet been elected, we must wait. This commit exposes wait_for_leader() so it can be used for that purpose. We cannot rely solely on wait_for_state_change(), because it does not trigger when some other node becomes a leader.	2026-01-21 14:56:01 +01:00
Petr Gusev	7d111f2396	strong_consistency: add coordinator Add the `coordinator` class, which will be responsible for coordinating reads and writes to strongly consistent tables. This commit includes only the boilerplate; the methods will be implemented in separate commits.	2026-01-21 14:56:01 +01:00
Petr Gusev	4413142f25	modification_statement: make get_timeout public We'll need to access this method in a new strong_consistency/modification_statement class.	2026-01-21 14:56:00 +01:00
Petr Gusev	4902186ede	strong_consistency: add groups_manager This class is responsible for managing raft groups for strongly-consistent tablets.	2026-01-21 14:56:00 +01:00
Petr Gusev	4eee5bc273	strong_consistency: add state_machine and raft_command These commands will be used by strongly consistent tablets to submit mutations to Raft. A simple state_machine implementation is introduced to apply these commands. We apply commands in batches to reduce commitlog I/O overhead. The batched variant of database::apply has known atomicity issues. For example, it does not guarantee atomicity under memory pressure: some mutations may be published to the memtable while others are blocked in run_when_memory_available. We will address these issues later.	2026-01-21 14:56:00 +01:00
Petr Gusev	a8350b274e	table: add get_max_timestamp_for_tablet Strongly consistent writes require knowing the maximum timestamp of locally applied mutations to guarantee monotonically increasing timestamps for subsequent writes. This commit adds a function that returns the maximum timestamp for a given tablet. Why it is safe to use this function with deleted cells: * Tombstones are included in memtable.get_max_timestamp() calculations. * The maximum timestamp of a memtable is used to initialize the maximum timestamp of the resulting sstable. * During compaction, a new sstable’s maximum timestamp is initialized as the maximum of the contributing sstables.	2026-01-21 14:56:00 +01:00
Petr Gusev	dc4896b69b	tablets: generate raft group_id-s for new table We assign shard to zero because raft is currently only supported on the zero shard.	2026-01-21 14:56:00 +01:00
Petr Gusev	6dd74be684	tablet_replication_strategy: add consistency field This commit adds a `consistency` field to `tablet_replication_strategy`. In upcoming commits we'll use this field to determine if a `raft_group_id` should be generated for a new table.	2026-01-21 14:56:00 +01:00
Petr Gusev	53f93eb830	tablets: add raft_group_id Add a `raft_group_id` column to `system.tablets` and to the `tablet_map` class. The column is populated only when the `strongly_consistent_tables` feature is enabled. This feature is currently disabled by default and is enabled only when the user sets the `STRONGLY_CONSISTENT_TABLES` experimental flag. The `raft_group_id` column is added to `system.tablets` only when this flag is set. This allows the schema to evolve freely while the feature is experimental, without requiring complex migrations.	2026-01-21 14:56:00 +01:00
Petr Gusev	cab3e1eea5	modification_statement: remove virtual where it's not needed This is a refactoring/simplification commit.	2026-01-21 14:56:00 +01:00
Petr Gusev	9015bed794	modification_statement: inline prepare_statement() This is a refactoring/simplification commit. There are many 'prepare' functions in this class that don't meaningfully differ from each other. The prepare_statement() adds accidental complexity by adding a level of indirection -- the reader has to jump between the call site and the function body to reconstruct the full picture.	2026-01-21 14:56:00 +01:00
Petr Gusev	998ee5b7fb	system_keyspace: disable tablet_balancing for strongly_consistent_tables We don't yet support tablet balancing for strongly consistent tables.	2026-01-21 14:56:00 +01:00
Petr Gusev	6b0d757f28	cql: rename strongly_consistent statements to broadcast statements In preparation for upcoming work on strongly consistent queries in Scylla, this commit renames the existing `strongly_consistent` statements to `broadcast_statements` to avoid confusion. The old code paths are kept temporarily, as they may be useful for reference or for copying parts during the implementation of the new strongly consistent statements.	2026-01-21 14:56:00 +01:00
Pavel Emelyanov	18b5a49b0c	Populate all sl:* groups into dedicated top-level supergroup This patch changes the layout of user-facing scheduling groups from / `- statement `- sl:default `- sl:* `- other groups (compaction, streaming, etc.) into / `- user (supergroup) `- statement `- sl:default `- sl:* `- other groups (compaction, streaming, etc.) The new supergroup has 1000 static shares and is name-less, in the sense that it only has a variable in the code to refer to it and is not exported via metrics (should be fixed in seastar if we want to). The moved groups don't change their names or shares, only move inside the scheduling hierarchy. The goal of the change is to improve resource consumption of sl:* groups. Right now activities in low-shares service levels are scheduled on par with e.g. streaming activity, which is considered to be a low-priority one. Moving all sl:* groups into their own supergroup with 1000 shares changes the meaning of sl:* shares. From now on these share values describe priorities of service levels relative to each other, and the user activities compete with the rest of the system with 1000 shares, regardless of how many service levels there are. Unit tests keep their user groups under root supergroup (for simplicity) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28235	2026-01-21 14:14:48 +02:00
Botond Dénes	cf4133c62d	Merge 'service: pass topology guard to RBNO' from Aleksandra Martyniuk Currently, raft-based node operations with streaming use topology guards, but repair-based don't. Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes. Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards. Fixes: https://github.com/scylladb/scylladb/issues/27759 No topology_guard in any version; needs backport to all versions Closes scylladb/scylladb#27839 * github.com:scylladb/scylladb: service: use session variable for streaming service: pass topology guard to RBNO	2026-01-21 12:47:28 +02:00
Robert Bindar	ea8a661119	reduce test_backup.py and test_refresh.py datasets backup and restore tests. This made the testing times explode with both cluster/object_store/test_backup.py and cluster/test_refresh.py taking more than an hour each to complete under test.py and around 14min under pytest directly. This was painful especially in CI because it runs tests under test.py which suffers from the issue of not being able to run test cases from within the same file in parallel (a fix is attempted in 27618). This patch reduces the dataset of these tests to the minimum and gets rid of one of the tested topologies as it was redundant. The test times are reduced to 2min under pytest and 14 mins under test.py. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#28280	2026-01-21 10:47:36 +02:00
Nadav Har'El	8962093d90	Merge 'vector_index: Introduce `rescoring` index option' from Szymon Malewski This series introduces `rescoring` index option. There is no rescoring algorithm implementation yet. This series prepares it by: - adding new index option - adding documentation - adding tests for option handling - adding tests for rescoring implementation - at this point they report errors and are marked that this is expected, because rescoring is not implemented. Follow-up https://github.com/scylladb/scylladb/pull/27677 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293 Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294 No backporting - it is a new feature. Closes scylladb/scylladb#28165 * github.com:scylladb/scylladb: vector_search: Add more rescoring validation tests vector_search: Add rescoring validation test vector_search: doc: Document new index option vector_search: test: Add `rescoring` index option test vector_index: introduce rescoring option vector_index: improve options validation	2026-01-21 10:46:22 +02:00
Avi Kivity	ecb6fb00f0	streamed_mutation_freezer: use chunked_vector instead of std::deque for clustering rows The streamed_mutation_freezer class uses a deque to avoid large allocations, but fails as seen in the referenced issue when the vector backing the deque grows too large. This may be a problem in itself, but the issue doesn't provide enough information to tell. Fix the immediate problem by switching to chunked_vector, which is better in avoiding large allocations. We do lose some early-free in serialize_mutation_fragments(), but since most of the memory should be in the clustering row itself, not in the deque/chunked_vector holding it, it should not be a problem. Fixes #28275 Closes scylladb/scylladb#28281	2026-01-21 10:13:44 +02:00
Botond Dénes	09d3b6c98b	Merge 'auth: use paged internal queries during migration' from Piotr Smaron Auth v2 migration uses non-paged queries via `execute_internal` API. This commit changes it to use `query_internal` instead, which uses paging under the hood. Fixes: https://github.com/scylladb/scylladb/issues/27577 A minor enhancement, no need to backport. Closes scylladb/scylladb#25395 * github.com:scylladb/scylladb: auth: use paged internal queries during migration auth: move some code in migrate_to_auth_v2 up auth: re-align pieces of migrate_to_auth_v2 cql: extend `query_internal` with `query_state` param	2026-01-21 09:58:06 +02:00
Raphael S. Carvalho	d16f9c821d	Revert "api: storage_service/tablets/repair: disable incremental repair by default" This reverts commit `c8cff94a5a`. Re-enabling incremental repair on master with "Aborting on shard 0 during scaleout + repair #26041" and "Failure to attach sstables in streaming consumer leaves sealed sstables on disk #27414" fixed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#28120	2026-01-21 08:50:13 +02:00
Dawid Pawlik	4a7e20953a	test/cqlpy: remove the xfail mark from already passing tests Since #28109 was merged, those tests started to pass as we allow the filtering on primary key columns within ANN vector queries. Closes scylladb/scylladb#28231	2026-01-21 08:47:20 +02:00
Botond Dénes	7a3f51b304	Update seastar submodule * seastar e00f1513...f55dc7eb (8): > build: detect io_uring correctly > ioinfo: report actual I/O latency goal instead of configured value > core/loop: repeater,repeat_until_value: don't rethrow exceptions > seastar::test: Add exception interception to extract nested info > bool_class: add operator<=> > Merge 'Perftune.py: generic virtual device with lower-level slave network devices support' from Vladislav Zolotarov scripts/perftune.py: NetPerfTuner: rename "hardware interface" methods to "tunable interface" methods scripts/perftune.py: NetPerfTuner: refactor NIC handling to support any virtual interface with slaves scripts/perftune.py: NetPerfTuner: rename methods to private for better encapsulation > scripts/perftune.py: add fast path queues IRQ filtering and indexing for Microsoft Azure Network Adapter (MANA) NICs > file: Finally override dma_read_bulk() in posix_file_impl Closes scylladb/scylladb#28262	2026-01-21 08:44:20 +02:00
Szymon Malewski	d6226500f6	vector_search: Add more rescoring validation tests Adding tests for specific cases of rescoring processing: - wildcard selection - "SELECT * ..." is a case with slightly different path of rescoring processing. We want to confirm that it is handled correctly. - calculating similarity with other vectors in SELECT clause should not influence ANN ordering. - NULL handling - results that for any reason have NULL in a score should be filtered out. As rescoring is not implemented yet, the tests use boost::unit_test::expected_failures to indicate that the test reports errors.	2026-01-20 21:01:45 +01:00
Karol Nowacki	376c70be75	vector_search: Add rescoring validation test Verify that vector store results will be correctly rescored and reordered according to the rescoring algorithm. As rescoring is not implemented yet, the tests use `boost::unit_test::expected_failures` to indicate that they report errors. First test checks rescoring with a simple selection list. Second makes sure that rescoring is not triggered for quantization=f32 - full representation of vectors. Third repeats the first one, but adds to it returning of similarity score value.	2026-01-20 21:01:45 +01:00
Karol Nowacki	41dcf12463	vector_search: doc: Document new index option Adds documentation for the rescoring option for vector search indexes.	2026-01-20 21:01:45 +01:00
Karol Nowacki	b268eda67e	vector_search: test: Add `rescoring` index option test Add tests to validate `rescoring` index options. It also improves tests for related `oversampling` option validation.	2026-01-20 21:01:45 +01:00
Szymon Malewski	c5945b1ef4	vector_index: introduce rescoring option This patch adds vector index option allowing to enable rescoring - recalculation of similarity metric and re-ranking of quantized VS candidates. Quantization is a necessary condition to run rescoring - checked in convenience function `is_rescoring_enabled`. Rescoring itself is not implemented - it will come in following patches. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-294	2026-01-20 21:01:45 +01:00
Szymon Malewski	262a8cef0b	vector_index: improve options validation In this patch we enhance validation of options by: - giving context (option name) in error messages - listing supported values in error messages of enumerated options - avoiding using templates Fixes https://scylladb.atlassian.net/browse/SCYLLADB-293 Follow-up: https://github.com/scylladb/scylladb/pull/27677	2026-01-20 21:01:41 +01:00
Dawid Pawlik	f27ef79d0d	test/vector_search: add filter tests with bind variables Add tests that check that preparation of the filter works and does not produce a cached JSON object when the restrictions contain bind variables.	2026-01-20 18:17:46 +01:00
Dawid Pawlik	e62cb29b7d	vector_search: cache restrictions JSON at prepare time Add `prepared_filter` class which handles the preparation, construction and caching of Vector Search filtering compatible JSON object. If no bind markers found in SELECT statement, the JSON object will be built once at prepare time and cached for use during execution calls. Adjust tests accordingly to use prepared filters. Follow-up: #28109 Fixes: SCYLLADB-299	2026-01-20 17:15:52 +01:00
Michał Jadwiszczak	f89a8c4ec4	cql3/statements/describe_statement: hide paxos state tables Paxos state tables are internal tables fully managed by Scylla; they shouldn't be exposed to the user, nor should they be backed up. This commit hides those kinds of tables from all listings, and if such a table is directly described with `DESC ks."tbl$paxos"`, the description is generated within a comment and a note for the user is added. Fixes scylladb/scylladb#28183	2026-01-20 15:58:08 +01:00
Andrei Chekun	91e2b027ce	test.py: do not crash when there is no boost log Small fix to not crash the whole process when boost tests fail to start and do not produce a log file that can be parsed.	2026-01-20 15:52:40 +01:00
Andrei Chekun	58d3052ad4	test.py: pass correctly extra cmd line arguments During the rewrite, --extra-scylla-cmdline-options was missed and was not passed to the tests that use pytest. As a result, there was no way to pass these parameters to Scylla via the command line; the tests themselves were not affected because they took the parameters from the yaml file. This PR fixes the issue, making it easier to modify the Scylla start parameters without modifying code.	2026-01-20 15:52:40 +01:00
Anna Stuchlik	84e9b94503	doc: fix the default compaction strategy for Materialized Views Fixes https://github.com/scylladb/scylladb/issues/24483 Closes scylladb/scylladb#27725	2026-01-20 15:41:03 +01:00
Dawid Pawlik	f54a4010c0	refactor: vector_search: move filter logic to vector_search namespace Move Vector Search filter functions from `cql3::restrictions` to the `vector_search` namespace, as it's a better place given their purpose. The effective name has now changed from `cql3::restrictions::to_json` to `vector_search::to_json`, which clearly indicates that the JSON object will be used for Vector Search. Rename the auxiliary functions to use the `to_json` suffix instead of a variety of verbs, as the logic of those functions focuses on building a JSON object from different structures. The former naming put too much emphasis on the distinction between those functions, while they do pretty much the same thing. Follow-up: #28109	2026-01-20 13:13:43 +01:00
Botond Dénes	a53f989d2f	db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot The API contract in partition_version.hh states that when dealing with evictable entries, a real cache tracker pointer has to be passed to all methods that ask for it. The nonpopulating reader violates this, passing a nullptr to the snapshot. This was observed to cause a crash when a concurrent cache read accessed the snapshot with the null tracker. A reproducer is included which fails before and passes after the fix. Fixes: #26847 Closes scylladb/scylladb#28163	2026-01-20 12:34:37 +01:00
Avi Kivity	c7dda5500c	database: simplify apply_counter_update exception handling Use coroutine::try_future to exit the coroutine immediately on error instead of explicit checks. Closes scylladb/scylladb#28257	2026-01-20 11:13:49 +02:00
Wojciech Mitros	fc2aecea69	idl: don't redefine bound_weight and partition_region in paging_state.idl.hh Bound_weight and partition_region are defined in both paging_state.idl.hh and position_in_partition.idl.hh. This isn't currently causing any issues, but if a future RPC uses both the paging_state and position_in_partition, after including both files we'll get a duplicate error. In this patch we prevent this by removing the definitions from paging_state.idl.hh and including position_in_partition.idl.hh in their place. Closes scylladb/scylladb#28228	2026-01-20 11:12:47 +02:00
Aleksandra Martyniuk	2be5ee9f9d	service: use session variable for streaming Use session that was retrieved at the beginning of the handler for node operations with streaming to ensure that the session id won't change in between.	2026-01-20 10:06:34 +01:00
Aleksandra Martyniuk	3fe596d556	service: pass topology guard to RBNO Currently, raft-based node operations with streaming use topology guards, but repair-based don't. Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes. Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards. Fixes: https://github.com/scylladb/scylladb/issues/27759	2026-01-20 10:06:34 +01:00
Aleksandra Martyniuk	f0dbf6135d	test: add test for enforce_rack_list option	2026-01-20 10:01:15 +01:00
Aleksandra Martyniuk	7dc371f312	db: mark rf_rack_valid_keyspaces as deprecated Mark the rf_rack_valid_keyspaces option as deprecated. Users should use the enforce_rack_list option instead. The option can still be used and its behavior does not change. Docs are updated accordingly.	2026-01-20 09:58:57 +01:00
Aleksandra Martyniuk	761ace4f05	config: add enforce_rack_list option Add enforce_rack_list option. When the option is set to true, all tablet keyspaces have rack list replication factor. When the option is on: - CREATE STATEMENT always auto-extends rf to rack lists; - ALTER STATEMENT fails when there is numeric rf in any DC. The flag is set to false by default and a node needs to be restarted in order to change its value. Starting a node with enforce_rack_list option will fail, if there are any tablet keyspaces with numeric rf in any DC. enforce_rack_list is a per-node option and a user needs to ensure that no tablet keyspace is altered or created while nodes in the cluster don't have the consistent value.	2026-01-20 09:58:51 +01:00
Michael Litvak	e7ec87382e	Revert "alternator: require rf_rack_valid_keyspaces when creating index" This reverts commit `4b26a86cb0`. The rf_rack_valid_keyspaces option is now not required for creating MVs.	2026-01-20 09:56:48 +01:00
Pavel Emelyanov	8ecd4d73ac	test: Update cluster/object_store/ tests to use new S3 config format Currently the suite generates config in the old format, and only a single test validates that using the new format "works". This change updates the suite (mainly the MinioServer::create_conf() method) to generate endpoint config in the new format. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28113	2026-01-20 10:53:34 +02:00
Pavel Emelyanov	5e369c0439	audit: Stop using deprecated seastar UDP sending API The datagram_channel::send() method that sends net::packet-s is deprecated in favor of using span<temporary_buffer> one. Auditing code still uses the former one -- it constructs a packet by using formatted string by copying the string into the packet's fragment, then sends it. This patch releases string into temporary_buffer and then passes one-element span to send(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28198	2026-01-20 10:51:23 +02:00
Avi Kivity	874322f95e	multishard_query: simplify do_query() coroutine/continuation complexity do_query() is a coroutine but uses some continuations to take advantage of exceptions being propagated via future::then() without being thrown. We can accomplish the same thing with a nested coroutine and coroutine::try_future(), simplifying the code. While this area isn't performance intensive, we're not adding allocations. The coroutine frame may add an allocation, but since read_page() certainly does not return immediately, the following then() will allocate as well. Since we eliminated that then(), the change is at least neutral allocation-wise. Closes scylladb/scylladb#28258	2026-01-20 10:45:10 +02:00
Łukasz Paszkowski	e07fe2536e	test/pylib/util.py: Add retries and additional logging to start_writes() Consider the following scenario: 1. Let nodes A,B,C form a cluster with RF=3 2. Write query with CL=QUORUM is submitted and is acknowledged by nodes B,C 3. Follow-up read query with CL=QUORUM is sent to verify the write from the previous step 4. Coordinator sends data/digest requests to the nodes A,B. Since the node A is missing data, digest mismatches and data reconciliation is triggered 5. The node A or B fails, becomes unavailable, etc 6. During reconciliation, data requests are sent to node A,B and fail failing the entire read query When the above scenario happens, the tests using `start_writes()` fail with the following stacktrace: ``` ... > await finish_writes() test/cluster/test_tablets_migration.py:259: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ test/pylib/util.py:241: in finish await asyncio.gather(*tasks) test/pylib/util.py:227: in do_writes raise e _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ worker_id = 1 ... > rows = await cql.run_async(rd_stmt, [pk]) E cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1} ``` Note that when a node failure happens before/during a read query, there is no test failure as the speculative retries are enabled by default. Hence an additional data/digest read is sent to the third remaining node. However, the same speculative read is cancelled the moment, the read query reaches CL which may trigger a read-repair. 
This change: - Retries the verification read in start_writes() on failure to mitigate races between reads and node failures - Adds additional logging to correlate Python exceptions with Scylla logs Fixes https://github.com/scylladb/scylladb/issues/27478 Fixes https://github.com/scylladb/scylladb/issues/27974 Fixes https://github.com/scylladb/scylladb/issues/27494 Fixes https://github.com/scylladb/scylladb/issues/23529 Note that this change addresses test flakiness observed during tablet transitions. However, it serves as a workaround for a higher-level issue https://github.com/scylladb/scylladb/issues/28125 Closes scylladb/scylladb#28140	2026-01-20 10:38:20 +02:00
Piotr Smaron	6084f250ae	auth: use paged internal queries during migration Auth v2 migration uses non-paged queries via `execute_internal` API. This commit changes it to use `query_internal` instead, which uses paging under the hood. Fixes: scylladb/scylladb#27577	2026-01-20 09:32:21 +01:00
Piotr Smaron	345458c2d8	auth: move some code in migrate_to_auth_v2 up Just move the touched code above so the next commit is more readable. But this has a drawback: previously, if the returned rows were empty, this code was not executed, but now it is executed regardless of the query results. This shouldn't be a big deal, though, as auth shouldn't be empty.	2026-01-20 09:15:53 +01:00
Piotr Smaron	97fe3f2a2c	auth: re-align pieces of migrate_to_auth_v2 This is needed to make next commits more readable. Just moved the touched code 4 spaces to the right.	2026-01-20 09:15:53 +01:00
Piotr Smaron	3ca6b59f80	cql: extend `query_internal` with `query_state` param This is later going to be used to pass a query timeout via `qs` to `query_internal`.	2026-01-20 09:15:48 +01:00
Nadav Har'El	70b3cd0540	Merge 'vector_index: introduce `quantization` and `oversampling` options' from Szymon Malewski This patch adds vector index options allowing to enable quantization and oversampling. Specific quantization value will be used internally by vector store. In the current implementation, get_oversampling allows us to decide how many times more candidates to retrieve from vector store - final response is still trimmed to the given limit. It is a first step to allow rescoring - recalculation of similarity metric and re-ranking. Without rescoring oversampling will be also further optimized to happen internally in vector store. `test/vector_search/rescoring_test.cc` implements basic tests of added functionality. New options are documented in `docs/cql/secondary-indexes.rst`. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82 Ref https://scylladb.atlassian.net/browse/SCYLLADB-83 New feature - no backporting Closes scylladb/scylladb#27677 * github.com:scylladb/scylladb: vector_search: doc: Document new index options vector_search: test: Test oversampling vector_search: test: Add rescoring index options test vector_search: test: Extract Configure utility to shared header vector_index: introduce `quantization` and `oversampling` options	2026-01-20 08:50:46 +02:00
Avi Kivity	36347c3ce9	Merge 'db/system_keyspace: remove namespace v3' from Botond Dénes Cassandra changed their system tables in 3.0. We migrated to the new system table layout in 2017, in ScyllaDB 2.0. System tables introduced in Cassandra 3.0, as well as the 3.0 variant of pre-existing system tables, were added to the db::system_table::v3 namespace. We ended up adding some new ScyllaDB-only system tables to this namespace as well. As the dust settled, most of the v3 system tables ended up being either simple aliases to non-v3 tables, or new tables. Either way, the codebase has used just one variant of each table for a long time now, so the v3:: distinction is pointless. Remove the v3 namespace and unify the table listing under the top-level db::system_keyspace scope. Code cleanup, no backport Closes scylladb/scylladb#28146 * github.com:scylladb/scylladb: db/system_keyspace: move remining tables out of v3 keyspace db/system_keyspace: relocate truncated() and commitlog_cleanups() db/system_keyspace: drop v3::local() db/system_keyspace: remove duplicate table names from v3	2026-01-19 20:54:38 +02:00
Dawid Mędrek	3b8bf85fbc	test/lib/boost_test_tree_lister.cc: Record empty test suites Before this commit, if a test file or a test suite didn't include any actual test cases, it was ignored by `boost_test_tree_lister`. However, this information is useful; for example, it allows us to tell if the test file the user wants to run doesn't exist or simply doesn't contain any tests. The kind of error we would return to them should be different depending on which situation we're dealing with. We start including those empty suites and files in the output of `--list_json_content`. --- Examples (with additional formatting): * Consider the following test file, `test/boost/dummy_test.cc` [1]: ``` BOOST_AUTO_TEST_SUITE(dummy_suite1) BOOST_AUTO_TEST_SUITE(dummy_suite2) BOOST_AUTO_TEST_SUITE_END() BOOST_AUTO_TEST_SUITE_END() BOOST_AUTO_TEST_SUITE(dummy_suite3) BOOST_AUTO_TEST_SUITE_END() ``` Before this commit: ``` $ ./build/debug/test/boost/dummy_test -- --list_json_content [{"file": "test/boost/dummy_test.cc", "content": {"suites": [], "tests": []}}] ``` After this commit: ``` $ ./build/debug/test/boost/dummy_test -- --list_json_content [{"file":"test/boost/dummy_test.cc", "content": {"suites": [ {"name": "dummy_suite1", "suites": [ {"name": "dummy_suite2", "suites": [], "tests": []} ], "tests": []}, {"name": "dummy_suite3", "suites": [], "tests": []} ], "tests": []}}] ``` * Consider the same test file as in Example 1, but also assume it's compiled into `test/boost/combined_tests`. Before this commit: ``` $ ./build/debug/test/boost/combined_tests -- --list_json_content \| grep dummy $ ``` After this commit: ``` $ ./build/debug/test/boost/combined_tests -- --list_json_content [..., {"file": "test/boost/dummy_test.cc", "content": {"suites": [ {"name": "dummy_suite1", "suites": [{"name": "dummy_suite2", "suites": [], "tests": []}], "tests": []}, {"name": "dummy_suite3", "suites": [], "tests": []}], "tests":[]}}, ...] ``` [1] Note that the example is simplified. 
As of now, it's not possible to use `--list_json_content` with a file without any Boost tests. That will result in the following error: `Test setup error: test tree is empty`. Refs scylladb/scylladb#25415	2026-01-19 18:03:24 +01:00
Dawid Mędrek	1129599df8	test/lib/boost_test_tree_lister.cc: Deduplicate labels In scylladb/scylladb@afde5f668a, we implemented custom collection of information about Boost tests in the repository. The solution boiled down to traversing through the test tree via callbacks provided by Boost.Test and calling that code from a global fixture. This way, the code is called automatically by the framework. Unfortunately, for an unknown reason, this leads to labels of test units being duplicated. We haven't found the root cause yet and so we deduplicate the labels manually. --- Example (with additional formatting): Consider the following test in the file `test/boost/dummy_test.cc`: ``` SEASTAR_TEST_CASE(dummy_case, *boost::unit_test::label("mylabel1")) { return make_ready_future(); } ``` Before this commit: ``` $ ./build/dev/test/boost/dummy_test -- --list_json_content [{"file": "test/boost/dummy_test.cc", "content": {"suites": [], "tests": [{"name": "dummy_case", "labels": "mylabel1,mylabel1"}]} }] ``` After this commit: ``` $ ./build/dev/test/boost/dummy_test -- --list_json_content [{"file": "test/boost/dummy_test.cc", "content": {"suites": [], "tests": [{"name": "dummy_case", "labels": "mylabel1"}]} }] ``` Refs scylladb/scylladb#25415	2026-01-19 18:01:14 +01:00
Nikos Dragazis	4cde34f6f2	storage_service: Remove redundant yields The loops in `ongoing_rf_change()` perform explicit yields, but they also perform coroutine operations which can yield implicitly. The explicit yields are redundant. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-01-19 16:18:49 +02:00
Marcin Maliszkiewicz	1318ff5a0d	test: perf: move cut_arg helper func to common code It will be reused later.	2026-01-19 14:33:10 +01:00
Tomasz Grabiec	7977c97694	Merge 'db: add effective_capacity to load_per_node virtual table' from Ferenc Szili `effective_capacity` is a value used in size based load balancing. It contains the sum of available disk space of a node and all the tablet sizes. This change adds this value to the virtual table `system.load_per_node`. This can be useful for debugging size based load balancing. Size based load balancing is currently only on master, so no backport is needed. Closes scylladb/scylladb#28220 * github.com:scylladb/scylladb: docs: add effective_capacity to system keyspace docs virtual_table: add effective_capacity to load_per_node	2026-01-19 13:17:28 +01:00
Marcin Maliszkiewicz	be8a30230b	Merge 'test/cluster/dtest: import scrub_test.py' from Botond Dénes This test has to be adjusted in lock-step with scylladb.git, due to changes in https://github.com/scylladb/scylladb/pull/27836. It is simpler to just take the time and import it, so https://github.com/scylladb/scylladb/pull/27836 can patch all the affected tests, including this one. All code is imported verbatim, then patched later, such that the series remains bisectable. dtest import, no backport needed Closes scylladb/scylladb#28085 * github.com:scylladb/scylladb: test/cluster/dtest: remove is_win() and users test/cluster/dtest/scrub_test.py: add license blurb test/cluster/dtest: import scrub_test.py test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable()	2026-01-19 12:14:08 +01:00
Botond Dénes	2e4d0e42f0	test/cluster/dtest: remove is_win() and users ScyllaDB and its tests never run on windows, this function is not needed, patch it out.	2026-01-19 12:56:57 +02:00
Botond Dénes	8953a143e5	test/cluster/dtest/scrub_test.py: add license blurb The original scrub test was done by the Cassandra project, hence there are two license notices: one for the original work by Cassandra (2015) and one for our modifications on top (2021).	2026-01-19 12:55:59 +02:00
Botond Dénes	d2c266eb47	test/cluster/dtest: import scrub_test.py Import the test verbatim. Requires adding is_win() to ccmlib/common.py, with a dummy implementation.	2026-01-19 12:52:44 +02:00
Botond Dénes	99e8a92aef	test/cluster/dtest/ccmlib: scylla_node.py: adapt run_scylla_sstable() at al To work in the local test.py context.	2026-01-19 12:52:44 +02:00
Botond Dénes	807da53583	test/cluster/dtest/ccmlib: scylla_node.py: import run_scylla_sstable() And dependencies: get_sstables() and __gather_sstables(). Code is imported verbatim, but doesn't work yet (no users yet either). Will be patched to work in the next commit.	2026-01-19 12:52:44 +02:00
Botond Dénes	e01041d3ee	db/system_keyspace: move remining tables out of v3 keyspace The last remaining tables in the v3 keyspace are those that are genuinely distinct -- added by Cassandra 3.0 or >= ScyllaDB 2.0. Move these out of the v3 keyspace too; with this, the v3 keyspace is defunct and removed.	2026-01-19 12:32:21 +02:00
Botond Dénes	ce57ef94bd	db/system_keyspace: relocate truncated() and commitlog_cleanups() The name variables of these tables are outside the v3 namespace, but the methods defining their schema are in the v3 namespace. Relocate the methods out from the v3 namespace, to the scope where the name variables live. The methods are moved to the private: part of system_keyspace, as they don't have external users currently.	2026-01-19 12:32:21 +02:00
Botond Dénes	2ccb8ff666	db/system_keyspace: drop v3::local() It is unused, the non-v3 variant is used instead.	2026-01-19 12:32:21 +02:00
Botond Dénes	b52a3f3a43	db/system_keyspace: remove duplicate table names from v3 These are the table names that are effectively just aliases of their counterparts outside of the v3 namespace (struct). scylla_local() is made public. Currently it is private, but it has external users, working around the private designation by using the public v3::scylla_local() alias. This change just makes the existing status clear.	2026-01-19 12:32:21 +02:00
Karol Nowacki	324b829263	vector_search: doc: Document new index options Adds documentation for the `quantization` and `oversampling` options for vector search indexes.	2026-01-19 10:28:46 +01:00
Karol Nowacki	bca17290f4	vector_search: test: Test oversampling Add test to verify that Scylla correctly oversamples the limit according to the oversampling option.	2026-01-19 10:28:46 +01:00
Karol Nowacki	e347f6d0d4	vector_search: test: Add rescoring index options test Add tests to validate quantization and oversampling index options.	2026-01-19 10:28:44 +01:00
Karol Nowacki	24b037e8e3	vector_search: test: Extract Configure utility to shared header Move Configure test utility to dedicated file for reuse across test suites.	2026-01-19 10:21:44 +01:00
Szymon Malewski	b8e91ee6ae	vector_index: introduce `quantization` and `oversampling` options This patch adds vector index options allowing to enable quantization and oversampling. Specific quantization value will be used internally by vector store. In the current implementation, `get_oversampling` allows us to decide how many times more candidates to retrieve from vector store - final response is still trimmed to the given limit. It is a first step to allow rescoring - recalculation of similarity metric and re-ranking. Without rescoring oversampling will be also further optimized to happen internally in vector store. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-82 Ref https://scylladb.atlassian.net/browse/SCYLLADB-83	2026-01-19 10:21:43 +01:00
Tomasz Grabiec	dd0fc35c63	lsa: Export metrics for reclaim/evict/compact time Currently, we only know about long reclaims from lsa-timing stall reports. Shorter reclaims can go under the radar. Those metrics will help to assess the increase in LSA activity, which translates to higher CPU cost of a workload. reclaim tracks memory which goes to the standard allocator, e.g. when entering an allocating_section or in the background reclaimer. evict/compact count activity towards building the LSA reserve, in allocating_section entry, or naked LSA allocation. Closes scylladb/scylladb#27774	2026-01-19 12:08:16 +03:00
Nadav Har'El	3e270a49f7	test/cqlpy: remove test_describe.py from cluster reuse blacklist The way that test.py runs test/cqlpy tests requires that tests end their session with all keyspaces deleted. If we forget to delete a keyspace, test.py suspects some test fails and reports a failure. As reported in issue #26291, the test file test/cqlpy/test_describe.py caused this check to trigger, so this file was added to the blacklist "dirties_cluster" in suite.yaml to force test.py to ignore this problem. I believe the cause of the problem was as follows: test_describe.py didn't really leave any undeleted keyspace. Rather, test_describe.py had one test which used "USE" and this broke DESC KEYSPACES (Refs #26334) - which test.py used to see which keyspaces remained. We solved this problem not just once, but twice: 1. In pull request #26345, I fixed the test not to use "USE" on the main CQL session. 2. In pull request #27971, I fixed DESC KEYSPACES implementation so even if "USE" was in effect, it will return the correct results. I checked manually, and after removing test_describe.py from the dirties_cluster blacklist, all cqlpy tests now pass, without spurious failures in the test following test_describe.py. So it's time to remove it from the blacklist. Fixes #26291 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27973	2026-01-19 12:02:00 +03:00
Dario Mirovic	823d1b9c03	audit: fix start_audit init sequence placement Commit `d54d409` (audit: write out to both table and syslog) unified create_audit and start_audit, which moved the audit service creation later in the startup sequence. This broke startup when audit is enabled because view_builder prepares CQL queries before start_audit runs, and query preparation calls audit_instance().local_is_initialized() which crashes on the non-existent sharded service. Move start_audit to run before view_builder::start() and other components that may prepare CQL queries during their initialization. Fixes SCYLLADB-252 Closes scylladb/scylladb#28139	2026-01-19 11:57:39 +03:00
Botond Dénes	6f5f42305a	docs: make the glossary more tablet inclusive Our glossary is stuck in the past, still discussing token ownership in terms of vnodes and cluster synchronization in terms of gossip. This patch tries to improve this a bit, although much more work needs to be done. The term `Tablet` is added and the definitions of `Token` and `Token Range` are rephrased to be tablet inclusive. The term `Cluster` is changed to mention raft as the synchronization mechanism instead of gossip. One outstanding problem is that our general architecture page describing the ring architecture is still Vnode only. We have a separate Tablets page, but the two don't link to each other and most documentation refers only to the former. A casual reader might be able to spend a lot of time on our documentation page without even seeing the word: tablets. Closes scylladb/scylladb#28170	2026-01-19 11:50:13 +03:00
Ernest Zaslavsky	829bd9b598	aws_error: fix nested exception handling The loop that unwraps nested exceptions rethrows the nested exception, saves a pointer to the temporary std::exception& inner on the stack, and then continues. The pointer thus ends up pointing to a released temporary. Closes scylladb/scylladb#28143	2026-01-19 11:41:47 +03:00
Botond Dénes	b7bc48e7b7	reader_concurrency_semaphore: improve handling of base resources reader_permit::release_base_resources() is a soft evict for the permit: it releases the resources acquired during admission. This is used in cases where a single process owns multiple permits, creating a risk of deadlock, as is the case for repair. In this case, release_base_resources() acts as a manual eviction mechanism to prevent permits from blocking each other from admission. Recently we found a bad interaction between release_base_resources() and permit eviction. Repair uses both mechanisms: it marks its permits as inactive and later it also uses release_base_resources(). This practice might be worth reconsidering, but the fact remains that there is a bug in the reader permit which causes the base resources to be released twice when release_base_resources() is called on an already evicted permit. This is incorrect and is fixed in this patch. Improve release_base_resources(): * make _base_resources const * move signal call into the if (_base_resources_consumed()) { } * use reader_permit::impl::signal() instead of reader_concurrency_semaphore::signal() * all places where base resources are released now call release_base_resources() A reproducer unit test is added, which fails before and passes after the fix. Fixes: #28083 Closes scylladb/scylladb#28155	2026-01-19 11:37:51 +03:00
Nadav Har'El	d86d5b33aa	test/cqlpy: translate Cassandra's unit tests for LWT This is a translation of Cassandra's CQL unit test source file validation/operations/InsertUpdateIfConditionTest.java into our cqlpy framework. This test file checks various LWT conditional updates. After that file became too big, the Cassandra developers split parts from it - moving tests for LWT with collections, UDTs, and static columns to separate test files - which I already translated (pull request #13663). This patch translates the remaining, main, LWT tests. Strangely, this test file also has, in the middle of the file, several tests for conditional schema changes, like CREATE KEYSPACE IF NOT EXISTS, a feature which has nothing to do with LWT so really didn't belong in this file. But I translated those as well. These new tests all pass on both ScyllaDB and Cassandra, and have not uncovered any new bug. However these tests do demonstrate yet again something that users and developers of ScyllaDB's LWT must be aware of: Whereas usually ScyllaDB's goal has been compatiblity with Cassandra's CQL, in LWT this has not been the case: ScyllaDB deviated from Cassandra's behavior in its LWT implementation in several places. These intentional deviations were documented in docs/kb/lwt-differences.rst. Accordingly, the tests here include almost a hundred (!) modificatons (search for "if is_scylla") to allow the same test to pass on both ScyllaDB and Cassandra, as well as many comments explaining the types of differences we're seeing. Although these deviations from Cassandra compatibility are known and intentional, it's worth listing here the ones re-discovered by these new tests: 1. On a successful conditional write, Cassandra returns just true, Scylla also returns the old contents of the row. 2. Similarly, in an IF EXISTS write that failed (the row did not exist), Cassandra returns just false, Scylla also returns extra null values for each and every column of the row. 3. 
Cassandra allows in "IF v IN (?, ?)" to bind individual values to UNSET_VALUE and skips them, Scylla treats this as an error. Refs #13659. 4. When there are static columns, Scylla's LWT response returns the static column first, Cassandra returns the modified column first. Since both also say which columns they return, neither is more correct than the other, and normally users will address specific columns by name, not by position. 5. docs/kb/lwt-differences.rst explains that "the returned result set contains an old row for every conditional statement in the batch". Beyond this difference, actually non-conditional updates in the batch will also get a row in Scylla's result. Refs #27955. 6. For batch statement, ScyllaDB allows mixing `IF EXISTS`, `IF NOT EXISTS`, and other conditions for the same row. Cassandra doesn't, so checks that these combinations are not allowed were commented out. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27961	2026-01-19 09:46:04 +02:00
Botond Dénes	19efd7f6f9	Merge 'The system_replicated_keys should be mark as a system keyspace' from Amnon Heiman This PR marks system_replicated_keys as a system keyspace. It was missing when the keyspace was added. A side effect of that is that metrics that are not supposed to be reported are. Fixes #27903 Closes scylladb/scylladb#27954 * github.com:scylladb/scylladb: distributed_loader: system_replicated_keys as system keyspace replicated_key_provider: make KSNAME public	2026-01-19 09:37:41 +02:00
Aleksandra Martyniuk	65cba0c3e7	service: node_ops: remove coroutine::lambda wrappers In storage_service::raft_topology_cmd_handler we pass a lambda wrapped in coroutine::lambda to a function that creates streaming_task_impl. The lambda is kept in streaming_task_impl that invokes it in its run method. The lambda captures may be destroyed before the lambda is called, leading to use after free. Do not wrap a lambda passed to streaming_task_impl into coroutine::lambda. Use this auto to dissociate the lambda lifetime from the calling statement. Fixes: https://github.com/scylladb/scylladb/issues/28200. Closes scylladb/scylladb#28201	2026-01-19 09:19:53 +02:00
Botond Dénes	c8811387e1	Merge 'service: do not change the schema while pausing the rf change ' from Aleksandra Martyniuk Currently, if a rf change request is paused, it immediately changes the system_schema.keyspaces to use rack list for this keyspace. If the request is aborted, the co-location might not be finished. Hence, we can end up with inconsistent schema and tablet replica state. Update the system_schema.keyspaces only after the co-location is done (and not when it's started). Fixes: https://github.com/scylladb/scylladb/issues/28167 No backport needed; changes that introduced a bug are only on master Closes scylladb/scylladb#28168 * github.com:scylladb/scylladb: service: fin indentation test: add test_numeric_rf_to_rack_list_conversion_abort service: tasks: fix type of global_topology_request_virtual_task service: do not change the schema while pausing the rf change	2026-01-19 09:15:20 +02:00
Botond Dénes	7d637b14e8	Merge 'test/cluster/test_internode_compression: Transpose test from dtest' from Calle Wilund Refs #27429 Re-implement the dtest with same name as a scylla pytest, using a python level network proxy instead of tcpdump etc. Both to avoid sudo and also to ensure we don't race. Juggles different listen_address and broadcast_address values to insert a proxy measuring RPC traffic. Note: the measuring relies on python network IO not splitting data chunks, since we don't really have packet-level view of the connections. Note that a scylla change is required to make the ip address magic work, otherwise topology mechanism gets confused. This should maybe at some point be looked into more, since we should be more resilient against various services in scylla binding to different addresses. When this test is merged, we can drop the flaky test from dtest. And hope no new flakiness comes from this one... Closes scylladb/scylladb#28133 * github.com:scylladb/scylladb: test/cluster/test_internode_compression: Transpose test from dtest gossiper/main: Extend special treatment of node ID resolve for rpc_address	2026-01-19 08:34:31 +02:00
Ferenc Szili	1136a3f398	docs: add effective_capacity to system keyspace docs This adds the description of effective_capacity to the documentation of the system keyspace.	2026-01-18 16:57:08 +01:00
Ferenc Szili	3e0362ec67	virtual_table: add effective_capacity to load_per_node This change adds effective_capacity to the virtual table load_per_node. This value can be useful for debugging size based load balancing.	2026-01-18 16:52:13 +01:00
Nadav Har'El	3e138a2685	test/cqlpy: Add our copyright/license to translated Cassandra tests All the tests under test/cqlpy/cassandra_tests/ were translated from Cassandra's unit tests originally written in Java into our own test framework, and accordingly carry a clear mention of their origin and original license. However, we did modify these original tests - even if the modification was slight and mostly straightforward. Therefore I was asked to also mention our own copyright (and license) for these modifications. So this patch adds to every file in test/cqlpy/cassandra_tests/ text like: # Modifications: Copyright 2026-present ScyllaDB # SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0 with the appropriate year instead of 2026. Fixes #28215 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28216	2026-01-18 16:25:28 +01:00
Tomasz Grabiec	3b0df29ceb	docs: Document parallel decommission and removenode and relevant task API	2026-01-18 15:36:08 +01:00
Tomasz Grabiec	85140cdf7e	test: Add tests for parallel decommission/removenode	2026-01-18 15:36:08 +01:00
Tomasz Grabiec	5c93e12373	test: util: Introduce ensure_group0_leader_on() Many tests want to assume that group0 leader runs on a particular server, typically the first server in the list. And they cannot be easily made to work with arbitrary leader, because they set up a particular topology and then stop particular nodes, and want to assume the leader is stable. They open leader's log and expect things to appear in that log. It's much easier to ensure the leader, than to prepare tests to handle failovers.	2026-01-18 15:36:07 +01:00
Tomasz Grabiec	478b8f09df	test: tablets: Check that there are no migrations scheduled on draining nodes In case of decommission, it's not desirable because it's less urgent. In case of removenode, it leads to failure of removenode operation because scheduled co-locating migration will fail if the destination is on the excluded node, and this failure will be interpreted as drain failure and coordinator will cancel the request. Not a problem before "parallel decommission" because this failure is only a streaming failure, not a barrier failure, so exception doesn't escape into the catch clause in transition stage handler, and the migration is simply rolled back. Once draining happens in the tablet migration track, streaming failure will be interpreted as drain failure and cancel the request.	2026-01-18 15:36:07 +01:00
Tomasz Grabiec	e082e32cc7	test: lib: topology_builder: Introduce add_draining_request()	2026-01-18 15:36:07 +01:00
Tomasz Grabiec	baea12c9cb	topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization Reaching critical disk utilization on destination means the draining either caused it, or at least works against relieving it. So it's better to cancel those requests. In case of decommission, if critical disk utilization was caused by it due to not enough capacity, aborting decommission will bring capacity back to the system and rebalancing will relieve critical disk utilization.	2026-01-18 15:36:07 +01:00
Tomasz Grabiec	1b784e98f3	tablets: topology_coordinator: Refactor to propagate reason for migration rollback Will be easier to implement later change to cancel topology request, where we need to give a reason for doing so.	2026-01-18 15:36:07 +01:00
Tomasz Grabiec	2d954f4b19	tablet_allocator: Skip co-location on draining nodes In case of decommission, it's not desirable because it's less urgent. In case of removenode, it leads to failure of removenode operation because scheduled co-locating migration will fail if the destination is on the excluded node, and this failure will be interpreted as drain failure and coordinator will cancel the request. Not a problem before "parallel decommission" because this failure is only a streaming failure, not a barrier failure, so exception doesn't escape into the catch clause in transition stage handler, and the migration is simply rolled back. Once draining happens in the tablet migration track, streaming failure will be interpreted as drain failure and cancel the request.	2026-01-18 15:36:06 +01:00
Tomasz Grabiec	d9e1a6006f	node_ops: task_manager_module: Populate entity field also for active requests	2026-01-18 15:36:06 +01:00
Tomasz Grabiec	bbd293d440	tasks: node_ops: Put node id in the entity field If we have many node requests active at a time, it's useful to know which request works on which node. Fixes #27208	2026-01-18 15:36:06 +01:00
Tomasz Grabiec	576ebcdd30	tasks, node_ops: Unify setting of task_stats in get_status() and get_stats() They should return the same, so extract the common logic.	2026-01-18 15:36:05 +01:00
Tomasz Grabiec	629d6d98fa	topology: Protect against empty cancelation reason Request would be deemed successful, which is counter to the intention of cancelation and effect on the system.	2026-01-18 15:36:05 +01:00
Tomasz Grabiec	7446eb7e8d	tasks, topology: Make pending node operations abortable We want to be able to cancel decommission when it's still in the tablet draining phase. Such a request is in a pending and paused state, and can be safely canceled. We set the node's "draining" flag back to false.	2026-01-18 15:36:05 +01:00
Tomasz Grabiec	091ed4d54b	doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged Since `288e75fe22`	2026-01-18 15:36:05 +01:00
Tomasz Grabiec	a009644c7d	raft_topology, tablets: Drain tablets in parallel with other topology operations Allows other topology operations to execute while tablets are being drained on decommission. In particular, bootstrap on scale-out. This is important for elasticity. Allows multiple decommission/removenode to happen in parallel, which is important for efficiency. Flow of decommission/removenode request: 1) pending and paused, has tablet replicas on target node. Tablet scheduler will start draining tablets. 2) No tablets on target node, request is pending but not paused 3) Request is scheduled, node is in transition 4) Request is done Nodes are considered draining as soon as there is a leave or remove request on them. If there are tablet replicas present on the target node, the request is in a paused state and will not be picked by topology coordinator. The paused state is computed from topology state automatically on reload. When request is not paused, its execution starts in write_both_read_old state. The old tablet_draining state is not entered (it's deprecated now). Tablet load balancing will yield the state machine as soon as some request is no longer paused and ready to be scheduled, based on standard preemption mechanics. The test case test_explicit_tablet_movement_during_decommission is removed. It verifies that tablet move API works during tablet draining transition. After this PR, we no longer enter this transition, so the test doesn't work. It loses its purpose, because movement during normal tablet balancing is not special and tested elsewhere.	2026-01-18 15:36:05 +01:00
Tomasz Grabiec	e38ee160fc	virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node It gives a more accurate picture of what happens in the cluster.	2026-01-18 15:36:04 +01:00
Tomasz Grabiec	1c2e47e059	locator: topology: Add "draining" flag to a node They are being drained of tablet replicas, tablet scheduler works to move replicas away from such nodes. This state is set at the beginning of decommission and removenode operations.	2026-01-18 15:36:04 +01:00
Tomasz Grabiec	a37b1ce832	topology_coordinator: Extract generate_cancel_request_update()	2026-01-18 15:36:04 +01:00
Tomasz Grabiec	77bd00bf9f	storage_service: Drop dependency in topology_state_machine.hh in the header To reduce compilation time.	2026-01-18 15:36:04 +01:00
Tomasz Grabiec	a24c3fc229	locator: Extract common code in assert_rf_rack_valid_keyspace()	2026-01-18 15:36:04 +01:00
Tomasz Grabiec	d3ee82ea51	topology_coordinator, storage_service: Validate node removal/decommission at request submission time After parallel tablet draining, the validation at the time request starts executing is too late, tablets will be already drained. This trips tests which expect validation failure, but get tablet draining failure instead. Also, in case of decommission, it's a waste to go through draining only to discover that the operation has to be rolled back due to validation. So avoid submitting a request altogether if it's invalid. The validation at request execution start remains, for extra safety. validate_removing_node() was extracted out of topology_coordinator, so that it can be called by storage_service on non-coordinator. Some tests need adjusting for the fact that after failed removenode the node may still not be marked as excluded, so we need to explicitly exclude it or add to the list of ignored nodes in the next removenode operation.	2026-01-18 15:36:04 +01:00
Nadav Har'El	34d28475d9	Merge 'Implement Vector Search filtering API' from Dawid Pawlik Since Vector Store service filtering API has been implemented (scylladb/vector-store#334), there is a need for the implementation of Scylla side part. This patch should implement a `statement_restrictions` parsing into Vector Store filtering API compatible JSON objects. Those objects should be added to ANN query vector POST requests as `filter` object. After this patch, the subset of all operations ([Vector Search Filtering Milestone 1](https://scylladb.atlassian.net/wiki/spaces/RND/pages/156729450/Vector+Search+Filtering+Design+Document#Milestone-1)) happy path should be completed, allowing users to filter on primary key columns with single column `=` and `IN` or multiple column `()=()` and `() IN ()`. The restrictions for other operations should be implemented in a PR on Vector Store service side. --- This PR implements parsing the `statement_restrictions` into Vector Store filtering API compatible JSON objects. The JSON objects are created and used in ANN vector queries with filtering. It closes the Scylla side implementation of Vector Search filtering milestone 1. Unit tests for `statement_restrictions` parsing are added. Integration tests will be added on Vector Store service side PR. --- Fixes: SCYLLADB-249 New feature, should land into 2026.1 Closes scylladb/scylladb#28109 * github.com:scylladb/scylladb: docs: update documentation on filtering with vector queries test/vector_search: add test for filtered ANN with VS mock test/vector_search: add restriction to JSON conversion unit tests vector_search: cql: construct and use filter in ANN vector queries select_statement: do not require post query ordering for vector queries vector_search: add `statement_restrictions` to JSON parsing	2026-01-18 16:11:29 +02:00
Ernest Zaslavsky	eb76858369	Update seastar submodule seastar dd46b6f..e00f1513 ``` e00f1513 Merge 'net: Add DNS TTL to the net::hostent' from Ernest Zaslavsky 8a69e1f4 net: extract common implementation of inet_address::find_all cb469fd1 net: deprecate the addr_list in hostent 1d59c0ca net: expose DNS TTL via net::hostent 3c6d919f http: add virtual close() to connection_factory bbd0001a Revert "net: expose DNS TTL via net::hostent" ``` Closes scylladb/scylladb#28147	2026-01-18 15:00:48 +02:00
Andrzej Jackowski	6eca7e4ff6	transport: unify lambda capture lifetime for control connections Workload prioritization was added in scylladb/scylladb#22031. The functionality of updating service levels was implemented as a lambda coroutine, leaving room for the lambda coroutine fiasco. The problem was noticed and addressed in scylladb/scylladb#26404. There are currently three functions that call switch_tenant: - update_user_scheduling_group_v1 and update_user_scheduling_group_v2 use the deducing this (this auto self) to ensure the proper lifecycle of the lambda capture. - update_control_connection_scheduling_group doesn’t use the deducing this, but the lambda captures only `this`, which is used before the first possible coroutine preemption. Therefore, it doesn’t seem that any memory corruption or undefined behavior is possible here. Nevertheless, it seems better to start using the deducing this in update_control_connection_scheduling_group as well, to avoid problems in the future if someone modifies the code and forgets to add it. Fixes: SCYLLADB-284 Closes scylladb/scylladb#28158	2026-01-17 20:36:31 +02:00
Nikos Dragazis	8aca7b0eb9	test: database_test: Fix serialization of partition key The `make_key` lambda erroneously allocates a fixed 8-byte buffer (`sizeof(s.size())`) for variable-length strings, potentially causing uninitialized bytes to be included. If such bytes exist and they are not valid UTF-8 characters, deserialization fails: ``` ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7) ``` Fixes #28195. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#28197	2026-01-17 20:32:06 +02:00
Botond Dénes	1e09a34686	replica: add abort polling to memtable and cache readers Continuing the read once it is aborted (e.g. due to timeout) is a waste of resources, as the produced results will be discarded. Poll the permit's abort exception in the memtable and cache reader's fill_buffer(). This results in one poll per buffer filled (8KB of data). We already have similar poll for sstable readers, as disk reads are usually much heavier and therefore it is more important to stop them ASAP after abort. Cache and memtable reads are usually quick but not always, hence it is important to also have polling in the cache and memtable readers. Refs: #11469 Fixes: #28148 Closes scylladb/scylladb#28149	2026-01-16 18:03:04 +01:00
Ferenc Szili	0aebc17c4c	docs: correct spelling errors in size based balancing docs `0ede8d154b` introduced the dev doc for size based load balancing, but also added spelling errors. This PR fixed these errors. Closes scylladb/scylladb#28196	2026-01-16 17:41:57 +02:00
Aleksandra Martyniuk	ad2381923f	service: fin indentation	2026-01-16 11:38:10 +01:00
Aleksandra Martyniuk	504290902c	test: add test_numeric_rf_to_rack_list_conversion_abort Add regression test that checks whether aborted rf change leaves the system_schema.keyspaces unchanged.	2026-01-16 11:36:21 +01:00
Aleksandra Martyniuk	3ed8701301	service: tasks: fix type of global_topology_request_virtual_task Currently, the type of global_topology_request_virtual_task isn't taken out of std::variant before printing, which results in a task of type variant(actual_type). Retrieve the type from the variant before passing it to task type.	2026-01-16 11:36:21 +01:00
Aleksandra Martyniuk	580dfd63e5	service: do not change the schema while pausing the rf change Currently, if a rf change request is paused, it immediately changes the system_schema.keyspaces to use rack list for this keyspace. If the request is aborted, the co-location might not be finished. Hence, we can end up with inconsistent schema and tablet replica state. Update the system_schema.keyspaces only after the co-location is done (and not when it's started).	2026-01-16 11:36:15 +01:00
Patryk Jędrzejczak	eb7be9010d	Merge 'topology_coordinator: Refresh load stats after table is created or altered' from Tomasz Grabiec We switched to the size-based load balancing, which now has more strict requirements for load stats. We no longer need only per-node stats, but also per-tablet stats. Bootstrapping a node triggers stats refresh, but allocating tablets on table creation didn't. So after creating a table, load balancer couldn't make progress for up to 60s (stats refresh period). This makes tests take longer, and can even cause failures if tests are using a low-enough timeout. Fixes https://github.com/scylladb/scylladb/issues/27921 No backport because only master is vulnerable (size-based load balancing). Closes scylladb/scylladb#27926 * https://github.com/scylladb/scylladb: test: cluster: Add reproducer for missed notification in topology coordinator topology_coordinator: Wake up the state machine after stats refresh topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier topology_coordinator: Fix potential missed notification topology_coordinator: Refresh load stats after table is created or altered tablets: Do a group0 read barrier on tablet load stats refresh topology_coordinator: Ensure stats are refreshed in the gossip scheduling group test: Use ManagerClient.{disable,enable}_tablet_balancing() test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing()	2026-01-16 11:34:57 +01:00
Dawid Pawlik	383f9e6e56	docs: update documentation on filtering with vector queries Add a description of available filtering options with ANN vector queries. Provide an example of such query and a reference to `WHERE` clause restrictions.	2026-01-16 11:18:23 +01:00
Dawid Pawlik	67d3454d2b	test/vector_search: add test for filtered ANN with VS mock Implement a test using Vector Store mock to check if end-to-end integration works with filtered ANN query.	2026-01-16 11:18:23 +01:00
Dawid Pawlik	a54be82536	test/vector_search: add restriction to JSON conversion unit tests Add unit tests for conversion of CQL restrictions to Vector Store filtering API compatible JSON objects. The tests include: - empty restriction - `ALLOW FILTERING` in restriction - single column restrictions - `=`, `<`, `>`, `<=`, `>=`, `IN` - multiple column restrictions - `()=()`, `()<()`, `()>()`, `()<=()`, `()>=()`, `() IN ()` - multiple restrictions conjunction - `TEXT` and `BOOLEAN` column restrictions	2026-01-16 11:18:23 +01:00
Dawid Pawlik	2a38794b8e	vector_search: cql: construct and use filter in ANN vector queries Add `filter` option in `ann()` function to write the filter JSON object as the POST request in ANN vector queries. Adjust existing `vector_store_client_test` tests accordingly.	2026-01-16 11:18:23 +01:00
Dawid Pawlik	304e908e3b	select_statement: do not require post query ordering for vector queries As there is only one `ORDER BY` clause with `ANN OF` ordering supported in ANN vector queries, there is no need to require post query ordering for the ANN vector queries. The standard ordering is not allowed here. In fact the ordering is done on the Vector Store service side within the ANN search, so that the returned primary keys are already sorted accordingly. If left unchanged, the filtering with `IN` clauses would cause a `bad_function_call` server error, as filtering with `IN` clauses requires the post query ordering in a standard case.	2026-01-16 11:18:23 +01:00
Dawid Pawlik	a84d1361db	vector_search: add `statement_restrictions` to JSON parsing Add a module parsing the statement restrictions into Vector Store filtering API compatible JSON objects. The API was defined in: scylladb/vector-store#334 Exemplary JSON object compatible with the API: ``` { "restrictions": [ { "type": "==", "lhs": "pk", "rhs": 1 }, { "type": "IN", "lhs": "pk", "rhs": [2, 3] }, { "type": "<", "lhs": "ck", "rhs": 4 }, { "type": "<=", "lhs": "ck", "rhs": 5 }, { "type": ">", "lhs": "pk", "rhs": 6 }, { "type": ">=", "lhs": "pk", "rhs": 7 }, { "type": "()==()", "lhs": ["pk", "ck"], "rhs": [10, 20] }, { "type": "()IN()", "lhs": ["pk", "ck"], "rhs": [[100, 200], [300, 400]] }, { "type": "()<()", "lhs": ["pk", "ck"], "rhs": [30, 40] }, { "type": "()<=()", "lhs": ["pk", "ck"], "rhs": [50, 60] }, { "type": "()>()", "lhs": ["pk", "ck"], "rhs": [70, 80] }, { "type": "()>=()", "lhs": ["pk", "ck"], "rhs": [90, 0] } ], "allow_filtering": true } ```	2026-01-16 11:18:23 +01:00
Tomasz Grabiec	3fb7719277	topology_coordinator: Update load stats in case rebuilding with no live replica Such rebuild has no read_from replica, but we know the tablet size will be 0. If we don't, stats will be incomplete until the next refresh. This is important for test cases which do removenode or replace while all replicas are down. So for example test_replace from test_tablets_removenode.py, which uses RF=1 and replaces a node. Without this, the test waits for 60s needlessly after the first round of rebuilding migrations before scheduling more migrations. This can cause the test to time out. Fixes #28115 Closes scylladb/scylladb#28121	2026-01-16 11:19:01 +02:00
Sergey Zolotukhin	799d837295	test: disable test_start_bootstrapped_with_invalid_seed The test intermittently fails when an invalid DNS name is resolved, likely due to ISP DNS error hijacking (see scylladb/scylladb#28153). Disable this test to unblock CI. Fixes scylladb/scylladb#28153 Closes scylladb/scylladb#28162	2026-01-15 10:25:45 +01:00
Jenkins Promoter	51d61f809e	Update pgo profiles - aarch64	2026-01-15 05:13:03 +02:00
Jenkins Promoter	eed1e7fa23	Update pgo profiles - x86_64	2026-01-15 04:33:43 +02:00
Tomasz Grabiec	eef798d84f	Merge 'Distribute data evenly among primary replicas during restore' from Robert Bindar Most likely `817fdad` uncovered the fact that our choice of primary replica was resonating with tablet allocation and we were ending up picking the same replica as primary within a scope instead of rotating primaryship among all replicas in the scope. This created situations where for instance, restoring into 9 nodes with primary_replica_only=true would put all data into 3 nodes, leaving the other 6 unused. The balancing of the dataset was performed by the subsequent repair step. This PR fixes this by changing the formula for picking up the primary replica out of a set of eligible replicas from within the passed scope. The PR also extends the testing scenarios in `test_backup.py` so we get to run restore for a set of topologies, for all combinations of scope, primary_replica_only and min_tablet_counts. Most of the work was done by @bhalevy [here](https://github.com/scylladb/scylladb/compare/master...bhalevy:scylla:load-balance-primary-replica), this PR just split it and did touchups here and there. Fixes #27281 Closes scylladb/scylladb#27397 * github.com:scylladb/scylladb: test: reduce dataset and number of test cases or debug builds test: bump repair timeout up, it's sometimes not enough in CI test: refactor test_refresh.py to match test_restore_with_streaming_scopes. 
test: extend test_restore_with_streaming_scopes test: Adjust test_restore_primary_replica_different_dc_scope_all test: Refactor restoring code in test_backup to match SM pattern test: add check_mutation_replicas calls after fresh creation of dataset test: extend create_dataset to accept consistency_level test: refactor check_mutation_replicas so it's more readable test: make create_dataset async and refactor so it's configurable test: use defaultdict in collect_mutations test: add log marks to facilitate reusing server for restore locator: tablets: Distribute data evenly among primary replicas during restore	2026-01-14 18:57:55 +01:00
Avi Kivity	bd08b6e5b2	Merge 'Unify configuration of object storage endpoints (take 2)' from Pavel Emelyanov To configure S3 storage, one needs to do ``` object_storage_endpoints: - name: s3.us-east-1.amazonaws.com port: 443 https: true aws_region: us-east-1 ``` and for GCS it's ``` object_storage_endpoints: - name: https://storage.googleapis.com:433 type: gs credentials_file: <gcp account credentials json file> ``` This PR updates the S3 part to look like ``` object_storage_endpoints: - name: https://s3.us-east-1.amazonaws.com:443 aws_region: us-east-1 ``` fixes: #26570 This is 2nd attempt, previous one (#27360) was reverted because it reported endpoint configs in new format via API and CQL always, even if the endpoint was configured in the old way. This "broke" scylla manager and some dtests. This version has this bug fixed, and endpoints are reported in the same format as they were configured with. About correctness of the changes. No modifications to existing tests are made here, so old format is respected correctly (as far as it's covered by tests). To prove the new format works, the test_get_object_store_endpoints is extended to validate both options. Some preparations to this test to make this happen come on their own with the PR #28111 to show that they are valid and pass before changing the core code. Enhancing the way configuration is made, likely no need to backport. Closes scylladb/scylladb#28112 * github.com:scylladb/scylladb: test: Validate S3 endpoints new format works docs: Update docs according to new endpoints config option format object_storage: Create s3 client with "extended" endpoint name s3/storage: Tune config updating sstable: Shuffle args for s3_client_wrapper test: Rename badconf variable into objconf test: Split the object_store/test_get_object_store_endpoints test	2026-01-14 18:29:03 +02:00
Yaniv Michael Kaul	d919aacc69	storage_proxy: mark write_timeouts metric for counter write timeouts When a counter write times out (due to rpc::timeout_error or timed_out_error), the code was throwing mutation_write_timeout_exception but not marking the write_timeouts metric. This resulted in counter write timeouts not being counted in the scylla_storage_proxy_coordinator_write_timeouts metric. Regular writes go through mutate_internal -> mutate_end, which catches mutation_write_timeout_exception and marks the metric. However, counter writes use a separate code path (mutate_counters) that has its own exception handling but was missing the metric update. This fix adds get_stats().write_timeouts.mark() before throwing the timeout exception in the counter write path, consistent with how the CAS path handles cas_write_timeouts. Refs: https://scylladb.atlassian.net/browse/SCYLLADB-245 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#28019	2026-01-14 17:50:46 +02:00
Gleb Natapov	bee5f63cb6	topology coordinator: complete pending operation for a replaced node A replaced node may have a pending operation on it. The replace operation will move the node into the 'left' state and the request will never be completed. Moreover, the code does not expect a left node to have a request. It will try to process the request and will crash because the node for the request will not be found. The patch checks if the replaced node has a pending request and completes it with failure. It also changes topology loading code to skip requests for nodes that are in a left state. This is not strictly needed, but makes the code more robust. Fixes #27990 Closes scylladb/scylladb#28009	2026-01-14 13:11:27 +01:00
Botond Dénes	551eecab63	Merge 'EAR: deprecate the replicated key provider' from Calle Wilund Refs #22733. Adds runtime warning and docs info that replicated provider is deprecated and will be removed. Fixes #27292 Closes scylladb/scylladb#27270 * github.com:scylladb/scylladb: docs::encryption: Add warning that replicated provider is deprecated ent::encryption: Switch default key provider from replicated to local replicated_key_provider: Add deprecation warning on usage	2026-01-14 13:47:23 +02:00
Calle Wilund	2fd6ca4c46	test/cluster/test_internode_compression: Transpose test from dtest Refs #27429 re-implement the dtest with same name as a scylla pytest, using a python level network proxy instead of tcpdump etc. Both to avoid sudo and also to ensure we don't race. v2: * Included positive test (mode=all)	2026-01-14 10:53:34 +01:00
Patryk Jędrzejczak	6b5923c64e	test: test_group0_schema_versioning: wait for schema sync in system.local `test_schema_versioning_with_recovery` is currently flaky. It performs a write with CL=ALL and then checks if the schema version is the same on all nodes by calling `verify_table_versions_synced`. All nodes are expected to sync their schema before handling the replica write. The node in RECOVERY mode should do it through a schema pull, and other nodes should do it through a group 0 read barrier. The problem is in `verify_local_schema_versions_synced` that compares the schema versions in `system.local`. The node in RECOVERY mode updates the schema version in `system.local` after it acknowledges the replica write as completed. Hence, the check can fail. We fix the problem by making the function wait until the schema versions match. Note that RECOVERY mode is about to be retired together with the whole gossip-based topology in 2026.2. So, this test is about to be deleted. However, we still want to fix it, so that it doesn't bother us in older branches. Fixes #23803 Closes scylladb/scylladb#28114	2026-01-14 09:55:45 +01:00
Jakub Smolar	aefd815194	test.py: add pexpect to the dependencies Use pexpect to control a persistent GDB process with pattern reads and timeouts. This makes 'scylla_gdb' tests faster and less flaky. Added python3-pexpect in 'install-dependencies.sh'. Closes scylladb/scylladb#26419 [avi: build optimized clang 21.1.8 regenerated frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-21.1.8-Fedora-43-x86_64.tar.gz ] Closes scylladb/scylladb#28134	2026-01-14 10:17:37 +02:00
Botond Dénes	122b7847e5	Merge 'index: Accept view properties in CREATE INDEX' from Dawid Mędrek Problem ------- Secondary indexes are implemented via materialized views under the hood. The way an index behaves is determined by the configuration of the view. Currently, it can be modified by performing the CQL statement `ALTER MATERIALIZED VIEW` on it. However, that raises some concerns. Consider, for instance, the following scenario: 1. The user creates a secondary index on a table. 2. In parallel, the user performs writes to the base table. 3. The user modifies the underlying materialized view, e.g. by setting the `synchronous_updates` to `true` [1]. Some of the writes that happened before step 3 used the default value of the property (which is `false`). That had an actual consequence on what happened later on: the view updates were performed asynchronously. Only after step 3 had finished did it change. Unfortunately, as of now, there is no way to avoid a situation like that. Whenever the user wants to configure a secondary index they're creating, they need to do it in another schema change. Since it's not always possible to control how the database is manipulated in the meantime, it leads to problems like the one described. That's not all, though. The fact that it's not possible to configure secondary indexes is inconsistent with other schema entities. When it comes to tables or materialized views, the user always has a means to set some or even all of the properties during their creation. Solution -------- The solution to this problem is extending the `CREATE INDEX` CQL statement by view properties. The syntax is of form: ``` > CREATE INDEX <index name> > .. ON <keyspace>.<table> (<columns>) > .. WITH <properties> ``` where `<properties>` corresponds to both index-specific and view properties [2, 3]. 
View properties can only be used with indexes implemented with materialized views; for example, it will be impossible to create a vector index when specifying any view property (see examples below). When a view property is provided, it will be applied when creating the underlying materialized view. The behavior should be similar to how other CQL statements responsible for creating schema entities work. High-level implementation strategy ---------------------------------- 1. Make auxiliary changes. 2. Introduce data structures representing the new set of index properties: both index-specific and those corresponding to the underlying view. 3. Extend `CREATE INDEX` to accept view properties. 4. Extend `DESCRIBE INDEX` and other `DESCRIBE` statements to include view properties in their output. User documentation is also updated at the steps to reflect the corresponding changes. Implementation considerations ----------------------------- There are a number of schema properties that are now obsolete. They're accepted by other CQL statements, but they have no effect. They include: * `index_interval` * `replicate_on_write` * `populate_io_cache_on_flush` * `read_repair_chance` * `dclocal_read_repair_chance` If the user tries to create a secondary index specifying any of those keywords, the statement will fail with an appropriate error (see examples below). Unlike materialized views, we forbid specifying the clustering order when creating a secondary index [4]. This limitation may be lifted later on, but it's a detail that may or may not prove troublesome. It's better to postpone covering it to when we have a better perspective on the consequences it would bring. Examples -------- Good examples ``` > CREATE INDEX idx ON ks.t (v); > CREATE INDEX idx ON ks.t (v) WITH comment = 'ok view property'; > CREATE INDEX idx ON ks.t (v) .. WITH comment = 'multiple view properties are ok' .. AND synchronous_updates = true; > CREATE INDEX idx ON ks.t (v) .. 
WITH comment = 'default value ok' .. AND synchronous_updates = false; ``` Bad examples ``` > CREATE INDEX idx ON ks.t (v) WITH replicate_on_write = true; SyntaxException: Unknown property 'replicate_on_write' > CREATE INDEX idx ON ks.t (v) .. WITH OPTIONS = {'option1': 'value1'} .. AND comment = 'some text'; InvalidRequest: Error from server: code=2200 [Invalid query] message="Cannot specify options for a non-CUSTOM index" > CREATE CUSTOM INDEX idx ON ks.t (v) .. WITH OPTIONS = {'option1': 'value1'} .. AND comment = 'some text'; InvalidRequest: Error from server: code=2200 [Invalid query] message="CUSTOM index requires specifying the index class" > CREATE CUSTOM INDEX idx ON ks.t (v) .. USING 'vector_index' .. WITH OPTIONS = {'option1': 'value1'} .. AND comment = 'some text'; InvalidRequest: Error from server: code=2200 [Invalid query] message="You cannot use view properties with a vector index" > CREATE INDEX idx ON ks.t (v) WITH CLUSTERING ORDER BY (v ASC); InvalidRequest: Error from server: code=2200 [Invalid query] message="Indexes do not allow for specifying the clustering order" ``` and so on. For more examples, see the relevant tests. References: [1] https://docs.scylladb.com/manual/branch-2025.4/cql/cql-extensions.html#synchronous-materialized-views [2] https://docs.scylladb.com/manual/branch-2025.4/cql/secondary-indexes.html#create-index [3] https://docs.scylladb.com/manual/branch-2025.4/cql/mv.html#mv-options [4] https://docs.scylladb.com/manual/branch-2025.4/cql/dml/select.html#ordering-clause Fixes scylladb/scylladb#16454 Backport: not needed. This is an enhancement. 
Closes scylladb/scylladb#24977 * github.com:scylladb/scylladb: cql3: Extend DESC INDEX by view properties cql3: Forbid using CLUSTERING ORDER BY when creating index cql3: Extend CREATE INDEX by MV properties cql3/statements/create_index_statement: Allow for view options cql3/statements/create_index_statement: Rename member cql3/statements/index_prop_defs: Re-introduce index_prop_defs cql3/statements/property_definitions: Add extract_property() cql3/statements/index_prop_defs.cc: Add namespace cql3/statements/index_prop_defs.hh: Rename type cql3/statements/view_prop_defs.cc: Move validation logic into file cql3/statements: Introduce view_prop_defs.{hh,cc} cql3/statements/create_view_statement.cc: Move validation of ID schema/schema.hh: Do not include index_prop_defs.hh	2026-01-14 09:54:27 +02:00
Pavel Emelyanov	e57ee84662	util: Re-use seastar::util::memory_data_sink A data_sink that stores buffers into an in-memory collection had appeared in seastar recently. In Scylla there's a similar thing that uses memory_data_sink_buffer as a container, so it's possible to drop the data_sink_impl itself in favor of the seastar implementation. For that to work there should be an append_buffers() overload for the aforementioned container. For its nice implementation the container, in turn, needs to get a push_back() method and a value_type trait. The method already exists, but is called put(), so just rename it. There's one more user of this method in the S3 client, and it can enjoy the added append_buffers() helper. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#28124	2026-01-14 08:54:00 +02:00
Avi Kivity	3fc5a32136	tools: toolchain: update instructions for building optimized clang with version information The instructions for building optimized clang neglected to mention that the clang version to be built must be specified. Correct that. Closes scylladb/scylladb#28135	2026-01-14 06:46:20 +02:00
Botond Dénes	6e17bf5c1a	tools/scylla-nodetool: migrate to std::localtime fmt::localtime() is now deprecated, users should migrate to equivalents from the standard libraries. std::localtime is not thread safe, so a local wrapper is introduced, based on the thread-safe localtime_r() (from libc). Closes scylladb/scylladb#27821	2026-01-13 20:46:31 +02:00
Nikos Dragazis	7fa1f87355	db/config: Update sstable_compression_user_table_options description Clarify what "user table" means. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-01-13 20:45:59 +02:00
Nikos Dragazis	1e37781d86	schema: Add initializer for compression defaults In PR `5b6570be52` we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams). Fix this by moving the logic into the `schema_builder` via a schema initializer. This ensures that the default compression settings are applied uniformly regardless of how the table is created, while also keeping the logic in a central place. Register the initializer at startup in all executables where schemas are being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`). Finally, remove the ad-hoc logic from `create_table_statement` (redundant as of this patch), remove the xfail markers from the relevant tests and adjust `test_describe_cdc_log_table_create_statement` to expect LZ4WithDicts as the default compressor. Fixes #26914. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-01-13 20:45:59 +02:00
Nikos Dragazis	d5ec66bc0c	schema: Generalize static configurators into schema initializers Extend the `static_configurator` mechanism to support initialization of arbitrary schema properties, not only static ones, by passing a `schema_builder` reference to the configurator interface. As part of this change, rename `static_configurator` to `schema_initializer` to better reflect its broader responsibility. Add a checkpoint/restore mechanism to allow de-registering an initializer (useful for testing; will be used in the next patch). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-01-13 20:45:59 +02:00
Nikos Dragazis	5b4aa4b6a6	schema: Initialize static properties eagerly Schemas maintain a set of so-called "static properties". These are not user-visible schema properties; they are internal values carried by in-memory `schema` objects for convenience (`349bc1a9b6`, https://github.com/scylladb/scylladb/pull/13170#issuecomment-1469848086). Currently, the initialization of these properties happens when a `schema_builder` builds a schema (`schema_builder::build()`), by invoking all registered "static configurators". This patch moves the initialization of static properties into the `schema_builder` constructor. With this change, the builder initializes the properties once, stores them in a data member, and reuses them for all schema objects that it builds. This doesn't affect correctness as the values produced by static configurators are "static" by nature; they do not depend on runtime state. In the next patch, we will replace the "static configurator" pattern with a more general pattern that also supports initialization of regular schema properties, not just static ones. Regular properties cannot be initialized in `build()` because users may have already explicitly set values via setters, and there is no way to distinguish between default values and explicitly assigned ones. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-01-13 20:45:55 +02:00
Nikos Dragazis	76b2d0f961	db: config: Add accessor for sstable_compression_user_table_options The `sstable_compression_user_table_options` config option determines the default compression settings for user tables. In patch `2fc812a1b9`, the default value of this option was changed from LZ4 to LZ4WithDicts and a fallback logic was introduced during startup to temporarily revert the option to LZ4 until the dictionary compression feature is enabled. Replace this fallback logic with an accessor that returns the correct settings depending on the feature flag. This is cleaner and more consistent with the way we handle the `sstable_format` option, where the same problem appears (see `get_preferred_sstable_version()`). As a consequence, the configuration option must always be accessed through this accessor. Add a comment to point this out. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-01-13 18:30:38 +02:00
Nikos Dragazis	4ec7a064a9	test: Check that CQL and Alternator tables respect compression config In patches `11f6a25d44` and `7b9428d8d7` we added tests to verify that auxiliary tables for both CQL and Alternator have the same default compression settings as their base tables. These tests do not check where these defaults originate from; they just verify that they are consistent. Add some more tests to verify the actual source of the defaults, which is expected to be the `sstable_compression_user_table_options` from the configuration. Unlike the previous tests, these tests require dedicated Scylla instances with custom configuration, so they must be placed under `test/cluster/`. Mark them as xfail-ing. The marker will be removed later in this series. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-01-13 18:16:08 +02:00
Avi Kivity	489d1a0fbc	Merge 'replica: don't throw exceptions for read timeout' from Botond Dénes Read timeouts are a common occurrence and they typically occur when the replica is overloaded. So throwing exceptions for read timeouts is very harmful. Be careful not to throw exceptions while propagating them up the future chain. Add a test to enforce this and detect regressions. Fixes: scylladb/scylladb#25062 Improvement, normally not a backport candidate, but we may decide to backport if customer(s) are found to suffer from this. Closes scylladb/scylladb#25068 * github.com:scylladb/scylladb: reader_permit: remove check_abort() test/boost/database_test: add test for read timeout exceptions sstables/mx/reader: don't throw exceptions on the read-path readers/multishard: don't throw exceptions on the read-path replica/table: don't throw exceptions on the read-path multishard_mutation_query: fix indentation multishard_mutation_query: don't throw exceptions on the read-path service/storage_proxy: don't throw exceptions on the full-scan path cql3/query_processor: don't throw exceptions on the read-path reader_permit: add get_abort_exception()	2026-01-13 16:17:41 +02:00
Calle Wilund	da17e8b18b	gossiper/main: Extend special treatment of node ID resolve for rpc_address Refs #27429 If running with broadcast_address != listen/cql/rpc address, topology gets confused about the varying addresses. Need to special case resolve both addresses as "self". I.e. extend broadcast_address treatment to cql_address as well. Added export of this via gossiper for symmetry.	2026-01-13 14:12:19 +01:00
Avi Kivity	c6dfae5661	treewide: #include Seastar headers with angle brackets Seastar is an external library from the point of view of ScyllaDB, so should be included with angle brackets. Closes scylladb/scylladb#27947	2026-01-13 14:56:15 +02:00
Tomasz Grabiec	63b9a7e2b5	test: pylib: log_browsing: Grep logs without considering newly appended lines At the end of the test case, the framework greps logs for errors and backtraces. The servers are still running at this point. Some test cases enable debug-level logging. If servers manage to produce new lines while the Python script processes the existing ones, the grep will never return. Protect against this by grepping over a file snapshot. Fixes #28086 Closes scylladb/scylladb#28088	2026-01-13 14:41:02 +02:00
Nadav Har'El	fc6fff61d1	docs/alternator: add document on reducing Alternator network costs This patch adds a new document, docs/alternator/network.md, explaining the various mechanisms that can be used to reduce network usage in Alternator. It explains compression of requests and responses, header reduction, rack-aware routing, and RPC compression. Many of these topics - especially support in the client libraries - are work in progress, so some details are still missing in the new document. Still, I think it is a good start that can be improved later. Fixes #27915. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27927	2026-01-13 14:29:01 +02:00
Radosław Cybulski	7b1060fad3	alternator: refactor streams.cc Fix indentation levels from previous commit.	2026-01-13 12:04:13 +01:00
Radosław Cybulski	ef63fe400a	alternator: refactor streams.cc Refactor streams.cc - turn `.then` calls into coroutines. Reduces amount of clutter, lambdas and referenced variables. Note - the code is kept at the same indentation level to ease review, the next commit will fix this.	2026-01-13 12:03:54 +01:00
Karol Baryła	2c471ec57a	protocol-extensions.md: Fix client_options docs When this column and relevant SUPPORTED key were added, the documentation was mistakenly put in the section about shard awareness extension. This commit moves the documentation into a dedicated section. I also expanded it to describe both the new column and the new SUPPORTED key.	2026-01-13 11:49:00 +01:00
Karol Baryła	30d4d3248d	system_keyspace.md: Add client_options column It was recently introduced, but the documentation was not updated.	2026-01-13 11:35:52 +01:00
Karol Baryła	a0a6140436	system_keyspace.md: Fix order in system.clients scheduling_group column is placed after protocol_version in the current version.	2026-01-13 11:33:34 +01:00
Pavel Emelyanov	9ffd22491f	test: Validate S3 endpoints new format works Extend the test_get_object_store_endpoints() test to configure S3 endpoints in full-url format and check that they are rendered properly via API/CQL. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-13 13:24:18 +03:00
Pavel Emelyanov	bd225784bd	docs: Update docs according to new endpoints config option format Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-13 13:24:06 +03:00
Pavel Emelyanov	f227de24b2	object_storage: Create s3 client with "extended" endpoint name For this, add the s3::client::make(endpoint, ...) overload that accepts endpoint in proto://host:port format. Then it parses the provided url and calls the legacy one, that accepts raw host string and config with port, https bit, etc. The generic object_storage_endpoint_param no longer needs to carry the internal s3::endpoint_config, the config option parsing changes respectively. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-13 13:24:06 +03:00
Pavel Emelyanov	8f97e6b3de	s3/storage: Tune config updating Don't prepare s3::endpoint_config from generic code, just pass the region and iam_role_arn (those that can potentially change) to the callback. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-13 13:24:06 +03:00
Pavel Emelyanov	bee3564564	sstable: Shuffle args for s3_client_wrapper Make it construct like gs_client_wrapper -- with generic endpoint param reference and make the storage-specific casts/gets/whatever internally. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-13 13:24:06 +03:00
Pavel Emelyanov	83e88d206c	test: Rename badconf variable into objconf It's not actually a "bad" config, it's just some config the test works with. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-13 13:23:20 +03:00
Pavel Emelyanov	9c627bc44a	test: Split the object_store/test_get_object_store_endpoints test It tests two things -- the way object storage config is represented via API and CQL (from system.config) and that updating config affects CREATE KEYSPACE CQL (with keyspace storage options). It's better to split the test, as its former part is going to be extended to validate old/new config formats (see #26570) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-01-13 13:23:03 +03:00
Robert Bindar	dfcabb5fa4	test: reduce dataset and number of test cases for debug builds Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:46:51 +02:00
Robert Bindar	ca3c57e821	test: bump repair timeout up, it's sometimes not enough in CI Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:46:49 +02:00
Robert Bindar	6f5e58e718	test: refactor test_refresh.py to match test_restore_with_streaming_scopes. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:46:48 +02:00
Robert Bindar	6e636a4231	test: extend test_restore_with_streaming_scopes to test restoring with a different min_tablet_count than the schema was originally created with. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:46:46 +02:00
Robert Bindar	92cd1ddec3	test: Adjust test_restore_primary_replica_different_dc_scope_all to match the new topology architecture Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:46:44 +02:00
Robert Bindar	db13ece9a0	test: Refactor restoring code in test_backup to match SM pattern This patch refactors the restoring code in cluster/test_backup.py so it matches better the way SM works. The patch also refactors test_restore_with_streaming_scopes so to facilitate running restore scenarios under all supported scopes with or w/o primary_replica_only enabled by reusing the servers and backups for a topology. This allows us to test a lot more scenarios without making the test impossibly slow. split from bhalevy/load-balance-primary-replica Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:46:43 +02:00
Robert Bindar	ba01589f53	test: add check_mutation_replicas calls after fresh creation of dataset to validate that mutation assertions are sane split from bhalevy/load-balance-primary-replica Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:46:41 +02:00
Robert Bindar	b835d32cb0	test: extend create_dataset to accept consistency_level Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:46:35 +02:00
Robert Bindar	e7d44356d9	test: refactor check_mutation_replicas so it's more readable split from bhalevy/load-balance-primary-replica Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:46:31 +02:00
Robert Bindar	733b4dbbb7	test: make create_dataset async and refactor so it's configurable with num_keys and min_tablet_count split from bhalevy/load-balance-primary-replica Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:46:20 +02:00
Robert Bindar	f2c8949e4a	test: use defaultdict in collect_mutations split from bhalevy/load-balance-primary-replica Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:45:03 +02:00
Robert Bindar	45faeba97d	test: add log marks to facilitate reusing server for restore split from bhalevy/load-balance-primary-replica Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:44:48 +02:00
Robert Bindar	d88036db48	locator: tablets: Distribute data evenly among primary replicas during restore Most likely `817fdad` uncovered the fact that our choice of primary replica was resonating with tablet allocation and we were ending up picking the same replica as primary within a scope instead of rotating primaryship among all replicas in the scope. This created situations where for instance, restoring into a 9 nodes cluster with primary_replica_only=true would put all data into 3 nodes, leaving the other 6 unused. The balancing of the dataset was performed by the subsequent repair step. split from bhalevy/load-balance-primary-replica Fixes #27281 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-01-13 11:44:20 +02:00
Nadav Har'El	2a831ad373	Merge 'Address CodeQL Errors' from Botond Dénes Address all errors reported by CodeQL as reported on https://github.com/scylladb/scylladb/security/quality. This is a mixed bag, with some harmless issues, while others are severe problems which will result in the code breaking (if it is even run). I suspect some of the more severe problems were found in dead code that is not used at all -- hence nobody noticed. Still, these issues are good to fix, so we can reduce noise in the reports and improve the maintainability of the code. Code cleanup, no backport Closes scylladb/scylladb#27838 * github.com:scylladb/scylladb: pgo/pgo.py: don't mutate input params test/pylib/coverage_utils.py: profdata_to_lcov: don't mutate defaulted param test/cluster/dtest/tools/misc.py: add type annotations to list_to_hashed_dict() idl-compiler.py: raise TypeError instead of raw str test/pylib/lcov_utils.py: don't call set when iterating over it configure.py: move away from .format(**locals()) test/cluster/object_store/conftest.py: add missing call to parent constructor idl-compiler.py: add missing call to parent class constructor tools/scyllatop/fake.py: pass correct number of args to _add_metric	2026-01-13 11:43:57 +02:00
Nadav Har'El	609b283d98	test/cqlpy: add another reproducer for known issue This patch adds a second reproducer for issue #25839, which is about scanning a secondary index which returns partial results. The new test uses count(*) without requesting the rows themselves, but still has the same problem of counting only part of the rows. This is the problem that a user reported in issue #28026. Unlike the previous test, this test works correctly on older versions of Scylla - by using larger data, like on Cassandra - without changing a configuration variable that did not yet exist. So with this test we can confirm that this bug is a Scylla 5.2 regression: test/cqlpy/run --release 5.1 test_secondary_index.py::test_short_count passes, while test/cqlpy/run --release 5.2 test_secondary_index.py::test_short_count fails. It also fails on master, so the new test is marked "xfail". Refs #25839 Refs #28026 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28108	2026-01-13 11:15:27 +02:00
Botond Dénes	7b562bb185	Merge 'system.clients: Address SSL refactor review comments' from Piotr Smaron and Copilot Addresses outstanding review comments from PR #22961 where SSL field collection was refactored into generic_server::connection base class. This patch consists of minor cosmetic enhancements for increased readability, mainly, with some minor fixups explained in specific commits. Cosmetic changes, no need to backport. Closes scylladb/scylladb#27575 * github.com:scylladb/scylladb: test_ssl: fix indentation generic_server: improve logging broken TLS connection test_ssl: improve timeout and readability alternator/server: update SSL comment	2026-01-13 11:00:26 +02:00
Botond Dénes	354c805e6a	reader_permit: remove check_abort() This method can cause performance regressions if used in the wrong place -- namely if it is used to abort reads by throwing the abort exception. Exceptions should be propagated during reads without throwing them, otherwise they cause extra CPU load, making a bad situation worse. Remove this method, so it doesn't accidentally get more users, migrate remaining users to get_abort_exception().	2026-01-13 10:47:57 +02:00
Botond Dénes	a0ddac655d	test/boost/database_test: add test for read timeout exceptions Read timeouts shouldn't cause exceptions to be thrown; exceptions should be propagated solely via futures, otherwise they put extra strain on the system at the worst possible time: when it is already overloaded enough that reads have started to time out. The test covers both single partition reads and full scans, with two scenarios: * timeout while the read is queued * timeout when the read is already ongoing	2026-01-13 10:47:57 +02:00
Botond Dénes	6801611b94	sstables/mx/reader: don't throw exceptions on the read-path If the read is aborted via the permit (due to timeout) don't throw the abort exception, instead propagate it via the future chain. Also, use try_catch<> instead of try ... catch to decorate malformed_sstable_exception with the file name.	2026-01-13 10:47:57 +02:00
Botond Dénes	fa6ffe9d20	readers/multishard: don't throw exceptions on the read-path Use coroutine::try_future() to avoid exceptions taking flight and triggering expensive stack-unwinding. Especially bad for common exceptions like timeouts.	2026-01-13 10:47:57 +02:00
Botond Dénes	1f6f7ceb68	replica/table: don't throw exceptions on the read-path Use coroutine::as_future() to avoid exceptions taking flight and triggering expensive stack-unwinding. Especially bad for common exceptions like timeouts. Not using coroutine::try_future(), because on the error path, the querier has to be closed.	2026-01-13 10:47:57 +02:00
Botond Dénes	131489fe48	multishard_mutation_query: fix indentation Left broken by previous patch.	2026-01-13 10:47:57 +02:00
Botond Dénes	7eeb7fcfba	multishard_mutation_query: don't throw exceptions on the read-path Use coroutine::try_future() to avoid exceptions taking flight and triggering expensive stack-unwinding. Especially bad for common exceptions like timeouts.	2026-01-13 10:47:57 +02:00
Botond Dénes	9bea842c01	service/storage_proxy: don't throw exceptions on the full-scan path Use coroutine::try_future() to avoid exceptions taking flight and triggering expensive stack-unwinding. Especially bad for common exceptions like timeouts.	2026-01-13 10:47:57 +02:00
Botond Dénes	404f5b0808	cql3/query_processor: don't throw exceptions on the read-path Use coroutine::try_future() to avoid exceptions taking flight and triggering expensive stack-unwinding. Especially bad for common exceptions like timeouts.	2026-01-13 10:47:57 +02:00
Botond Dénes	5cd237fec5	reader_permit: add get_abort_exception() Will replace check_abort(). The latter throws an exception which is something we want to avoid when a read is aborted, in particular when it times out. Also add a convenience get_abort_exception() method to mutation_reader.	2026-01-13 10:47:57 +02:00
Michał Hudobski	c8aa49b196	vector search, paging: add test for paging warnings We add a test that validates that indexed queries do not throw a warning related to vector search paging Fixes: SCYLLADB-248 Closes scylladb/scylladb#28077	2026-01-13 10:33:36 +02:00
Nadav Har'El	34191d8fd4	alternator: fix signature checking of headers with multiple spaces We have a test in test_compressed_response.py that reproduces a bug where in Alternator's signature checking code, if a header had multiple consecutive spaces its signature isn't checked correctly. This patch fixes this and that xfailing test begins to pass. But it turns out that the handling of multiple consecutive spaces in headers when calculating the authentication signature is just one example of "header canonization" that the AWS Signature V4 specification requires us to do. There are additional types of header canonization that Alternator must do, and this patch also adds new tests in test_authorization.py for checking all the types of canonization. Fortunately, for all other types of canonizations, we already handled them correctly - Alternator already lowercases header names, sorts them alphabetically and removes leading and trailing spaces before calculating the signature. So most of the new tests added pass also without this patch, and only one of them, test_canonization_middle_whitespace, needs this patch to pass. As usual, all the new tests also pass on DynamoDB. Fixes #27775 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#28102	2026-01-13 10:29:13 +02:00
Andrei Chekun	3f14d45d6e	test.py: fix the link for the failed_test directory With the new UI, Jenkins escapes HTML tags during rendering to prevent XSS. This change shows just the link, without a custom name, as a string that can be copied and then pasted to navigate to the failed directory. Closes scylladb/scylladb#28062	2026-01-13 10:22:38 +02:00
Yaniv Kaul	4e3aa53f8b	test/rest_api/test_gossiper.py: fix for Variable defined multiple times To fix the problem, we need to remove the first, redundant definition of test_gossiper_unreachable_endpoints (lines 19-24). The second definition (lines 25-40) should be retained as it has more substantial test logic. No other code changes or imports are needed, as the test logic is preserved fully in the retained definition. Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27632	2026-01-13 10:17:29 +02:00
Yaniv Kaul	1722e35a7e	.github/workflows/docs-validate-metrics.yml: add workflow permissions Potential fix for code scanning alert no. 167: Workflow does not contain permissions Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27819	2026-01-13 10:16:35 +02:00
Yaniv Kaul	39c26794ec	.github/workflows/docs-pr.yaml: add workflow permissions Potential fix for code scanning alert no. 145: Workflow does not contain permissions Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27808	2026-01-13 10:15:45 +02:00
Yaniv Michael Kaul	d9cce7ccbf	.github/copilot-instructions.md: add test philosophy Try to teach CoPi a bit about how we'd like to see it implement tests, according to this repo's best practices. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#28032	2026-01-13 10:05:13 +02:00
Anna Stuchlik	8141283262	doc: add the version name to the Install ScyllaDB page Fixes https://github.com/scylladb/scylladb/issues/28021 Closes scylladb/scylladb#28022	2026-01-13 10:01:48 +02:00
Avi Kivity	66aee0fb5e	alternator: add optional listeners for proxy protocol v2 Following `954f2cbd2f`, which added proxy protocol v2 listeners for CQL, we do the same for alternator. We add two optional ports for plain and TLS-wrapped HTTP. We test each new port, that the old ports still work, and that mixing up a port with no proxy protocol and a connection with proxy protocol (or the opposite) fails. The latter serves to show that the testing strategy is valid and doesn't just pass whatever happens. We also verify that the correct addresses (and TLS mode) show up in system.clients. Closes scylladb/scylladb#27889	2026-01-13 09:59:24 +02:00
Nadav Har'El	e7df03127b	alternator: support "deflate" encoding in request compression Currently Alternator supports compressed requests in the gzip format with "Content-Encoding: gzip". We did not support any other compression formats. It turns out that DynamoDB also supports the "deflate" encoding. The "deflate" format is just a small variant of gzip and also supported by the same zlib library that we already use, so it is very easy to add support for it as well. So this patch adds it. Beyond compatibility with DynamoDB, another benefit of this patch is symmetry with our response compression support (PR #27454), where we supported both gzip and deflate compression of responses - so we should support the same for requests. This patch also adds tests for Content-Encoding: deflate, which pass on DynamoDB (proving that "deflate" is indeed supported there). On Alternator the new tests failed before this patch and pass with this patch. Refs #27243 (which asks to support more compression formats). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27917	2026-01-13 09:58:12 +02:00
Nadav Har'El	a1f198d453	test/cqlpy: translate Cassandra's test InsertInvalidateSizedRecordsTest This is a translation of Cassandra's CQL unit test source file validation/operations/InsertInvalidateSizedRecordsTest.java into our cqlpy framework. This is one of the tests added to Cassandra as part of the vector search work, but actually has nothing to do with vector search - it checks what happens when key columns of different types exceed their maximum size (64KB). Unfortunately, each one of the tests added here fails on ScyllaDB, providing more reproducers for two already known issues (which already had plenty of reproducers...): Refs #8627 Cleanly reject updates with indexed values where value > 64k Refs #12247 Better error reporting for oversized keys during INSERT One of the tests also fails on Cassandra, due to CASSANDRA-19270. It is not clear to me how this unit test actually passed on Cassandra, I can only guess that the Python driver somehow makes the request differently than what the Java unit tests use to make requests to Cassandra. One of the tests in the original Cassandra source file I did not translate, readingEmptyStringsForDifferentTypes, because it tests cqlsh, not pure CQL. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27944	2026-01-13 08:59:36 +02:00
Yaniv Kaul	2c26ca3651	.github/workflows/read-toolchain.yaml: hardening for code scanning alert no. 146: Workflow does not contain permissions #27807 Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27807	2026-01-13 08:57:53 +02:00
tomek7667	19313d67e3	docs/cql/ddl.rst: fix formatting of deprecated initial sub-option Closes scylladb/scylladb#26852	2026-01-13 08:55:24 +02:00
Anna Stuchlik	14cadcbc18	doc: remove references to Open Source Fixes https://github.com/scylladb/scylladb/issues/28118 Closes scylladb/scylladb#28119	2026-01-13 08:43:26 +02:00
Botond Dénes	25db8f6a70	pgo/pgo.py: don't mutate input params It is considered a dangerous practice with possible unintended side-effects, affecting later calls to the same function. Found by CodeQL "Modification of parameter with default".	2026-01-13 08:33:17 +02:00
Botond Dénes	4ec3d76b87	test/pylib/coverage_utils.py: profdata_to_lcov: don't mutate defaulted param It is considered a dangerous practice as it creates a side-effect for later calls to the same function. Create a new variable instead and mutate that. Also remove the unused update_known_ids parameter, which defaults to True and no caller changes it. Passing False to this param also seems to have no effect. Instead of trying to guess what the desired effect of passing False is and fixing it, just remove this unused param. Found by CodeQL "Modification of parameter with default".	2026-01-13 08:33:17 +02:00
Botond Dénes	157fe2b80e	test/cluster/dtest/tools/misc.py: add type annotations to list_to_hashed_dict() To hopefully shut up CodeQL "Iterable can be either a string or a sequence". This change makes the code more readable anyway, so it is more than just a gratuitous change to make some code-scanner happy.	2026-01-13 08:33:17 +02:00
Botond Dénes	849332c5c2	idl-compiler.py: raise TypeError instead of raw str Unlike in C++, in Python one can only throw objects which inherit from Exception. The message complains about wrong type so wrap it in TypeError before passing to raise. Found by CodeQL "Illegal raise".	2026-01-13 08:33:17 +02:00
Botond Dénes	3c6e9637a0	test/pylib/lcov_utils.py: don't call set when iterating over it Probably a typo. Found by CodeQL "Non-callable called".	2026-01-13 08:33:17 +02:00
Botond Dénes	eb4ee5a126	configure.py: move away from .format(locals()) Use f-strings instead, they are just as convenient with the added bonus of editors providing syntax highlighting for them. Additionally, this shuts up the CodeQL complaint about "Suspicious unused loop iteration variable" in loops where the loop variable was passed to format indirectly via locals().	2026-01-13 08:33:17 +02:00
Botond Dénes	d2db84714e	test/cluster/object_store/conftest.py: add missing call to parent constructor Replace manual init of parent fields. Found by CodeQL: "Missing call to superclass `__init__` during object initialization". The secret_key is now initialized to server.secret_key, instead of server.access_key. This probably fixes a (benign) bug.	2026-01-13 08:33:17 +02:00
Botond Dénes	26e978cf08	idl-compiler.py: add missing call to parent class constructor Replace manual init of parent fields. Found by CodeQL "Missing call to superclass `__init__` during object initialization".	2026-01-13 08:33:17 +02:00
Botond Dénes	20d0782dc4	tools/scyllatop/fake.py: pass correct number of args to _add_metric Discovered by CodeQL "Wrong number of arguments in call".	2026-01-13 08:33:17 +02:00
Michał Jadwiszczak	649efd198f	docs/dev/service_levels: update docs to service levels on raft Since Scylla 6.0, service levels are managed by Raft group0. This patch updates the table name used by service levels and adds a paragraph describing service levels on raft. Fixes scylladb/scylladb#18177 Closes scylladb/scylladb#26556	2026-01-13 06:49:18 +02:00
Botond Dénes	725e99e263	Merge 'test.py: Fix boost logs' from Andrei Chekun Write the boost logs into stdout in HRF format and in XML to the file. The XML file will be used for parsing and providing the error information in the summary section of the fail. Fixes: https://github.com/scylladb/scylladb/issues/28045 Framework enhancements, no need to backport. Closes scylladb/scylladb#28107 * github.com:scylladb/scylladb: test.py: remove XML log from fail summary test.py: fix truncated boost output to stdout file	2026-01-13 06:19:05 +02:00
Anna Stuchlik	791ab4ed02	doc: clarify the information about SSTable version support Fixes https://github.com/scylladb/scylladb/issues/27765 Closes scylladb/scylladb#27835	2026-01-13 06:17:37 +02:00
Tomasz Grabiec	8dc20e6aaf	test: cluster: Add reproducer for missed notification in topology coordinator	2026-01-13 00:40:23 +01:00
Tomasz Grabiec	7a04dd2d22	topology_coordinator: Wake up the state machine after stats refresh Otherwise, coordinator may not react to changing stats after explicit calls to trigger_load_stats_refresh() done on node replace or table creation, if stats take longer to refresh than it takes the coordinator to go idle. The periodic refresh does wake up the topology coordinator, so the issue is not dramatic in production, but it's annoying in tests, which take longer because of that. Fixes #25163	2026-01-13 00:40:23 +01:00
Tomasz Grabiec	d910e6ea63	topology_coordinator: Move tablet_load_stats_refresh_before_rebalancing injection earlier Refreshing stats will signal _topo_sm.event, so do it before waiting for the event, to avoid busy looping in the coordinator. Busy looping produces lots of logs in test cases which enable debug-level logging in the raft logger. Refs #28086	2026-01-13 00:40:16 +01:00
Tomasz Grabiec	e5dee2aab8	topology_coordinator: Fix potential missed notification Checking for work is not atomic, so there is room for a missed notification, especially since notifications are not always triggered from fibers which take the group0 guard. Fix by subscribing to the event before checking for work. Fixes #27958	2026-01-13 00:39:01 +01:00
Tomasz Grabiec	2b7aa3211d	topology_coordinator: Refresh load stats after table is created or altered We switched to the size-based load balancing, which now has more strict requirements for load stats. We no longer need only per-node stats (capacity), but also per-tablet stats. Bootstrapping a node triggers stats refresh, but allocating tablets on table creation didn't. So after creating a table, load balancer couldn't make progress for up to 60s (stats refresh period). This makes tests take longer, and can even cause failures if tests are using a low-enough timeout. Fixes #27921	2026-01-13 00:38:59 +01:00
Tomasz Grabiec	663831ebd7	tablets: Do a group0 read barrier on tablet load stats refresh Stats refresh will be triggered on topology coordinator by events like allocating new tablets on table creation. For refresh to be effective, all replicas must see the new tablets, otherwise stats will be incomplete.	2026-01-13 00:38:00 +01:00
Tomasz Grabiec	c4c5ed5aba	topology_coordinator: Ensure stats are refreshed in the gossip scheduling group Refresh can be triggered from different places, but it should run in the gossip scheduling group, like group0 operations.	2026-01-13 00:38:00 +01:00
Tomasz Grabiec	5e6935f276	test: Use ManagerClient.{disable,enable}_tablet_balancing()	2026-01-13 00:38:00 +01:00
Tomasz Grabiec	6936704677	test: Add missing calls to disable_tablet_balancing() in tests which use move_tablet() API If a test tries to move a tablet, it assumes the tablets are stable. This fixes flakiness exposed by size-based load-balancing and a later change to refresh stats sooner.	2026-01-13 00:38:00 +01:00
Tomasz Grabiec	c8098e07c9	test: pylib: Introduce ManagerClient.{disable,enable}_tablet_balancing() It's a global operation, so we can use any server. It's not only convenient: calling it via api.disable_tablet_balancing() misleads people into thinking that it's a per-server operation, which leads to a proliferation of code that does it needlessly on all servers.	2026-01-13 00:38:00 +01:00
Andrei Chekun	dfa6a61721	test.py: remove XML log from fail summary Remove XML log from fail summary. Add text from the first error in the XML file to the fail summary	2026-01-12 14:26:58 +01:00
Andrei Chekun	d96a50481a	test.py: fix truncated boost output to stdout file Change how the boost log output is captured. With this change, boost writes its log to stdout in HRF format and to a tempfile in XML format. This makes debugging easier, since all messages are in the output file and still in the fail summary.	2026-01-12 14:15:23 +01:00
Botond Dénes	6bcc18e5c6	Merge 'test.py: integrate python tests to be executed with pytest runner' from Andrei Chekun This will move responsibility for running tests with pytest in the same manner as it was done with boost tests. From this commit, test.py is not responsible anymore for running python tests and relies completely on pytest. This is another step for unification of test execution. Convert skip_mode function to `pytest.mark` to be able to use it to annotate the whole module instead of each test explicitly. NOTE: this is a breaking change. From this commit, several directories with tests will require a path to the file to launch the test. Affected directories test/alternator test/broadcast_tables test/cql test/cqlpy test/rest_api Changes only in framework, so no backport. This PR will increase the number of tests by 30, due to how test.py and pytest discover tests. test.py counts a file as a test, and when skip is used in suite.yaml it excludes the tests from discovery completely.
pytest, by contrast, counts each test function as a test and uses the skip_mode mark: it discovers the tests but skips them during execution, hence the difference. test.py output before PR: ```bash > ./test.py --mode=release rest_api/test_compaction_task rest_api/test_task_manager --list --no-gather-metrics ``` test.py output in this PR: ```bash > ./test.py --mode=release test/rest_api/test_compaction_task.py test/rest_api/test_task_manager.py --list rest_api/test_compaction_task.py::test_global_major_keyspace_compaction_task.release.1 rest_api/test_compaction_task.py::test_major_keyspace_compaction_task.release.1 rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task.release.1 rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task.release.1 rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task.release.1 rest_api/test_compaction_task.py::test_reshaping_compaction_task.release.1 rest_api/test_compaction_task.py::test_resharding_compaction_task.release.1 rest_api/test_compaction_task.py::test_regular_compaction_task.release.1 rest_api/test_compaction_task.py::test_compaction_task_abort.release.1 rest_api/test_compaction_task.py::test_major_keyspace_compaction_task_async.release.1 rest_api/test_compaction_task.py::test_cleanup_keyspace_compaction_task_async.release.1 rest_api/test_compaction_task.py::test_offstrategy_keyspace_compaction_task_async.release.1 rest_api/test_compaction_task.py::test_rewrite_sstables_keyspace_compaction_task_async.release.1 rest_api/test_compaction_task.py::test_compaction_progress[major_keyspace_compaction_task_impl_run_fail].release.1 rest_api/test_compaction_task.py::test_compaction_progress[shard_major_keyspace_compaction_task_impl_run_fail].release.1 rest_api/test_compaction_task.py::test_compaction_progress[table_major_keyspace_compaction_task_impl_run_fail].release.1 rest_api/test_task_manager.py::test_task_manager_modules.release.1
rest_api/test_task_manager.py::test_task_manager_tasks.release.1 rest_api/test_task_manager.py::test_task_manager_status_running.release.1 rest_api/test_task_manager.py::test_task_manager_status_done.release.1 rest_api/test_task_manager.py::test_task_manager_status_failed.release.1 rest_api/test_task_manager.py::test_task_manager_not_abortable.release.1 rest_api/test_task_manager.py::test_task_manager_wait.release.1 rest_api/test_task_manager.py::test_task_manager_ttl.release.1 rest_api/test_task_manager.py::test_task_manager_user_ttl.release.1 rest_api/test_task_manager.py::test_task_manager_sequence_number.release.1 rest_api/test_task_manager.py::test_task_manager_recursive_status.release.1 rest_api/test_task_manager.py::test_module_not_exists.release.1 rest_api/test_task_manager.py::test_task_folding.release.1 rest_api/test_task_manager.py::test_abort_on_unregistered_task.release.1 ``` Fixes: https://github.com/scylladb/scylladb/issues/27716 Closes scylladb/scylladb#26395 * github.com:scylladb/scylladb: test.py: fix test_vector_similarity.py docs: add directories excluded from test.py test.py: prevent file descriptors leaking test.py: capture print inside the test test.py: do not print header for collection with test.py test.py: remove not supported functionality test.py: switch of execution of several test directories by test.py runner test.py: integrate python tests to be executed with pytest runner test.py: fix test/vector_search_validator to be able to run with pytest test.py: prepare base class for migration test.py: move environment preparation to one method test.py: introduce new environment variable TESTPY_PREPARED_ENVIRONMENT	2026-01-12 14:17:19 +02:00
Botond Dénes	04b8f72946	Merge 'repair: Implement auto repair for tablet repair' from Asias He repair: Implement auto repair for tablet repair This patch implements the basic auto repair support for tablet repair. It was decided to add no per table configuration for the initial implementation, so two scylla yaml config options are introduced to set the default auto repair configs for all the tablet tables. - auto_repair_enabled_default Set true to enable auto repair for tablet tables by default. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. - auto_repair_threshold_default_in_seconds Set the default time in seconds for the auto repair threshold for tablet tables. If the time since last repair is bigger than the configured time, the tablet is eligible for auto repair. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. The following metrics are added: - auto_repair_needs_repair_nr The number of tablets with auto repair enabled that need repair - auto_repair_enabled_nr The number of tablets with auto repair enabled The metrics are useful to tell if auto repair is falling behind. In the future, more auto repair scheduling will be added, e.g., scheduling based on the repaired and unrepaired sstable set size, tombstone ratio and so on, in addition to the time based scheduling. Fixes SCYLLADB-99 New feature. No backport. Closes scylladb/scylladb#27534 * github.com:scylladb/scylladb: topology_coordinator: Add metrics for tablet repair repair: Implement auto repair for tablet repair	2026-01-12 14:16:01 +02:00
Yaniv Kaul	f1c9eda49e	Potential fix for code scanning alert no. 144: Workflow does not contain permissions Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27809	2026-01-12 12:21:35 +02:00
Marcin Maliszkiewicz	3c9f52e709	Merge 'doc: update the Web Installer instructions' from Anna Stuchlik This PR: - Replaces a fixed version name with the variable for the current version in the instructions for installing a non-default version with Web Installer. This will make using the installer more user-friendly. - Removes the instruction for Open Source from the Web Installer docs. Fixes https://github.com/scylladb/scylladb/issues/28005 Fixes https://github.com/scylladb/scylladb/issues/28079 Closes scylladb/scylladb#28046 * github.com:scylladb/scylladb: doc: remove the instruction for Open Source from the Web Installer docs doc: add the version variable to the Web Installer instructions	2026-01-12 11:10:04 +01:00
Petr Gusev	889d7782ed	treewide: use coroutine::maybe_yield in coroutines It's more efficient since coroutine::maybe_yield returns a lightweight struct (awaitable), not the future. Closes scylladb/scylladb#28101	2026-01-12 10:38:47 +01:00
Marcin Maliszkiewicz	09af3828ab	auth: remove confusing deprecation msg from hash_with_salt Closes scylladb/scylladb#27705	2026-01-12 10:12:54 +01:00
Asias He	7980890029	topology_coordinator: Add metrics for tablet repair - scylla_tablet_ops_failed Number of failed tablet {auto, user} repair - scylla_tablet_ops_succeeded Number of succeeded tablet {auto, user} repair Currently auto_repair and user_repair tablet task are added. We can add more tablet tasks later, e.g., rebuild, migration.	2026-01-12 15:26:05 +08:00
Alex	e430065c92	db: views: serialize create/drop view operations via shard 0 Create and drop view operations are currently performed on all shards, and their execution is not fully serialized. On slower processors this can lead to interleavings that leave stale entries in `system.scylla_views_build` A problematic sequence looks like this: * `on_create_view()` runs on shard 0 → entries for shard 0 and shard 1 are created * `on_drop_view()` runs on shard 0 → entry for shard 0 is removed * `on_create_view()` runs on shard 1 → entries for shard 0 and shard 1 are created again * `on_drop_view()` runs on shard 1 → entry for shard 1 is removed, while the shard 0 entry remains This results in a leftover row in `system.scylla_views_builds_in_progress`, causing `view_build_test.cc` to get stuck indefinitely in an eventual state and eventually be terminated by CI. This patch fixes the issue by fully serializing all view create and drop operations through shard 0. Shard 0 becomes the single execution point and notifies other shards to perform their work in order. Requests originating. new process: - view_builder::on_create_view(...) runs only on shard 0 and kicks off dispatch_create_view(...) in the background. - dispatch_create_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed. - dispatch_create_view(...) calls handle_seed_view_build_progress(...) on shard 0. That: - writes the global “build progress” row across all shards via _sys_ks.register_view_for_building_for_all_shards(...). - After seeding, dispatch_create_view(...) broadcasts to all shards with container().invoke_on_all(...). - Each shard runs handle_create_view_local(...), which: - waits for pending base writes/streams, flushes the base, - resets the reader to the current token and adds the new view, - handles errors and triggers _build_step to continue processing. Drop view - view_builder::on_drop_view(...) runs only on shard 0 and kicks off dispatch_drop_view(...) 
in the background. - dispatch_drop_view(...) (shard 0) first checks should_ignore_tablet_keyspace(...) and returns early if needed. - It broadcasts handle_drop_view_local(...) to all shards with invoke_on_all(...). - Each shard runs handle_drop_view_local(...), which: - removes the view from local build state (_base_to_build_step and _built_views) by scanning existing steps, - ignores missing keyspace cases. - After all shards finish local cleanup, shard 0 runs handle_drop_view_global_cleanup(...), which: - removes global build progress, built‑view state, and view build status in system tables, Shutdown - drain() waits on _view_notification_sem before _sem so in‑flight dispatches finish before bookkeeping is halted. In addition, the test is adjusted to remove the long eventual wait (596.52s / 30 iterations) and instead rely on the default wait of 17 iterations (~4.37 minutes), eliminating unnecessary delays while preserving correctness. Fixes: https://github.com/scylladb/scylladb/issues/27898 Backport: not required as the problem happens on master Closes scylladb/scylladb#27929	2026-01-12 09:23:22 +02:00
Michał Hudobski	92c988514c	vector_search: allow all where clauses in vector search queries To prepare for implementation of filtering we skip validation of where clauses in vector search queries. All queries that would be blocked by the lack of ALLOW FILTERING now will pass through. Fixes: VECTOR-410 Closes scylladb/scylladb#27758	2026-01-11 12:56:44 +02:00
Marcin Maliszkiewicz	03e0dd0841	Merge 'test/alternator: fix most tests to run on DynamoDB' from Nadav Har'El We can run Alternator's tests against DynamoDB with `test/alternator/run --aws`, and our intention is that all except a few specially marked should pass on DynamoDB - indicating that the test itself is correct and checks compatibility with DynamoDB and not with some misunderstood spec. Before this patch series, almost two dozen Alternator's tests failed on DynamoDB. This series fixes most of them. Refs #26079 (it fixes almost all the problems but probably not all of them so let's keep the issue open for a while longer) Closes scylladb/scylladb#27995 * github.com:scylladb/scylladb: test/alternator: fix some expected error messages to fit DynamoDB test/alternator: fix compressed request test on non-us-east1 test/alternator: fix test's expected error message on DynamoDB test/alternator: mark Alternator-only test scylla_only test/alternator: fix test on DynamoDB test/alternator: increase wait_for_gsi() timeout test/alternator: fix test passing a spurious parameter	2026-01-09 18:05:20 +01:00
Botond Dénes	7e1c8776b7	docs: remove sstabledump and sstablemetadata These tools are deprecated and no longer shipped by ScyllaDB packages. They no longer support the latest SSTable versions and ScyllaDB-only features, like encryption and dictionary based compression. Remove them from the documentation. Closes scylladb/scylladb#27608	2026-01-09 17:31:54 +01:00
Dawid Mędrek	2385afa1c7	scripts/pull_github_pr.sh: Update instructions for creating token The interface of Jenkins has changed, and the instructions for creating a token are out-of-date. This commit updates them. Closes scylladb/scylladb#28054	2026-01-09 17:45:00 +02:00
Ferenc Szili	0ede8d154b	docs: add docs for size based load balancing This patch updates the documentation for size based load balancing. Closes scylladb/scylladb#27616	2026-01-09 16:25:25 +02:00
Andrei Chekun	1f60208aa0	test.py: fix test_vector_similarity.py There is a known limitation of xdist. Because it runs discovery in each thread and then compares the result with the master thread, the discovered lists of tests must be the same. Sets have no guaranteed order, so they should not be used for parametrized testing, because test discovery with xdist will fail. This PR just converts set to dist, to eliminate the issue mentioned above.	2026-01-09 15:08:40 +01:00
Yaniv Michael Kaul	af8eaa9ea5	scripts: fixes flagged by CodeQL/PyLens Unused imports, unused variables and such. Initially, there were no functional changes, just to get rid of some standard CodeQL warnings. I've then broken the CI, as apparently there's an install-time(!?) Python script creation for the sole purpose of product naming. I changed it - we have it in etcdir, as SCYLLA-PRODUCT-FILE. So added (copied from a different script) a get_product() helper function in scylla_util.py and used it instead. While at it, also fixed the too broad import from scylla_util, which 'forced' me to also fix other specific imports (such as shutil). Improvement - no need to backport. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#27883	2026-01-09 15:13:12 +02:00
Anna Stuchlik	396093ff60	doc: remove the instruction for Open Source from the Web Installer docs Fixes https://github.com/scylladb/scylladb/issues/28079	2026-01-09 14:07:32 +01:00
Botond Dénes	af6cb0d0a4	Merge 'raft topology: preserve IP -> ID mapping of a replacing node on restart' from Patryk Jędrzejczak We currently do it only for a bootstrapping node, which is a bug. The missing IP can cause an internal error, for example, in the following scenario: - replace fails during streaming, - all live nodes are shut down before the rollback of replace completes, - all live nodes are restarted, - live nodes start hitting internal error in all operations that require IP of the replacing node (like client requests or REST API requests coming from nodetool). We fix the bug here, but we do it separately for replace with different IP and replace with the same IP. For replace with different IP, we persist the IP -> host ID mapping in `system.peers` just like for bootstrap. That's necessary, since there is no other way to determine IP of the replacing node on restart. For replace with the same IP, we can't do the same. This would require deleting the row corresponding to the node being replaced from `system.peers`. That's fine in theory, as that node is permanently banned, so its IP shouldn't be needed. Unfortunately, we have many places in the code where we assume that IP of a topology member is always present in the address map or that a topology member is always present in the gossiper endpoint set. Examples of such places: - nodetool operations, - REST API endpoints, - `db::hints::manager::store_hint`, - `group0_voter_handler::update_nodes`. We could fix all those places and verify that drivers work properly when they see a node in the token metadata, but not in `system.peers`. However, that would be too risky to backport. We take a different approach. We recover IP of the replacing node on restart based on the state of the topology state machine and `system.peers` just after loading `system.peers`. We rely on the fact that group 0 is set up at this point. 
The only case where this assumption is incorrect is a restart in the Raft-based recovery procedure. However, hitting this problem then seems improbable, and even if it happens, we can restart the node again after ensuring that no client and REST API requests come before replace is rolled back on the new topology coordinator. Hence, it's not worth complicating the fix (by e.g. looking at the persistent topology state instead of the in-memory state machine). Fixes #28057 Backport this PR to all branches as it fixes a problematic bug. Closes scylladb/scylladb#27435 * github.com:scylladb/scylladb: gossiper: add_saved_endpoint: make generations of excluded nodes negative test: introduce test_full_shutdown_during_replace utils: error_injection: allow aborting wait_for_message raft topology: preserve IP -> ID mapping of a replacing node on restart	2026-01-09 14:56:16 +02:00
Calle Wilund	a7cdb602e1	db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc Fixes #27992 When doing a commit log oversized allocation, we lock out all other writers by grabbing the _request_controller semaphore fully (max capacity). We thereafter assert that the semaphore is in fact zero. However, due to how things work with the bookkeeping here, the semaphore can in fact become negative (some paths will not actually wait for the semaphore, because this could deadlock). Thus, if, after we grab the semaphore and execution actually returns to us (task schedule), new_buffer via segment::allocate is called (due to a non-fully-full segment), we might in fact grab the segment overhead from zero, resulting in a negative semaphore. The same problem applies later when we try to sanity check the return of our permits. Fix is trivial, just accept less-than-zero values, and take the same possible ltz-value into account in the exit check (returning units). Added whitebox (special callback interface for sync) unit test that provokes/creates the race condition explicitly (and reliably). Closes scylladb/scylladb#27998	2026-01-09 14:06:58 +02:00
Łukasz Paszkowski	7bf26ece4d	test_user_writes_rejection: Fix test flakiness caused by typo and non-local CL=ONE reads The current code: ``` try: cql.execute(f"INSERT INTO {cf} (pk, t) VALUES (-1, 'x')", host=host[0], execution_profile=cl_one_profile).result() except Exception: pass ``` contains a typo: `host=host[0]`, which throws an exception because a Host object is not subscriptable. The test does not fail because the except block is too broad and suppresses all exceptions. Fixing the typo alone is insufficient. The write still succeeds because the remaining nodes are UP and the query uses CL=ONE, so no failure should be expected. Another source of flakiness is data verification: ``` SELECT * FROM {cf} WHERE pk = 0; ``` Even when a coordinator is explicitly provided, using CL=ONE does not guarantee a local read. The coordinator may forward the read request to another replica, causing the verification to fail nondeterministically. This patch rewrites the tests to address these issues: - Fix the typo: `host[0]` to `hosts[0]` - Verify data using `MUTATION_FRAGMENTS({cf})` which guarantees a local read on the coordinator node - Reconnect the driver after node restart Fixes https://github.com/scylladb/scylladb/issues/27933 Closes scylladb/scylladb#27934	2026-01-09 13:42:05 +02:00
Andrei Chekun	82e81a8664	docs: add directories excluded from test.py Add new directories that are excluded from the test.py executor and will be fully managed by pytest	2026-01-09 11:59:25 +01:00
Andrei Chekun	353bae7d66	test.py: prevent file descriptors leaking After the migration to pytest, file descriptors stay open for the whole life of the process. Previously this was not an issue, because test.py executed only one file with Popen, so the descriptors were freed when that process finished. With the new approach they remain held; this change eliminates that. Also fix the case where getting a cluster failed and we then tried to set it dirty while it was None. Put the cluster into the pool only if it was created	2026-01-09 11:59:25 +01:00
Andrei Chekun	67c5267053	test.py: capture print inside the test Capture the printing inside the test case to output it after the test and not directly during the testing process.	2026-01-09 11:59:25 +01:00
Andrei Chekun	594aedd6a5	test.py: do not print header for collection with test.py Skip printing the default pytest headers when printing list of the tests. Before: ``` $ ./test.py --mode=dev test/boost/sstable_conforms_to_mutation_source_test.cc --list Test session starts (platform: linux, Python 3.13.9, pytest 8.3.4, pytest-sugar 1.1.1) rootdir: /home/xtrey/projects/scylladb/test configfile: pytest.ini plugins: xdist-3.8.0, allure-pytest-2.15.0, sugar-1.1.1, anyio-4.8.0, asyncio-0.24.0, timeout-2.3.1 asyncio: mode=Mode.AUTO, default_loop_scope=session timeout: 24000.0s timeout method: signal timeout func_only: False session timeout: 24000.0s boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_tiny.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_medium.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_large.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_tiny.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_medium.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_large.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_tiny.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_medium.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_large.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_reversing_reader_random_schema.dev.1 Results (0.06s): ``` After: ``` $ ./test.py --mode=dev test/boost/sstable_conforms_to_mutation_source_test.cc --list boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_tiny.dev.1 
boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_medium.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_mc_large.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_tiny.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_medium.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_md_large.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_tiny.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_medium.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_conforms_to_mutation_source_ms_large.dev.1 boost/sstable_conforms_to_mutation_source_test.cc::test_sstable_reversing_reader_random_schema.dev.1 Results (0.06s): ```	2026-01-09 11:59:25 +01:00
Andrei Chekun	21a1ff3d5c	test.py: remove not supported functionality In its current state pytest does not support ordering of execution, so this parameter is removed. There is little need for it because pytest and test.py differ in what they count as a test: pytest runs test functions in the threads, while test.py executed test files in the threads. pytest's way is therefore more granular and fills the threads better. Remove skip node, since it is already added as a pytest mark for each test in the file. Remove pool_size, since it is not used by pytest at all. Pytest uses xdist to set the number of threads instead of the pool_size used by test.py	2026-01-09 11:59:25 +01:00
Andrei Chekun	e8c50a5ad4	test.py: switch off execution of several test directories by test.py runner With this commit test.py loses the ability to run tests by itself and always passes execution to pytest. NOTE: this is a breaking change. From this commit, several directories with tests will require a path to the file to launch the test. Affected directories test/alternator test/broadcast_tables test/cql test/cqlpy test/rest_api	2026-01-09 11:59:25 +01:00
Andrei Chekun	61d49525ad	test.py: integrate python tests to be executed with pytest runner With this commit test.py passes test execution to pytest. However, it is still able to run tests by itself. When given a test name like `broadcast_tables/test_broadcast_tables` it executes the test with the test.py runner, but when given a path to the file like `test/broadcast_tables/test_broadcast_tables.py` it passes execution to pytest. `--test-py-init` tells pytest to run the session in test.py-compatible mode Update the help text for the name parameter for test.py about changes how it works and which directory is served by pytest	2026-01-09 11:59:25 +01:00
Andrei Chekun	808b29885f	test.py: fix test/vector_search_validator to be able to run with pytest The build_mode fixture has dynamic scope that depends on how pytest is executed. When it is executed through test.py, the scope is session, and since that is broader than package everything works fine. With pure pytest it fails because build_mode has module scope. This fix allows running the tests with pure pytest, which is needed to migrate the tests to be executed by the pytest runner instead of test.py.	2026-01-09 11:59:25 +01:00
Andrei Chekun	8252de7b55	test.py: prepare base class for migration Since all tests share the same base class and some of the tests executed by test.py and some with pytest, we need to handle two cases where configuration is located: suite.yaml and test_config.yaml After full migration suite.yaml case will be removed	2026-01-09 11:59:25 +01:00
Andrei Chekun	48ff74b6b2	test.py: move environment preparation to one method These two methods are always called one after the other in both cases (when test.py executes a test and when pytest executes a test), so merge them into one. Additionally, set an environment variable to tell the underlying pytest process that the environment was already prepared and there is no need to clean directories or start additional services.	2026-01-09 11:59:25 +01:00
Andrei Chekun	e074e21490	test.py: introduce new environment variable TESTPY_PREPARED_ENVIRONMENT Introduce a new environment variable that will be used to signal to the pytest runner that the environment was already prepared by test.py. This is needed to be able to run the tests both with pytest and with test.py (which actually runs pytest underneath).	2026-01-09 11:59:25 +01:00
copilot-swe-agent[bot]	8c48b82b84	test_ssl: fix indentation	2026-01-09 10:27:17 +01:00
Piotr Smaron	2bcbebe92d	generic_server: improve logging broken TLS connection Previously we logged a broken TLS connection and then it was logged again later. Now, instead of logging, we construct an exception with a message extended with TLS info, which is caught later and logged with its full message.	2026-01-09 10:24:55 +01:00
Piotr Smaron	7016fc4835	test_ssl: improve timeout and readability 1. With this change the test really waits 10s, previously (in case something went wrong), the timeout could take way more than that. 2. Added `else` to above `if` to increase clarity of execution flow - it doesn't change logic, but makes it more clear.	2026-01-09 10:22:19 +01:00
Asias He	7ba7b25bdd	repair: Implement auto repair for tablet repair This patch implements the basic auto repair support for tablet repair. It was decided to add no per table configuration for the initial implementation, so two scylla yaml config options are introduced to set the default auto repair configs for all the tablet tables. - auto_repair_enabled_default Set true to enable auto repair for tablet tables by default. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. - auto_repair_threshold_default_in_seconds Set the default time in seconds for the auto repair threshold for tablet tables. If the time since last repair is bigger than the configured time, the tablet is eligible for auto repair. The value will be overridden by the per keyspace or per table configuration which is not implemented yet. The following metrics are added: - auto_repair_needs_repair_nr The number of tablets with auto repair enabled that need repair - auto_repair_enabled_nr The number of tablets with auto repair enabled The metrics are useful to tell if auto repair is falling behind. In the future, more auto repair scheduling will be added, e.g., scheduling based on the repaired and unrepaired sstable set size, tombstone ratio and so on, in addition to the time based scheduling. Fixes SCYLLADB-99	2026-01-09 16:11:39 +08:00
Botond Dénes	60570d7114	Merge 'topology coordinator: restrict node join/remove to preserve RF-rack validity' from Michael Litvak Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations: * Altering a keyspace's RF if it would make the keyspace RF-rack-invalid * Adding a node in a new rack * Removing / Decommissioning the last node in a rack Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views. The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid. Fixes scylladb/scylladb#23345 Fixes https://github.com/scylladb/scylladb/issues/26820 backport to relevant versions for materialized views with tablets since it depends on rf-rack validity Closes scylladb/scylladb#26354 * github.com:scylladb/scylladb: docs: update RF-rack restrictions cql3: don't apply RF-rack restrictions on vector indexes cql3: add warning when creating mv/index with tablets about rf-rack service/tablet_allocator: always allow tablet merge of tables with views locator: extend rf-rack validation for rack lists test: test rf-rack validity when creating keyspace during node ops locator: fix rf-rack validation during node join/remove test: test topology restrictions for views with tablets test: add test_topology_ops_with_rf_rack_valid topology coordinator: restrict node join/remove to preserve RF-rack validity topology coordinator: add validation to node remove locator: extend rf-rack validation functions view: change validate_view_keyspace to allow MVs if RF=Racks db: enforce rf-rack-validity for keyspaces with views replica/db: add enforce_rf_rack_validity_for_keyspace helper db: remove enforce parameter from check_rf_rack_validity test: adjust test 
to not break rf-rack validity	2026-01-09 10:01:23 +02:00
Patryk Jędrzejczak	eee2b6c7af	Merge 'tablets: Make balancing disabling RPC preempt tablet transitions' from Tomasz Grabiec Disabling of balancing waits for topology state machine to become idle, to guarantee that no migrations are happening or will happen after the call returns. But it doesn't interrupt the scheduler, which means the call can take arbitrary amount of time. It may wait for tablet repair to be finished, which can take many hours. We should do it via topology request, which will interrupt the tablet scheduler. Enabling of balancing can be immediate. Fixes https://github.com/scylladb/scylladb/issues/27647 Fixes #27210 Closes scylladb/scylladb#27736 * https://github.com/scylladb/scylladb: test: Verify that repair doesn't block disabling of tablet load balancing tablets: Make balancing disabling call preempt tablet transitions	2026-01-08 21:55:19 +02:00
Piotr Dulikowski	8e3e39a64a	Merge 'service/storage_service: update service levels cache after upgrade to v2' from Michał Jadwiszczak Service levels cache is empty after upgrade to consistent topology if no mutations are committed to `system.service_levels_v2` or rolling restart is not done. To fix the bug, this patch adds service levels cache reloading after upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`. Fixes [SCYLLADB-90](https://scylladb.atlassian.net/browse/SCYLLADB-90) This fix should be backported to all versions containing service levels on Raft. [SCYLLADB-90]: https://scylladb.atlassian.net/browse/SCYLLADB-90?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#27585 * github.com:scylladb/scylladb: service/storage_service: update service levels cache after upgrade to v2 service/storage_service: check if service levels were already upgraded before doing migration to raft	2026-01-08 21:55:19 +02:00
Michał Hudobski	e2e479f20d	auth: fix cdc vector search indexing permission bug VECTOR_SEARCH_INDEXING permission didn't work on cdc tables as we mistakenly checked for vector indexes on the cdc table instead of the base. This patch fixes that and adds a test that validates this behavior. Fixes: VECTOR-476 Closes scylladb/scylladb#28050	2026-01-08 21:55:19 +02:00
Ernest Zaslavsky	19fe630c0e	Update seastar submodule seastar 4dcd4df..dd46b6fe ``` dd46b6fe net: expose DNS TTL via net::hostent b94f81b0 test: Extend statat() test to check ENOENT exception reporting ``` Closes scylladb/scylladb#28006	2026-01-08 21:55:19 +02:00
Michael Litvak	8f15c7a874	db/view/view_update_generator: move discover_staging_sstables to start Call discover_staging_sstables in view_update_generator::start() instead of in the constructor, because the constructor is called during initialization before sstables are loaded. The initialization order was changed in `5d1f74b86a` and caused this regression. It means the view update generator won't discover staging sstables on startup and view updates won't be generated for them. It also causes issues in sstable cleanup. view_update_generator::start() is called in a later stage of the initialization, after sstable loading, so do the discovery of staging sstables there. Fixes scylladb/scylladb#27956 Closes scylladb/scylladb#27970	2026-01-08 21:55:19 +02:00
Botond Dénes	8c72dcc1ec	Merge 'database: truncate_table_on_all_shards: consider can_flush on all shards' from Benny Halevy Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them. This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged). Also, change database_test::do_with_some_data to use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`. Fixes #27639 * The issue has always existed and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions Closes scylladb/scylladb#27643 * github.com:scylladb/scylladb: test: database_test: do_with_some_data: randomize keys database: truncate_table_on_all_shards: drop outdated TODO comment database: truncate_table_on_all_shards: consider can_flush on all shards memtable_list: unify can_flush and may_flush test: database_test: add test_flush_empty_table_waits_on_outstanding_flush replica: table, storage_group, compaction_group: add needs_flush test: database_test: do_with_some_data_in_thread: accept void callback function	2026-01-08 21:55:19 +02:00
Avi Kivity	633e6e0037	build: update toolchain generation procedure for optimized clang Explain where to pick up existing clang archives, and how to upload new ones. Closes scylladb/scylladb#27690	2026-01-08 21:55:18 +02:00
Evgeniy Naydanov	a9da14be19	test: dtest: reproducer for parallel rebuild failure 2-DC cluster parallel non-RBNO rebuild failure when expanding RF in DC2. Steps to reproduce: 1. Provision a cluster with 2 datacenters and at least 2 nodes in the second datacenter. 2. Let’s assume datacenter names are "dc1" and "dc2". 3. Create a keyspace ("keyspace1") with RF=0 in dc2. 4. Populate some data into dc1. 5. Change keyspace1 replication in dc2 to 2. 6. On 2 nodes in dc2 run the following command in parallel: nodetool rebuild --source-dc dc1 Parallel execution of rebuilds is not possible with RBNO enabled. This test is the repro for #27804 Closes scylladb/scylladb#27747	2026-01-08 21:55:18 +02:00
Botond Dénes	9b4a7f1d14	Merge 'test: cluster: object_store: test_backup: modernize do_abort_restore' from Benny Halevy Currently the function uses a regular expression to check the system log for a specific message. This is tangential to the ability to cleanly abort the restore task, plus the regular expression has a syntax error: ``` test/cluster/object_store/test_backup.py:534 /home/bhalevy/dev/scylla/test/cluster/object_store/test_backup.py:534: SyntaxWarning: "$" is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\("? A raw string is also an option. await wait_for_first_completed([l.wait_for("Failed to handle STREAM_MUTATION_FRAGMENTS \(receive and distribute phase$ for .+: Streaming aborted", timeout=10) for l in logs]) ``` This change modernizes the implementation by: - using auto_dc_rack for manager.servers_add - using new_test_keyspace to generate and auto delete the keyspace - using async gatherio and a prepared statement to insert the data - simplifying the keys and values by NOT using os.urandom (that is notoriously slow) - inserting fewer keys in debug mode - removing the log check With that, the test can be reenabled in all modes. * No backport needed since the test was disabled Closes scylladb/scylladb#27892 * github.com:scylladb/scylladb: test_backup: do_abort_restore: reduce data footprint test_backup: do_abort_restore: use error injection test_backup: do_abort_restore: use asyncio for cql test_backup: do_abort_restore: use new_test_keyspace test_backup: do_abort_restore: use logger rather than print test_backup: do_abort_restore: pass auto_rack_dc to servers_add	2026-01-08 21:55:18 +02:00
Anna Stuchlik	f614482e66	doc: add the patch release upgrade procedure for version 2025.4 Adds the patch upgrade guide based on previous upgrade guides. Fixes https://github.com/scylladb/scylladb/issues/27982 Closes scylladb/scylladb#27985	2026-01-08 21:55:18 +02:00
Asias He	4f77dd058d	repair: Add tablet repair progress report support This patch adds tablet repair progress report support so that the user could use the /task_manager/task_status API to query the progress. In order to support this, a new system table is introduced to record the user request related info, i.e, start of the request and end of the request. The progress is accurate when tablet split or merge happens in the middle of the request, since the tokens of the tablet are recorded when the request is started and when repair of each tablet is finished. The original tablet repair is considered as finished when the finished ranges cover the original tablet token ranges. After this patch, the /task_manager/task_status API will report correct progress_total and progress_completed. Fixes #22564 Fixes #26896 Closes scylladb/scylladb#27679	2026-01-08 21:55:18 +02:00
Anna Stuchlik	3f1c7c70f5	doc: remove the link to the Download Center ... from the OS support page. Fixes https://github.com/scylladb/scylladb/issues/28047 Closes scylladb/scylladb#28048	2026-01-08 21:55:18 +02:00
Asias He	0aabf51380	repair: Fix sstable_list_to_mark_as_repaired with multishard writer It was observed: ``` test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to segfault. backtrace pointed to a failure when allocating an object from the chain of freed objects, which indicates memory corruption. (gdb) bt at ./seastar/include/seastar/core/shared_ptr.hh:275 at ./seastar/include/seastar/core/shared_ptr.hh:430 Usual suspect is use-after-free, so ran the reproducer in the sanitize mode, which indicated shared ptr was being copied into another cpu through the multi shard writer: seastar - shared_ptr accessed on non-owner cpu, at: ... -------- seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer... ``` The multishard writer itself was fine, the problem was in the streaming consumer for repair copying a shared ptr. It could work fine with same smp setting, since there will be only 1 shard in the consumer path, from rpc handler all the way to the consumer. But with mixed smp setting, the ptr would be copied into the cpus involved, and since the shared ptr is not cpu safe, the refcount change can go wrong, causing double free, use-after-free. To fix, we pass a generic incremental repair handler to the streaming consumer. The handler is safe to be copied to different shards. It will be a no op if incremental repair is not enabled or on a different shard. A reproducer test is added. The test could reproduce the crash consistently before the fix and work well after the fix. Fixes #27666 Closes scylladb/scylladb#27870	2026-01-08 21:55:18 +02:00
Radosław Cybulski	5f48ab3875	storage_proxy: fix invalid assert Change invalid `assert(true)` into `SCYLLA_ASSERT(false)`, as the latter was clearly meant. Closes scylladb/scylladb#27900	2026-01-08 21:55:18 +02:00
Andrei Chekun	c950c2e582	test.py: convert skip_mode function to pytest.mark Function skip_mode works only on functions and only in cluster tests. This is OK when we need to skip one test, but it's not possible to use it with pytestmark to automatically mark all tests in the file. The goal of this PR is to migrate skip_mode to be a dynamic pytest.mark that can be used as an ordinary mark. Closes scylladb/scylladb#27853 [avi: apply to test/cluster/test_tablets.py::test_table_creation_wakes_up_balancer]	2026-01-08 21:55:16 +02:00
Tomasz Grabiec	a52de4ecdc	test: cluster: test_topology_ops[_encrypted]: Fix failures due to background migrations fencing out writes The test is flaky, with failures in: for server in servers: > await check_node_log_for_failed_mutations(manager, server) test/cluster/test_topology_ops_encrypted.py:84: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ manager = <test.pylib.manager_client.ManagerClient object at 0xffff602e8590> server = ServerInfo(server_id=1769, ip_addr='127.82.127.43', rpc_address='127.82.127.43', datacenter='DEFAULT_DC', rack='DEFAULT_RACK', pid=186578) async def check_node_log_for_failed_mutations(manager: ManagerClient, server: ServerInfo): logging.info(f"Checking that node {server} had no failed mutations") log = await manager.server_open_log(server.server_id) occurrences = await log.grep(expr="Failed to apply mutation from", filter_expr="(TRACE\|DEBUG\|INFO)") > assert len(occurrences) == 0 E AssertionError test/cluster/util.py:319: AssertionError As diagnosed by Gleb in https://github.com/scylladb/scylladb/issues/27942#issuecomment-3710013625: "The fencing errors here look legit given that we do not wait for all requests to complete while shutting down the storage proxy. The scenario is this: Test does writes to rf=3 keyspace with cl=one. One node is shutting down while there is a tablet migration. Tablet migration executes barrier and drain which fails on a node that is been shutdown. The topology coordinator proceeds fencing the old topology, but there still can be un-handled mutation requests from the shutting down node on other nodes and they will generate fencing errors like they should. 
They way to avoid it (though it is benign) is to wait for all outgoing storage proxy requests to complete during shutdown, but even then the error may still happen since a request may timeout before it is processed by the other side, so it may be completed by a storage proxy coordinator side, but still not handled by replica side. This what we have fencing for in the first place." Fix by disabling background tablet migrations, so that we have no topology barriers concurrent with node shutdown. Fixes #27942 Closes scylladb/scylladb#28034	2026-01-08 21:53:47 +02:00
Tomasz Grabiec	34df158605	test: cluster: Fix NoHostAvailable error in test_not_enough_token_owners The driver must see server_c before we stop server_a, otherwise there will be no live host in the pool when we attempt to drop the keyspace: ``` @pytest.mark.asyncio async def test_not_enough_token_owners(manager: ManagerClient): """ Test that: - the first node in the cluster cannot be a zero-token node - removenode and decommission of the only token owner fail in the presence of zero-token nodes - removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token owners would fall below the RF of some keyspace using tablets """ logging.info('Trying to add a zero-token server as the first server in the cluster') await manager.server_add(config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}, expected_error='Cannot start the first node in the cluster as zero-token') logging.info('Adding the first server') server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"}) logging.info('Adding two zero-token servers') # The second server is needed only to preserve the Raft majority. 
server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0] logging.info(f'Trying to decommission the only token owner {server_a}') await manager.decommission_node(server_a.server_id, expected_error='Cannot decommission the last token-owning node in the cluster') logging.info(f'Stopping {server_a}') await manager.server_stop_gracefully(server_a.server_id) logging.info(f'Trying to remove the only token owner {server_a} by {server_b}') await manager.remove_node(server_b.server_id, server_a.server_id, expected_error='cannot be removed because it is the last token-owning node in the cluster') logging.info(f'Starting {server_a}') await manager.server_start(server_a.server_id) logging.info('Adding a normal server') await manager.server_add(property_file={"dc": "dc1", "rack": "r2"}) cql = manager.get_cql() await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60) > async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name: test/cluster/test_not_enough_token_owners.py:57: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ /usr/lib64/python3.14/contextlib.py:221: in __aexit__ await anext(self.gen) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830> opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }" host = None @asynccontextmanager async def new_test_keyspace(manager: ManagerClient, opts, host=None): """ A utility function for creating a new temporary keyspace with given options. 
It can be used in a "async with", as: async with new_test_keyspace(ManagerClient, '...') as keyspace: """ keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host) try: yield keyspace except: logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation") raise else: > await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host) E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')}) test/cluster/util.py:544: NoHostAvailable ``` Fixes #28011 Closes scylladb/scylladb#28040	2026-01-08 21:53:47 +02:00
Andrei Chekun	ee0bf35615	test.py: add custom exit code for pytest in case maxfail reached This PR adds a custom exit code for the case when maxfail is reached. This is needed to make it easier to detect why pytest failed in CI. Closes scylladb/scylladb#28018	2026-01-08 21:53:47 +02:00
Anna Stuchlik	1b653166f1	doc: add the version variable to the Web Installer instructions This commit replaces a fixed version name with the variable for the current version in the instructions for installing a non-default version with Web Installer. This will make using the installer more user-friendly. Fixes https://github.com/scylladb/scylladb/issues/28005	2026-01-08 10:12:21 +01:00
Benny Halevy	ebd667a8e0	test: database_test: do_with_some_data: randomize keys With randomized keys, and since we're inserting only 2 keys, it is possible that they would end up owned only by a single shard, reproducing #27639 in snapshot_list_contains_dropped_tables. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:46 +02:00
Benny Halevy	93b827c185	database: truncate_table_on_all_shards: drop outdated TODO comment The comment was added in `83323e155e`. Since then, table::seal_active_memtable was improved to guarantee waiting on outstanding flushes on success (see `d55a2ac762`), so we can remove this TODO comment (it is also not covered by any issue, so nobody is planning to ever work on it). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:46 +02:00
Benny Halevy	2a803d2261	database: truncate_table_on_all_shards: consider can_flush on all shards can_flush might return a different value for each shard so check it right before deciding whether to flush or clear a memtable shard. Note that under normal condition can_flush would always return true now that it checks only the presence of the seal memtable function rather than check memtable_list::empty(). Fixes #27639 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:46 +02:00
Benny Halevy	02ee341a03	memtable_list: unify can_flush and may_flush Now that we have a unit test proving that it's safe to flush an empty memtable list there is no need to distinguish between may_flush and can_flush. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:46 +02:00
Benny Halevy	0342a24ee0	test: database_test: add test_flush_empty_table_waits_on_outstanding_flush Test that table::flush waits on outstanding flushes, even if the active memtable is empty Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:49:45 +02:00
Benny Halevy	5be6b80936	replica: table, storage_group, compaction_group: add needs_flush Table needs flush if not all its memtable lists are empty. To be used in the next patch for a unit test. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:41:22 +02:00
Benny Halevy	ec4069246d	test: database_test: do_with_some_data_in_thread: accept void callback function Many test cases already assume `func` is being called in a seastar thread, and although the function they pass returns a (ready) future, it serves no purpose other than to conform to the interface. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-08 09:41:22 +02:00
Patryk Jędrzejczak	946a2bb988	storage_service: do not call raft_topology_update_ip for left nodes This `raft_topology_update_ip` call always returns after `t.find(raft_id)` returns `nullptr`, so it effectively does nothing. It's not a bug, since there is no reason to update `system.peers` for left nodes anyway. We delete the rows corresponding to left nodes in `process_left_node` (called just above). Closes scylladb/scylladb#27899	2026-01-07 16:52:13 +01:00
Michał Jadwiszczak	be16e42cb0	service/storage_service: update service levels cache after upgrade to v2 Service levels cache is empty after upgrade to consistent topology if no mutations are committed to `system.service_levels_v2` or a rolling restart is not done. To fix the bug, this commit adds service levels cache reloading after upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`. Fixes SCYLLADB-90	2026-01-07 14:06:13 +01:00
Michał Jadwiszczak	53d0a2b5dc	service/storage_service: check if service levels were already upgraded before doing migration to raft There is no need to call `service_level_controller::upgrade_to_v2()` on every topology state load, we only need to do it once.	2026-01-07 14:06:13 +01:00
Nadav Har'El	f7eae50d98	test/alternator: fix some expected error messages to fit DynamoDB All tests I am fixing in this patch do pass for me on DynamoDB, but other developers report that they fail because some DynamoDB servers apparently use slightly different error messages, with less detail about the cause of an error. For example, some of our tests currently expect an error message that looks like: An error occurred (ValidationException) when calling the Query operation: Invalid operator used in KeyConditionExpression: attribute_exists But some servers don't report the ": attribute_exists" at the end, so we can't use the word "attribute_exists" in the test to recognize the correct error, and need to use a different word (which both versions of DynamoDB and Alternator all print). As another example, the good old DynamoDB error: An error occurred (ValidationException) when calling the Query operation: 1 validation error detected: Value 'DOG' at 'conditionalOperator' failed to satisfy constraint: Member must satisfy enum value set: [OR, AND] Got replaced by the following less informative message: An error occurred (ValidationException) when calling the Query operation: Failed to satisfy constraint: Member must satisfy enum value set: [ALL, OR]' So we need to fix the test to allow it too. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-01-07 14:06:33 +02:00
Nadav Har'El	e97fbc2d65	test/alternator: fix compressed request test on non-us-east1 The test test_compressed_request.py::test_compressed_request coerces boto3 to send a compressed request, and wrongly used region_name=us-east-1 to set up the connection. Theoretically, this doesn't matter because we also set the correct URL (for either Alternator or the desired region in AWS). But in fact it does matter, because region name is part of the request's signature, and DynamoDB refuses the request if it comes to a different region than it is signed for. So this test fails when run on DynamoDB on any other region except us-east-1. The fix is simple - don't use the constant "us-east-1", but pick up the correct region name from the original connection. The functions new_dynamodb_session(), new_dynamodb() and new_dynamodb_stremas() had the same bug and we fix it too, but it didn't break any test because the only tests using these functions were Scylla-only so the AWS region problem didn't apply to them.	2026-01-07 13:33:46 +02:00
Patryk Jędrzejczak	f0d159abb0	Merge 'test/raft: use valid sentinel in liveness check to prevent digest errors' from Emil Maskovsky Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants. The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value. By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test. The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry. Fixes: scylladb/scylladb#27307 Backporting to active branches - this is a test-only fix (low risk) for a flaky test that exists in older branches (thus affects the CI of active branches). Closes scylladb/scylladb#28010 * https://github.com/scylladb/scylladb: test/raft: use valid sentinel in liveness check to prevent digest errors test/raft: improve debugging in randomized_nemesis_test	2026-01-07 12:31:21 +01:00
Nadav Har'El	2c02e463ff	test/alternator: fix test's expected error message on DynamoDB The Alternator test test_tag.py::test_tag_lsi_gsi expects to see an error - it's not allowed to set a tag on a GSI or LSI - but the error message that DynamoDB prints recently changed - instead of saying "ResourceArn" the new error message says "resource arn". Change the test to allow both forms, so it will pass on both Alternator (which still uses the word ResourceArn - which is the name of the parameter) and on DynamoDB (which uses "resource arn"). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-01-07 12:51:10 +02:00
Nadav Har'El	4f3150c282	test/alternator: mark Alternator-only test scylla_only The test test_batch.py::test_batch_write_item_large_broken_connection failed on DynamoDB (Refs #26079). It turns out this test has many problems: 1. This test wrongly assumes a batch write needs to complete in one attempt - and this fails on DynamoDB with low WCU capacity where the batch needs to be resumed in multiple requests. Using boto3's batch_writer() fixes this problem. 2. This test has NOTHING to do with batches - so is mis-named and mis-placed. The batch write is just a way to prepare some data in the table, and the real test is about Query'ing the data back and observing the long response and reproducing issue #14454. I did not rename or move the test, but left a comment explaining the situation. 3. This test is written to assume the Query's response uses HTTP chunked encoding. Which isn't actually true for DynamoDB, at least not at the time of this writing. So the test fails on DynamoDB. For the last reason, I made this test scylla_only. This test can't really be run on DynamoDB without rewriting it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-01-07 12:51:10 +02:00
Nadav Har'El	df6b347911	test/alternator: fix test on DynamoDB The test test_batch.py::test_batch_write_item_large often fails when running on DynamoDB, and this patch fixes it. The test checks that a large but not over-the-limits large batch works. However, "works" only means that the batch is not an error - it doesn't guarantee that all the items in the batch are performed. If the WCU limits of the table are exceeded DynamoDB may perform only part of the batch and return the remaining items as UnprocessedItems. This not only can happen, it usually does happen on DynamoDB - because a new on-demand-billing table always starts with a very low WCU capacity. So in this patch we update the test to recognize and perform the UnprocessedItems, instead of assuming it needs to be empty. The test continues to pass on Alternator, and finally passes on DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-01-07 12:51:10 +02:00
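The UnprocessedItems handling described in df6b347911 can be sketched as a resume loop. This is a minimal illustration and not the code from the patch: `batch_write_all`, `max_attempts` and `base_delay` are hypothetical names, and `client` is assumed to be any object exposing a boto3-style `batch_write_item(RequestItems=...)` call that returns a dict with an `UnprocessedItems` entry.

```python
import time


def batch_write_all(client, request_items, max_attempts=10, base_delay=0.05):
    """Resubmit a batch until the server reports no UnprocessedItems.

    Returns the number of requests that were needed. All names here are
    illustrative; only the batch_write_item call shape follows boto3.
    """
    pending = request_items
    for attempt in range(max_attempts):
        response = client.batch_write_item(RequestItems=pending)
        # UnprocessedItems has the same shape as RequestItems, so it can be
        # passed back unchanged in the next request.
        pending = response.get("UnprocessedItems") or {}
        if not pending:
            return attempt + 1
        # Back off before resuming, since throttling is the usual cause.
        time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError("batch still has unprocessed items after retries")
```

A successful response with leftover items is treated as "resume", not as an error, which is the distinction the commit message draws.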
Nadav Har'El	9d6a463324	test/alternator: increase wait_for_gsi() timeout In Alternator tests, the wait_for_gsi() utility function is used in tests that add a GSI to an existing table, to wait for this new GSI to become ready. Although this takes a fraction of a second on Alternator, we noticed that this takes many minutes (!) on DynamoDB so we used an absurdly high 10 minute timeout to allow tests to also pass on DynamoDB. But it turns out that 10 minutes wasn't absurdly high enough, and tests using it in test_gsi_updatetable.py started to fail on DynamoDB. Empirically, 10 minutes was enough in the past but it seems that today adding a GSI to an empty table routinely takes as much as 20 minutes. So this patch increases the wait_for_gsi() timeout to a whopping 30 minutes. After this patch, the tests in test_gsi_updatetable.py which used to fail - test_gsi_backfill_with_lsi, test_gsi_backfill_with_real_column, test_gsi_creates_and_deletes and test_gsi_backfill_oversized_key now all pass on DynamoDB - but each takes more than 20 minutes to pass. To allow the test to fail much more quickly on Alternator (where creating a GSI takes a fraction of a second), we set a much lower but still very high timeout when running on Alternator - 60 seconds. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-01-07 12:50:54 +02:00
Łukasz Paszkowski	62313a6264	load_sketch: Allow populating load_sketch with normalized current load Currently, tablet allocation intentionally ignores current load ( introduced by the commit #1e407ab) which could cause identical shard selection when allocating a small number of tablets in the same topology. When a tablet allocator is asked to allocate N tablets (where N is smaller than the number of shards on a node), it selects the first N lowest shards. If multiple such tables are created, each allocator run picks the same shards, leading to tablet imbalance across shards. This change initializes the load sketch with the current shard load, scaled into the [0,1] range, ensuring allocation still remains even while starting from globally least-loaded shards. Fixes https://github.com/scylladb/scylladb/issues/27620 Closes scylladb/scylladb#27802	2026-01-07 11:49:01 +01:00
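One way to picture the effect of 62313a6264 is the sketch below. It is an assumption-laden illustration, not the load_sketch implementation: the function name `allocate_tablets`, the choice of dividing by the peak load, and the cost of 1 per new tablet are all hypothetical. The point is only that seeding with a load scaled into [0,1] changes which shards are picked first, while every shard still receives one new tablet before any shard receives a second.

```python
import heapq


def allocate_tablets(current_load, n):
    """Pick shards for n new tablets, seeded with normalized current load.

    current_load[i] is the existing load of shard i (hypothetical units).
    """
    peak = max(current_load) or 1  # avoid dividing by zero on an idle node
    # Seed each shard with its load scaled into [0, 1].
    heap = [(load / peak, shard) for shard, load in enumerate(current_load)]
    heapq.heapify(heap)
    chosen = []
    for _ in range(n):
        load, shard = heapq.heappop(heap)
        chosen.append(shard)
        # Each new tablet costs 1, which outweighs any seed value in [0, 1].
        heapq.heappush(heap, (load + 1, shard))
    return chosen
```

With all-zero seeds this degenerates to picking the first N shards every time, which is the imbalance the commit describes.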
Avi Kivity	2642636ada	build: avoid ccache masquerading when choosing ccache too In `12dcf79c60`, we avoid the ccache masquerade directory when choosing sccache, as that would give us a double-caching effect: first sccache is called, then clang++ is looked up finding ccache masquerading as clang++. We solved that by converting the name clang++ to the absolute path /usr/bin/clang++ (or whatever), skipping over the masquerade directory in $PATH. It turns out that we need to do the same for ccache. That commit changed the compile command to 'ccache clang++', and ccache will look up clang++ in $PATH, finding itself in the masquerade directory. Fix that by avoiding the masquerade directory if a compiler cache is specified explicitly or is found with --compiler-cache=auto. Closes scylladb/scylladb#27996	2026-01-06 17:47:09 +02:00
Nadav Har'El	5f79d93102	Merge 'Alternator response compression' from Szymon Malewski This pull request introduces HTTP response compression to Alternator, allowing responses (both string and chunked) to be compressed using `gzip` or `deflate` when requested by clients and when the response size exceeds configurable thresholds. * Added new source files `http_compression.cc` and `http_compression.hh` implementing compression logic, including parsing client `Accept-Encoding` headers, selecting compression algorithms, and compressing response bodies using zlib. * Added two new configuration options to `db::config` (`alternator_response_gzip_compression_level` and `alternator_response_gzip_compression_threshold_in_bytes`) to control compression level (and optionally disable compression with level 0 - no compression) and minimum response size for compression. * Added tests showing compliance with DynamoDB behavior. Fixes #27246 New feature - no backporting Closes scylladb/scylladb#27454 * github.com:scylladb/scylladb: alternator/http_compression: Add compression of streamed response alternator/http_compression: Add implementation of gzip/deflate of string response alternator/http_compression: Add handling of Accept-Encoding header test/alternator: add tests for compressed responses	2026-01-06 16:47:11 +02:00
Emil Maskovsky	4ba3e90f33	test/raft: use valid sentinel in liveness check to prevent digest errors Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants. The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value. By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test. The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry. Fixes: scylladb/scylladb#27307	2026-01-06 14:34:02 +01:00
Emil Maskovsky	3af5183633	test/raft: improve debugging in randomized_nemesis_test Move the post-condition check before the assertion to ensure it is always executed first. Before, the wrong value could be passed to the digest_remove assertion, making the pre-check trigger there instead of the post-check as expected. Also, add a check in the append_seq constructor to ensure that the digest value is valid when creating an append_seq object.	2026-01-06 14:32:46 +01:00
Ferenc Szili	a51cb3dad9	test: fix flaky test_update_load_stats_after_migration Disable load balancing to avoid the balancer moving the tablet from a node with less to a node with more available disk space. Otherwise, the move_tablet API can fail (if the tablet is already in transition) or be a no-op (in case the tablet has already been migrated). Fixes: #27980 Closes scylladb/scylladb#27993	2026-01-06 11:57:35 +02:00
Andrei Chekun	b546315edf	test.py: fix race condition in initialization of cqlpy tests Fix the race condition where the process finishes while the test is trying to check its descriptors. Now, instead of failing the whole loop, it continues iterating over the remaining processes to find the needed process. Closes scylladb/scylladb#27994	2026-01-06 10:40:25 +02:00
Avi Kivity	4c9c3aae23	tools: toolchain: add dockerfile for future toolchain To avoid surprises when libstdc++, clang, or other components in the toolchain introduce regressions, we introduce a "future toolchain". This builds on the Fedora version under active development, and the development branches of gcc and llvm. The future toolchain is not intended to be frozen. Rather, periodically we will build the future toolchain, then build ScyllaDB and run its unit tests under that toolchain, then discard it. Any problems will then have to be tracked down by a developer and either reported to the source repository, or fixed in ScyllaDB. Closes scylladb/scylladb#27964	2026-01-05 19:38:58 +02:00
Nadav Har'El	384e394ff0	Merge 'Add similarity functions to calculate similarity of given vectors' from Dawid Pawlik It should be possible to return the similarity of vectors in CQL statements following the [Cassandra compatible syntax](https://cassandra.apache.org/doc/latest/cassandra/getting-started/vector-search-quickstart.html#query-vector-data-with-cql): ``` SELECT comment, similarity_cosine(comment_vector, [0.1, 0.15, 0.3, 0.12, 0.05]) FROM cycling.comments_vs; ``` Although the calculations are slow, and we already have calculated results returned via Vector Store API, we need the functionality as it allows us to calculate similarity of vectors not stored in vector indexes. It will be needed for [quantization and rescoring](https://scylladb.atlassian.net/wiki/spaces/RND/pages/195985800/Quantization+and+Rescoring). The feature is also a nice-to-have in testing as requested many times by testing and CX teams. The optimized version utilizing already calculated distances from Vector Store without a need of rescoring will be coming soon after via https://github.com/scylladb/scylladb/pull/27991. --- The patch adds functions: - `similarity_cosine(<vector>, <vector>)`, - `similarity_euclidean(<vector>, <vector>)`, - `similarity_dot_product(<vector>, <vector>)` Where `<vector>` is either a column of type `VECTOR<FLOAT, N>` or a vector of floats literal. These functions can be called with every `SELECT` query, not only ANN vector queries as opposed to https://github.com/scylladb/scylladb/pull/25993. The similarity calculations are implemented inspired by [USearch's implementation]( `a2f1759910/include/usearch/index_plugins.hpp (L1304-L1385)`) and made compatible with [Cassandra's documentation](https://cassandra.apache.org/doc/5.0/cassandra/developing/cql/functions.html#vector-similarity-functions). That would guarantee the results in ScyllaDB are calculated using the exact same algorithms as used in Vector Store indexes. 
--- Fixes: SCYLLADB-88 Fixes: SCYLLADB-89 New feature, should land into 2026.1 Closes scylladb/scylladb#27524 * github.com:scylladb/scylladb: docs: add vector similarity functions documentation test/cqlpy: add similarity functions correctness tests test/cqlpy: add similarity functions invalid call tests cql3: introduce similarity functions syntax vector_similarity_fcts: introduce similarity functions vector_similarity_fcts: retrieve similarity function argument types vector_similarity_fcts: add calculating similarity between vectors	2026-01-05 18:28:10 +02:00
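For readers unfamiliar with the three functions added in 384e394ff0, the sketch below shows one plausible set of definitions in plain Python. The score mappings used here ((1 + cos) / 2 for cosine, 1 / (1 + d²) for Euclidean, (1 + dot) / 2 for dot product) are an assumption based on the mappings commonly used for Cassandra-compatible similarity scores; they are not taken from the patch, and the helper `_dot` is hypothetical. Consult the linked Cassandra documentation for the authoritative definitions.

```python
import math


def _dot(a, b):
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def similarity_cosine(a, b):
    # Assumed mapping: cosine in [-1, 1] rescaled to a score in [0, 1].
    norm = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    if norm == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return (1 + _dot(a, b) / norm) / 2


def similarity_euclidean(a, b):
    # Assumed mapping: squared distance d2 turned into a score in (0, 1].
    diff = [x - y for x, y in zip(a, b)]
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return 1 / (1 + _dot(diff, diff))


def similarity_dot_product(a, b):
    # Assumed mapping: meaningful as a [0, 1] score for unit-length vectors.
    return (1 + _dot(a, b)) / 2
```

Under these assumed mappings, identical unit vectors score 1.0 with all three functions, and orthogonal unit vectors score 0.5 with the cosine and dot-product variants.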
Tomasz Grabiec	ffa11d6a2d	test: Verify that repair doesn't block disabling of tablet load balancing Refs #27647	2026-01-05 13:22:15 +01:00
Tomasz Grabiec	ccdb301731	tablets: Make balancing disabling call preempt tablet transitions This patch modifies the RESTful API handler which disables tablet balancing to use a topology request to wait for already running tablet transitions. Before, it was just waiting for topology to be idle, so it could wait much longer than necessary, also for operations which are not affected by the flag, like repair. And repair can take hours. New request type is introduced for this synchronization: noop_request. It will preempt the tablet scheduler, and when the request executes, we know all later tablet transitions will respect the "balancing disabled" flag, and only things which are unaffected by the flag, like repair, will be scheduled. Fixes #27647	2026-01-05 13:22:08 +01:00
Nadav Har'El	5c2ca56adf	test/alternator: fix test passing a spurious parameter The test test_streams.py::test_streams_putitem_new_item_overrides_old_lsi failed on DynamoDB (Refs #26079) because we passed an unused parameter NonKeyAttributes to the Projection setting an LSI. NonKeyAttributes is only allowed when ProjectionType=INCLUDE, but we used ProjectionType=ALL. DynamoDB refuses to create an LSI with such inconsistent parameters, and we just need to remove this unnecessary parameter from this test. The reason why this test didn't fail on Alternator is that Alternator doesn't yet support or even parse the Projection parameter (Refs #5036). We also add an xfailing test (passes on DynamoDB, fails on Alternator) checking that a spurious NonKeyAttributes parameter is rejected. When we get around to implement the projection feature (#5036), this will be yet another acceptance test for this feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2026-01-05 13:51:01 +02:00
Botond Dénes	e4da0afb8d	reader_concurrency_semaphore: add protection against negative count resource leaks The semaphore has detection and protection against regular resource leaks, where some resources go unaccounted for and are not released by the time the semaphore is destroyed. There is no detection or protection against negative leaks, where resources are "made up" out of thin air. This kind of leak looks benign at first sight: a few extra resources won't hurt anyone so long as this is a small amount. But it turns out that even a single extra count resource can defeat a very important anti-deadlock protection in can_admit_read(): the special case which admits a new permit regardless of memory resources, when all original count resources are available. This check uses ==, so if resource > original, the protection is defeated indefinitely. Instead of just changing == to >=, we add detection of such negative leaks to signal(), via on_internal_error_noexcept(). At this time I still don't know how this negative leak happens (the code doesn't confess); with this detection, hopefully we'll get a clue from tests or the field. Note that on_internal_error_noexcept() will not generate a coredump, unless ScyllaDB is explicitly configured to do so. In production, it will just generate an error log with a backtrace. The detection also clamps the _resources to _initial_resources, to prevent any damage from the negative leak. I just noticed that there is no unit test for the deadlock protection described above, so one is added in this PR, even if only loosely related to the rest of the patch. Fixes: SCYLLADB-163 Closes scylladb/scylladb#27764	2026-01-05 12:45:15 +02:00
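The detect-and-clamp behaviour described in e4da0afb8d can be modelled with a toy counter. This is a simplified sketch, not the reader_concurrency_semaphore code: the class `CountSemaphore` and its members are hypothetical, memory resources are ignored, and a plain counter stands in for on_internal_error_noexcept().

```python
class CountSemaphore:
    """Toy model of count resources with negative-leak detection."""

    def __init__(self, initial):
        self.initial = initial
        self.available = initial
        self.negative_leaks = 0  # stand-in for on_internal_error_noexcept()

    def consume(self, n=1):
        self.available -= n

    def signal(self, n=1):
        self.available += n
        if self.available > self.initial:
            # More resources returned than were ever handed out: report it
            # and clamp, so the equality check below keeps working.
            self.negative_leaks += 1
            self.available = self.initial

    def all_count_available(self):
        # Mirrors the == check that the anti-deadlock special case relies on.
        return self.available == self.initial
```

Without the clamp, a single spurious `signal()` would leave `available` above `initial` and `all_count_available()` would stay false indefinitely, which is the failure mode the commit message describes.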
Anna Stuchlik	375479d96c	doc: fix the syntax of internal links Some internal links had the wrong syntax: they were formatted as external links. As a result, they redirected the user to the outdated Open Source documentation. This commit fixes that bug. Fixes https://github.com/scylladb/scylladb/issues/25899 Closes scylladb/scylladb#27905	2026-01-05 10:44:58 +01:00
Szymon Malewski	1f658bb2e2	alternator/http_compression: Add compression of streamed response This patch adds compression of chunked responses. It adds intermediate stream to compress chunks of data that are provided to http sink. Fixes #27246	2026-01-05 10:14:42 +01:00
Szymon Malewski	b8afb173a6	alternator/http_compression: Add implementation of gzip/deflate of string response Previous commit added means to decide whether client asks for compression and with which algorithm. This patch adds actual compression of responses based on zlib library. For now only string (not chunked) responses are compressed. Several previously defined tests start to pass.	2026-01-05 10:14:42 +01:00
Szymon Malewski	ec329f85b0	alternator/http_compression: Add handling of Accept-Encoding header This is an initial patch to add support of Alternator's compressed responses. The actual compression (gzip, deflate) will be added in the following commits. The main functionality added in this commit is parsing of the Accept-Encoding header, which indicates the compression algorithms supported by the client. In this commit we also add configuration parameters of response gzip/deflate compression. They allow enabling or disabling compression, and setting the level and a size threshold below which a response is not compressed. With the current implementation it is possible to decide on compression for each response, but it is not used yet.	2026-01-05 10:14:40 +01:00
Szymon Malewski	08386ea959	test/alternator: add tests for compressed responses Adds a set of tests that: 1. Show how DynamoDB handles response compression. It supports 'gzip' and 'deflate' compression, which can be selected by providing the `Accept-Encoding` header. It only encodes responses above 4096B. - `test_compressed_response`, `test_compressed_response_large` show compression for various response sizes. - `test_accept_encoding_header` focuses on testing various values of Accept-Encoding header. - `test_multiple_accept_encoding_headers` verifies behaviour with repeated Accept-Encoding headers. 2. Will confirm implementation of response compression in Alternator (#27246) In addition to the above tests, we check Alternator-specific expectations: - `test_chunked_response_compression` makes sure that compression will work also for chunked responses. - `test_set_compression_options` checks config options to set response size threshold for compression and compression level 3. `test_signature_trims_accept_encoding_spaces` reveals Alternator's bug in signature verification (#27775)	2026-01-05 10:13:40 +01:00
Avi Kivity	0df85c8ae8	Revert "Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov" This reverts commit `1bb897c7ca`, reversing changes made to `954f2cbd2f`. It makes incompatible changes to the object storage configuration format, breaking tests [1]. It's likely that it doesn't break any production configuration, but we can't be sure. Fixes #27966 Closes scylladb/scylladb#27969	2026-01-05 08:53:41 +02:00
Benny Halevy	a8114f9bcc	test_backup: do_abort_restore: reduce data footprint To make the test fast, in particular in debug mode insert fewer keys and do not rely on os.urandom which is notoriously slow Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-05 08:04:08 +02:00
Benny Halevy	c0dd662144	test_backup: do_abort_restore: use error injection Currently the test depends on timing and enough inserted data to abort the restore tasks at exactly the right time. This is flaky in nature, so instead, use error injection to synchronize the abort with mutation streaming. Note that with that we no longer get the STREAM_MUTATION_FRAGMENTS log message, so waiting for it is dropped from the test. The most important thing is that some restore tasks must fail. (We cannot guarantee all would fail unfortunately) Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-05 08:03:53 +02:00
Benny Halevy	16dd07c7d4	test_backup: do_abort_restore: use asyncio for cql Use the more modern asyncio facility to run cql queries and a prepared statement to insert data into the table. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-05 08:03:53 +02:00
Benny Halevy	f1a583c39c	test_backup: do_abort_restore: use new_test_keyspace For creating a keyspace with a unique name and auto-deleting it on exit. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-05 08:03:53 +02:00
Benny Halevy	3e8431a3d9	test_backup: do_abort_restore: use logger rather than print Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-05 08:03:53 +02:00
Benny Halevy	acb2c9b045	test_backup: do_abort_restore: pass auto_rack_dc to servers_add To generate multi-rack cluster, otherwise we get the following error: ``` E cassandra.protocol.ConfigurationException: <Error from server: code=2300 [Query invalid because of configuration issue] message="Replication factor 3 exceeds the number of racks (1) in dc datacenter1"> ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-05 08:03:53 +02:00
Dani Tweig	1ef6ac5439	consolidating jira automation to one workflow file Closes scylladb/scylladb#27854	2026-01-05 07:09:03 +02:00
copilot-swe-agent[bot]	4e41b6f106	tools/scylla-nodetool: Increase precision of compression ratio from 1 to 2 decimal places In the tablestats (cfstats) command. Fixes: https://github.com/scylladb/scylladb/issues/27962 Closes scylladb/scylladb#27965	2026-01-05 07:07:06 +02:00
Avi Kivity	e03d24e3f3	Merge 'Use file_stat with a relative path when listing directories' from Benny Halevy With the additional file_stat overload introduced in [Update seastar submodule](`3e9b071838`), use the opened directory for more efficient, relative-path based stat. * Enhancement, no backport needed Closes scylladb/scylladb#27967 * github.com:scylladb/scylladb: table: get_snapshot_details: use relative-path based file_stat table: get_snapshot_details: fix warning in exists_in_dir table: get_snapshot_details: fix staging dir calculation backup: process_snapshot_dir: use relative-path based file_stat directory_lister: add ctor with opened directory	2026-01-04 22:06:34 +02:00
Nadav Har'El	c4a9d7eb3e	cql: fix DESC KEYSPACES when a "USE" is in effect If a CQL session USEs a keyspace and then calls DESC TABLES, the user expects to see only the tables in the chosen keyspace. However, calling DESC KEYSPACES should still list all the keyspaces - returning just the USEd one is not useful - and also not what Cassandra does. We had an xfailing test test_describe.py::test_keyspaces_with_use which reproduces this bug (and passes on Cassandra). In this patch we fix this bug. The fix is simple - USE should affect DESC statements, but be ignored for DESC KEYSPACES. We can then remove the xfail marker from the test. The patch also includes a new test for the DESC TABLES case, where the USE does have an effect, because I wanted to make sure the patch doesn't break this case. As usual, the new test passes on both Cassandra and ScyllaDB. Fixes #26334 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27971	2026-01-04 22:01:12 +02:00
Dawid Mędrek	77a934e5b9	db/hints: Prevent draining hints before hint replay is allowed Context ------- The procedure of hint draining boils down to the following steps: 1. Drain a hint sender. That should get rid of all hints stored for the corresponding endpoint. 2. Remove the hint directory corresponding to that endpoint. Obviously, it gets more complex than this high-level perspective. Without blurring the view, the relevant information is that step 1 in the algorithm above may not be executed. Breaking it down, it comprises two calls to `hint_sender::send_hints_maybe()`. The function is responsible for sending out hints, but it's not unconditional and will not be performed if any of the following bullets is not satisfied: * `hint_sender::replay_allowed()` is not `true`. This can happen when hint replay hasn't been turned on yet. * `hint_sender::can_send()` is not `true`. This can happen if the corresponding endpoint is not alive AND it hasn't left the cluster AND it's still a normal token owner. There is one more relevant point: sending hints can be stopped if replaying hints fails and `hint_sender::send_hints_maybe()` returns `false`. However, that's not possible in the case of draining. In that case, if Scylla comes across any failure, it'll simply delete the corresponding hint segment. Because of that, we ignore it and only focus on the two bullets. --- Why is it a problem? -------------------- If a hint directory is not purged of all hint segments in it, any attempt to remove it will fail and we'll observe an error like this: ``` Exception when draining <host ID>: std::filesystem::__cxx11::filesystem_error (error system:39, filesystem error: remove failed: Directory not empty [<path>]) ``` The folder with the remaining hints will also stay on disk, which is, of course, undesired. --- When can it happen? 
------------------- As highlighted in the Context section of this commit message, the key part of the code that can lead to a dangerous situation like that is `hint_sender::send_hints_maybe()`. The function is called twice when draining a hint endpoint manager: once to purge all of the existing hints, and another time after flushing all hints stored in commitlog instances, but not listed by `hint_sender` yet. If any of those calls misbehaves, we may end up with a problem. That's why it's crucial to ensure that the function always goes through ALL of the hints. Dangerous situations: 1. We try to drain hints before hint replay is allowed. That will violate the first bullet above. 2. The node we're draining is dead, but it hasn't left the cluster, and it still possesses some tokens. --- How do we solve that? --------------------- Hint replay is turned on in `main.cc`. Once enabled, it cannot be disabled. So to address the first bullet above, it suffices to ensure that no draining occurs beforehand. It's perfectly fine to prevent it. Soon after hint replay is allowed, `main.cc` also asks the hint manager to drain all of the endpoint managers whose endpoints are no longer normal token owners (cf. `db::hints::manager::drain_left_nodes()`). The other bullet is more tricky. It's important here to know that draining is only initiated in three situations: 1. As part of the call to `storage_service::notify_left()`. 2. As part of the call to `storage_service::notify_released()`. 3. As part of the call to `db::hints::manager::drain_left_nodes()`. The last one is trivially non-problematic. The nodes that it'll try to drain are no longer normal token owners, so `can_send()` must always return `true`. The second situation is similar. 
As we read in the commit message of scylladb/scylladb@eb92f50413, which introduced the notion of released nodes, the nodes are no longer normal token owners: > In this patch we postpone the hint draining for the "left" nodes to > the time when we know that the target nodes no longer hold ownership > of any tokens - so they're no longer referenced in topology. I'm > calling such nodes "released". I suggest reading the full commit message there because the problems there are somewhat similar to the ones these changes try to solve. Finally, the first situation: unfortunately, it's more tricky. The same commit message says: > When a node is being replaced, it enters a "left" state while still > owning tokens. Before this patch, this is also the time when we start > draining hints targeted to this node, so the hints may get sent before > the token ownership gets migrated to another replica, and these hints > may get lost. This suggests that `storage_service::notify_left()` may be called when the corresponding node still has some tokens! That's something that may prevent properly draining hints. Fortunately, no hope is lost. We only drain hints via `notify_left()` when hinted handoff hasn't been upgraded to being host-ID-based yet. If it has, draining always happens via `notify_released()`. At the time of writing this commit message, all of the supported versions of Scylla 2025.1+ use host-ID-based hinted handoff. That means that problems can only arise when upgrading from an older version of Scylla (2024.1 downwards). Because of that, we don't cover it. It would most likely require more extensive changes. --- Non-issues ---------- There are notions that are closely related to sending hints. One of them is the host filter that hinted handoff uses. It decides which endpoints are eligible for receiving hints, and which are not. Fortunately, all endpoints rejected by the host filter lose their hint endpoint managers -- they're stopped as part of that procedure. 
What's more, draining hints and changing the host filter cannot be happening at the same time, so it cannot lead to any problems. The solution ------------ To solve the described issue, we simply prevent draining hints before hint replay is allowed. No reproducer test is attached because it's not feasible to write one. Fixes scylladb/scylladb#27693 Closes scylladb/scylladb#27713	2026-01-04 16:54:05 +02:00
Benny Halevy	4d46674d03	table: get_snapshot_details: use relative-path based file_stat With the additional file_stat overload introduced in `3e9b071838`, use the opened directory for more efficient, relative-path based stat. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-04 11:05:56 +02:00
Benny Halevy	2d2177d2c9	table: get_snapshot_details: fix warning in exists_in_dir The functor is called both on the data directory as well as on the staging directory, so the warning printed if the found file is not the same inode should print the given path, not datadir / name (as was copy and pasted). Refs #27635 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-04 11:05:56 +02:00
Benny Halevy	240b32a87a	table: get_snapshot_details: fix staging dir calculation staging is based off of datadir, not snapshot_dir. the issue was introduced in `f5ca3657e2`. Refs #27635 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-04 11:05:56 +02:00
Benny Halevy	1a08ef2062	backup: process_snapshot_dir: use relative-path based file_stat With the additional file_stat overload introduced in `3e9b071838`, use the opened directory for more efficient, relative-path based stat. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-01-04 11:05:56 +02:00
Benny Halevy	8d00266f88	directory_lister: add ctor with opened directory This ctor allows the caller to open the directory first, on its own, and pass it down to the directory_lister. Once all callers use this ctor we can get rid of the delayed open in the get() method. Also, it can be used to replace full-path based file_stat calls on listed entries with file_stat(directory, name) calls that are based on statat() and a relative path name that is present in the listed directory entry. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> sq	2026-01-04 11:05:18 +02:00
Amnon Heiman	c6d1c63ddb	distributed_loader: system_replicated_keys as system keyspace This patch adds system_replicated_keys to the list of known system keyspaces. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-01-02 16:41:47 +02:00
Amnon Heiman	83c1103917	replicated_key_provider: make KSNAME public Move KSNAME constant from internal static to public member of replicated_key_provider_factory class. It will be used to identify it as a system keyspace. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2026-01-02 16:39:51 +02:00
Dawid Pawlik	c0b06a7fc6	docs: add vector similarity functions documentation Add documentation in `functions.rst` as the CQL reference for the vector similarity functions. This includes the syntax, example usage, and prerequisites for the parameters.	2026-01-02 13:02:59 +01:00
Dawid Pawlik	115bd51873	test/cqlpy: add similarity functions correctness tests Add `calculate_similarity` function for testing purposes. Add tests checking if CQL returned values match the calculated ones with the precision up to 5th decimal place. The tests should also be run on Cassandra to check compatibility with their responses.	2026-01-02 13:02:59 +01:00
Dawid Pawlik	12aa33106f	test/cqlpy: add similarity functions invalid call tests Add tests checking that calling similarity functions with: - non-vector columns - non-vector values - vectors with mismatching dimensions as arguments fails.	2026-01-02 12:49:22 +01:00
Dawid Pawlik	b03d520aff	cql3: introduce similarity functions syntax The similarity function syntax is: `similarity_<metric_name>(<vector>, <vector>)` Where `<metric_name>` is one of `cosine`, `euclidean` and `dot_product` matching the intended similarity metric to be used within calculations. Where `<vector>` is either a vector column name or vector literal. Add `vectorSimilarityArgs` symbol that is an extension of `selectionFunctionArgs`, but allows using the `value` as an argument as well as the `unaliasedSelector`. This is needed as the similarity function syntax allows both the arguments to be a vector value, so the grammar needs to recognize the vector literal there as well. Since we actually support `SELECT`s with constants as of this patch, return true instead of throwing an error while trying to convert the function call to constant.	2026-01-02 12:48:43 +01:00
Dawid Pawlik	5b2b8d596a	vector_similarity_fcts: introduce similarity functions This patch introduces scalar functions `similarity_cosine()`, `similarity_euclidean()`, and `similarity_dot_product()` which should return a float - similarity of the given vectors calculated according to the function's similarity metric. The argument types of this function are retrieved with the `retrieve_vector_arg_types`, but shall be assignable to `vector<float, N>` where `N` is the same for both arguments. This patch introduces a dimensionality check during the execution of those functions.	2026-01-02 12:48:43 +01:00
Dawid Pawlik	b72df3ae27	vector_similarity_fcts: retrieve similarity function argument types This patch retrieves the argument types for similarity functions. Newly introduced `retrieve_vector_arg_types` function checks if the provided arguments are vectors of floats and if both the vector values match the same type (dimension). If so, we know the exact type and set it as the function arguments type. Otherwise, if the exact type is unknown, but we can assign to vector<float, N> then the dimensionality check will be done during execution of the similarity function. This also takes care of null values and bind variables the same way as implemented in Cassandra to stay compatible. Meaning that if we can infer the type from one argument, then the latter may be unknown (null or ?). Additionally this patch adds `test_assignment_any_vector` function which tests the weak assignment to vector<float, N> as mentioned above.	2026-01-02 12:48:43 +01:00
Dawid Pawlik	2bedefbb85	vector_similarity_fcts: add calculating similarity between vectors This commit introduces `compute_cosine_similarity`, `compute_euclidean_similarity`, `compute_dot_product_similarity` functions to calculate the vectors similarity in respective metric. The similarity is a float value meaning how similar the vectors are in a range of [0, 1]. Values closer to 1 indicate greater similarity. The `dot_product` similarity requires L2 normalized vectors as arguments. The similarity is calculated based on the jVector's implementation used by Cassandra. `f967f1c924/jvector-base/src/main/java/io/github/jbellis/jvector/vector/VectorSimilarityFunction.java (L36-L69)`	2026-01-02 12:48:08 +01:00
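The similarity metrics described in the commit above can be sketched in a few lines of Python. This is an illustration only, not the Scylla implementation: the mapping formulas (`(1 + cos) / 2`, `(1 + dot) / 2`, `1 / (1 + squared distance)`) are an assumption about how the referenced jVector code maps raw values to a similarity score, and the function names below are hypothetical helpers rather than the C++ `compute_*_similarity` functions. The zero-vector and dimension checks are likewise assumptions made for the sketch.

```python
import math


def _dot(a, b):
    """Plain dot product; both vectors must have the same dimension."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def similarity_cosine(a, b):
    """Assumed mapping: cosine in [-1, 1] is shifted into [0, 1]."""
    norm = math.sqrt(_dot(a, a) * _dot(b, b))
    if norm == 0.0:
        # Assumption for this sketch: cosine is undefined for a zero vector.
        raise ValueError("cosine similarity is undefined for zero vectors")
    return (1.0 + _dot(a, b) / norm) / 2.0


def similarity_dot_product(a, b):
    """Assumed mapping: expects L2-normalized inputs, as the commit notes."""
    return (1.0 + _dot(a, b)) / 2.0


def similarity_euclidean(a, b):
    """Assumed mapping: identical vectors give 1, distant ones approach 0."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    squared_distance = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + squared_distance)
```

Under these assumptions, identical unit vectors score 1 in every metric, while opposite unit vectors score 0 for the cosine and dot-product variants.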
Nadav Har'El	6c8ddfc018	test/alternator: fix typo in test_returnvalues.py Different DynamoDB operations have different settings allowed for their "ReturnValues" argument. In particular, some operations allow ReturnValues=UPDATED_OLD but the DeleteItem operation does not. We have a test, test_delete_item_returnvalues, aimed to verify this but it had a typo and didn't actually check "UPDATED_OLD". This patch fixes this typo. The test still passes because the code itself (executor.cc, delete_item_operation's constructor) has the correct check - it was just the test that was wrong. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27918	2026-01-01 19:33:23 +02:00
Israel Fruchter	40ada3f187	Update tools/cqlsh submodule (v6.0.32) * tools/cqlsh scylladb/scylla-cqlsh@9e5a91d7...scylladb/scylla-cqlsh@5a1d7842 (9): > fix wrong reference in copyutil.py > Add GitHub Action workflow to create releases on new tags > test_copyutil.py: introdcue test for ImportTask > fix(copyutil.py): avoid situatuions file might be move withing multiple processes > Fix Unix socket port display in show_host() method > Merge pull request #157 from scylladb/alert-autofix-1 .github/workflows/build-push.yml: Potential fix for code scanning alert no. 1: Workflow does not contain permissions > .github/workflows/dockerhub-description.yml: Potential fix for code scanning alert no. 9: Workflow does not contain permissions > test_cqlsh_output: skip some cassandra 5.0 table options > tests: template compression cql to use `class` insted of `sstable_comprission` > Pin Cassandra version to 5.0 for reproducible builds > Remove scylla-enterprise integration test and update Cassandra to latest Closes scylladb/scylladb#27924	2026-01-01 19:30:34 +02:00
Łukasz Paszkowski	76b84b71d1	storage/test_out_of_space_prevention.py: Fix async/await bugs - Add missing await keywords for async operations on s2_log.wait_for() and coord_log.wait_for() - Fix incorrect regex: "compaction .* Split {cf}" → "compaction.*Split {cf}" - The commit https://github.com/scylladb/scylladb/commit/f7324a4 demoted compaction start/end log messages to debug level. Hence add compaction=debug log messages to the following tests: test_split_compaction_not_triggered test_node_restart_while_tablet_split test_repair_failure_on_split_rejection Fixes https://github.com/scylladb/scylladb/issues/27931 Closes scylladb/scylladb#27932	2026-01-01 14:24:30 +02:00
Anna Stuchlik	624869de86	doc: remove cassandra-stress from installation instructions The cassandra-stress tool is no longer part of the default package and cannot be run in the way described. This commit removes the instruction to run cassandra-stress. Fixes https://github.com/scylladb/scylladb/issues/24994 Closes scylladb/scylladb#27726	2026-01-01 14:20:58 +02:00
Jenkins Promoter	69d6e63a58	Update pgo profiles - aarch64	2026-01-01 05:10:51 +02:00
Jenkins Promoter	d6e2d3d34c	Update pgo profiles - x86_64	2026-01-01 04:27:14 +02:00
Nadav Har'El	e28df9b3d0	test: fix Python warnings in regular expressions Like C, Python supports some escape sequences in strings such as the familiar "\n" that converts to a newline character. Originally, when backslash was used before a random character, for example, "\.", Python used to just use these literal characters backslash and dot, in the string - and not make a fuss about it. This made it ok to use a string like "hi\.there" as a regular expression. We have a few instances of this in our Python tests. But recent releases of Python started to produce ugly warnings about these cases. The error message looks like: SyntaxWarning: "\." is an invalid escape sequence. Such sequences will not work in the future. Did you mean "\\."? A raw string is also an option. Indeed in most cases the easiest solution is to use a "raw string", a string literal preceded with r. For example, r"hi\.there". In such strings Python doesn't replace escape sequences like \n in the string, and also leaves the \. unchanged for the regular expression to see. So in this patch we use raw strings in all places in test/ where Python warns about this problem. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27856	2025-12-31 20:44:01 +02:00
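A small sketch of the raw-string fix described in the commit above. The pattern `hi\.there` comes from the commit message; the constant and function names are hypothetical and exist only for this illustration.

```python
import re

# Raw string: the backslash reaches the regex engine untouched,
# and Python emits no SyntaxWarning about an invalid escape sequence.
PATTERN_RAW = r"hi\.there"

# Equivalent spelling with an explicitly doubled backslash.
PATTERN_ESCAPED = "hi\\.there"


def matches_literal_dot(text):
    """Return True only if `text` is exactly 'hi.there' (dot is literal)."""
    return re.fullmatch(PATTERN_RAW, text) is not None
```

Both spellings produce the same nine-character string, so the choice is purely about readability; the raw form avoids having to double every backslash in a regular expression.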
Yaniv Michael Kaul	597d300527	main.cc: remove warning: 'metric_help' is deprecated Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Backport: no, benign issue. Closes scylladb/scylladb#27680	2025-12-31 18:36:55 +02:00
Avi Kivity	b690ddb9e5	tools: toolchain: dbuild: bind-mount full ~/.cache to container In `afb96b6387`, we added support for sccache. As a side effect it changed the method of invoking ccache from transparent via PATH (if it contains /usr/lib64/ccache) to explicit, by changing the compiler command line from 'clang++' (which may or may not resolve to the ccache binary) to 'ccache /usr/local/bin/clang++', which always invokes ccache. In the default dbuild configuration, PATH does not contain /usr/lib64/ccache, so ccache isn't invoked by default. Users can change this via the SCYLLADB_DBUILD environment variable. As a result of ccache being suddenly enabled for dbuild builds, ccache will now attempt to create ~/.cache/ccache. Under docker, this does not work, because we bind-mount ~/.cache/dbuild. Docker will create the intermediate ~/.cache, but under the root user, not $USER. The intermediate directory being root-owned prevents ~/.cache/ccache from being created. Under podman, this does work, because everything runs under the container's root user. The fix is to bind-mount the entire ~/.cache into the container. This not only lets ccache create the directory, it will also find an existing ~/.cache/ccache directory and use it, enabling reuse across invocations. Since ccache will not respect configuration changes without access to its configuration file (notably, the maximum cache size), we also bind-mount ~/.config. Since ~/.cache and ~/.config are not automatically created, we create them explicitly so the bind mounts can work. This is for new nodes enlisted from the cloud; developer machines will have those directories preexisting. Note that the ccache directory used to be ~/.ccache, but was later changed. Had the author known, we would have bind-mounted ~/.cache much earlier. Fixes #27919. Closes scylladb/scylladb#27920	2025-12-31 14:08:41 +01:00
Asias He	3abda7d15e	topology_coordinator: Ensure repair_update_compaction_ctrl is executed Consider this: - n1 is a coordinator and schedules tablet repair - n1 detects tablet repair failed, so it schedules tablet transition to end_repair state - n1 loses leadership and n2 becomes the new topology coordinator - n2 runs end_repair on the tablet with session_id=00000000-0000-0000-0000-000000000000 - when a new tablet repair is scheduled, it hangs since the lock is already taken because it was not removed in previous step To fix, we use the global_tablet_id to index the lock instead of the session id. In addition, we retry the repair_update_compaction_ctrl verb in case of error to ensure the verb is eventually executed. The verb handler is also updated to check if it is still in end_repair stage. Fixes #26346 Closes scylladb/scylladb#27740	2025-12-31 13:17:18 +01:00
Benny Halevy	3e9b071838	Update seastar submodule * seastar f0298e40...4dcd4df5 (29): > file: provide a default implementation for file_impl::statat > util: Genralize memory_data_sink > defer: Replace static_assert() with concept > treewide: drop the support of fmtlib < 9.0.0 > test: Improve resilience of netsed scheduling fairness test > Merge 'file: Use query_device_alignment_info in blkdev_alignments ' from Kefu Chai file: Put alignment helpers in anonymous namespace file: Use query_device_alignment_info in blkdev_alignments > Merge 'file: Query physical block size and minimum I/O size' from Kefu Chai file: Apply physical_block_size override to filesystem files file: Use designated initializers in xfs_alignments iotune: Add physical block size detection disk_params: Add support for physical_block_size overrides from io_properties.yaml block_device: Query alignment requirements separately for memory and I/O > Merge 'json: formatter: fix formatting of std:string_view' from Benny Halevy json: formatter: fix formatting of std:string_view json: formatter: make sure std::string_view conforms to is_string_like Fixes #27887 > demos:improve the output of demo_with_io_intent() in file_demo > test: Add accept() vs accept_abort() socket test > file: Refine posix_file_impl alignments initialization > Add file::statat and a corresponding file_stat overload > cmake: don't compile memcached app for API < 9 > Merge 'Revert to ~old lifetime semantics for lvalues passed to then()-alikes' from Travis Downs future: adjust lifetime for lvalue continuations future: fix value class operator() > pollable_fd: Unfriend everything > Merge 'file: experimental_list_directory: use buffered generator' from Benny Halevy file: experimental_list_directory: use buffered generator file: define list_directory_generator_type > Merge 'Make datagram API use temporary_buffer<>-s' from Pavel Emelyanov net: Deprecate datagram::get_data() returning packet memcache: Fix indentation after previous patch 
memcache: Use new datagram::get_buffers() API dns: Use new datagram::get_buffers() API tests: Use new datagram::get_buffers() API demo: Use new datagram::get_buffers() API udp: Make datagram implementations return span of temporary_buffer-s > Merge 'Remove callback from timer_set::complete()' from Pavel Emelyanov reactor: Fix indentation after previous patch timers: Remove enabling callback from timer_set::complete() > treewide: avoid 'static sstring' in favor of 'constexpr string_view' > resource: Hide hwloc from public interface > Merge 'Fix handle_exception_type for lvalues' from Travis Downs futures_test: compile-time tests function_traits: handle reference_wrapper > posix_data_sink_impl: Assert to guard put UB > treewide: fix build with `SEASTAR_SSTRING` undefined > avoid deprecation warnings for json_exception > `util/variant_utils`: correct type deduction for `seastar::visit` > net/dns: fixed socket concurrent access > treewide: add missing headers > Merge 'Remove posix file helper file_read_state class' from Pavel Emelyanov file: Remove file_read_state test: Add a test for posix_file_impl::do_dma_read_bulk() > membarrier: simplify locking Adjust scylla to the following changes in seastar: - file_stat became polymorphic - needs explicit inference in table::snapshot_exists, table::get_snapshot_details - file::experimental_list_directory now returns list_directory_generator_type Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27916	2025-12-30 19:37:13 +03:00
Yaniv Kaul	0264ec3c1d	test: test_downgrade_after_partial_upgrade: check that feature is disabled on all nodes after partial upgrade We should check that the test feature is disabled on all nodes after a partial upgrade. This hardens the test a bit, although the old code wasn't that bad, since enabled features are a part of the group 0 state shared by all nodes. Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27654	2025-12-30 17:34:56 +01:00
Nadav Har'El	ffcce1ffc8	test/boost: fix flaky test node_view_update_backlog The boost test view_schema_test.cc::node_view_update_backlog can be flaky if the test machine has a hiccup of 100ms, and this patch fixes it: The test is a unit test for db::view::node_update_backlog, which is supposed to cache the backlog calculation for a given interval. The test asks to cache the backlog for 100ms, and then without sleeping at all tries to fetch a value again and expects the unchanged cached value to be returned. However, if the test run experiences a context switch of 100ms, it can fail, and it did once as reported in #27876. The fix is to change the interval in this test from 100ms to something much larger, like 10 seconds. We don't sleep this amount - we just need the second fetch to happen before 10 seconds has passed, so there's no harm in using a very large interval. However, the second half of this test wants to check that after the interval is over, we do get a new backlog calculation. So for the second half of this test we can and should use a shorter interval - e.g., 10ms. We don't care if the test machine is slow or context switched, for this half of the test we want to sleep more than 10ms, and that's easy. The fixed test is faster than the old one (10ms instead of 100ms) and more reliable on a shared test machine. Fixes #27876. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27878	2025-12-30 10:10:42 +01:00
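The interval-based caching that the test above exercises can be sketched as follows. This is a hypothetical Python stand-in, not the C++ `db::view::node_update_backlog` class: the class name, constructor parameters, and the injectable clock are all assumptions made for the illustration. Injecting the clock shows why a long interval makes the "value is still cached" half of a test immune to scheduling hiccups.

```python
import time


class IntervalCachedValue:
    """Caches the result of `compute` for `interval` seconds (hypothetical helper)."""

    def __init__(self, compute, interval, clock=time.monotonic):
        self._compute = compute
        self._interval = interval
        self._clock = clock
        self._value = None
        self._expires_at = None  # None means nothing has been cached yet

    def get(self):
        now = self._clock()
        if self._expires_at is None or now >= self._expires_at:
            # The cached value is missing or stale: recompute and restart the interval.
            self._value = self._compute()
            self._expires_at = now + self._interval
        return self._value


class FakeClock:
    """Manually advanced clock, so tests never depend on real sleeps."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now
```

With a 10 second interval, a second `get()` returns the cached value even if the clock jumps by 100ms; the recomputation path is checked separately by advancing the clock past the interval.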
Benny Halevy	c9eab7fbd4	test: test_refresh: add test_refresh_deletes_uploaded_sstables The refresh api is expected to automatically delete the sstable files from the uploads/ dir. Verify that. The code that does that is currently called by sstables_loader::load_new_sstables: ```c++ if (load_and_stream) { ... co_await loader.load_and_stream(ks_name, cf_name, table_id, std::move(sstables_on_shards[this_shard_id()]), primary_replica_only(primary), true /* unlink */, scope, {}); ``` Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27586	2025-12-30 10:51:24 +03:00
Nadav Har'El	80e5860a8c	docs/alternator: document that Streams needs vnodes The current state (after PR #26836) is that Alternator tables are created by default using tablets. But due to issue #23838, Alternator Streams cannot be enabled on a table that uses tablets... An attempt to enable Streams on such a table results in a clear error: "Streams not yet supported on a table using tablets (issue #23838). If you want to use streams, create a table with vnodes by setting the tag 'system:initial_tablets' set to 'none'." But users should be able to learn this fact from the documentation - not just retroactively from an error message. This is especially important because a user might create and fill a table using tablets, and only get this error when attempting to enable Streams on the existing table - when it is too late to change anything. So this patch adds a paragraph on this to compatibility.md, where several other requirements of Alternator Streams are already mentioned. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27000	2025-12-30 10:45:34 +03:00
Avi Kivity	853f3dadda	Merge 'treewide: fix some spelling errors' from Piotr Smaron Irritated by prevailing spellchecker comments attached to every PR, I aim to fix them all. No need to backport, just cosmetic changes. Closes scylladb/scylladb#27897 * github.com:scylladb/scylladb: treewide: fix some spelling errors codespell: ignore `iif` and `tread`	2025-12-29 20:45:31 +02:00
Patryk Jędrzejczak	0fed9f94f8	gossiper: add_saved_endpoint: make generations of excluded nodes negative The explanation is in the new comment in `gossiper::add_saved_endpoint`. We add a test for this change. It's "extremely white-box", but it's better than nothing.	2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak	749b0278e5	test: introduce test_full_shutdown_during_replace	2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak	4526dd93b1	utils: error_injection: allow aborting wait_for_message The test added in the following commit utilizes it.	2025-12-29 19:13:55 +01:00
Patryk Jędrzejczak	fc4c2df2ce	raft topology: preserve IP -> ID mapping of a replacing node on restart We currently do it only for a bootstrapping node, which is a bug. The missing IP can cause an internal error, for example, in the following scenario: - replace fails during streaming, - all live nodes are shut down before the rollback of replace completes, - all live nodes are restarted, - live nodes start hitting internal error in all operations that require IP of the replacing node (like client requests or REST API requests coming from nodetool). We fix the bug here, but we do it separately for replace with different IP and replace with the same IP. For replace with different IP, we persist the IP -> host ID mapping in `system.peers` just like for bootstrap. That's necessary, since there is no other way to determine IP of the replacing node on restart. For replace with the same IP, we can't do the same. This would require deleting the row corresponding to the node being replaced from `system.peers`. That's fine in theory, as that node is permanently banned, so its IP shouldn't be needed. Unfortunately, we have many places in the code where we assume that IP of a topology member is always present in the address map or that a topology member is always present in the gossiper endpoint set. Examples of such places: - nodetool operations, - REST API endpoints, - `db::hints::manager::store_hint`, - `group0_voter_handler::update_nodes`. We could fix all those places and verify that drivers work properly when they see a node in the token metadata, but not in `system.peers`. However, that would be too risky to backport. We take a different approach. We recover IP of the replacing node on restart based on the state of the topology state machine and `system.peers` just after loading `system.peers`. We rely on the fact that group 0 is set up at this point. The only case where this assumption is incorrect is a restart in the Raft-based recovery procedure. 
However, hitting this problem then seems improbable, and even if it happens, we can restart the node again after ensuring that no client and REST API requests come before replace is rolled back on the new topology coordinator. Hence, it's not worth complicating the fix (by e.g. looking at the persistent topology state instead of the in-memory state machine).	2025-12-29 19:13:53 +01:00
Patryk Jędrzejczak	4e63e74438	messaging: improve the error messages of closed_errors The default error message of `closed_error` is "connection is closed". It lacks the host ID and the IP address of the connected node, which makes debugging harder. Also, it can be more specific when `closed_error` is thrown due to the local node shutting down. Fixes #16923 Closes scylladb/scylladb#27699	2025-12-29 18:36:07 +02:00
Avi Kivity	567c28dd0d	Merge 'Decouple sstables::storage::snapshot() and ::clone() functionality' from Pavel Emelyanov The storage::snapshot() is used in two different modes -- one to save an sstable as a snapshot somewhere, and another one to create a copy of an sstable. The latter use-case is "optimized" by snapshotting an sstable under a new generation, but it's only true for local storage. Although snapshot is not implemented for S3 storage, _cloning_ an sstable stored on S3 is not necessarily going to be the same as doing a snapshot. Another sign of snapshot and clone being different is that calls to snapshot() for snapshotting itself and for cloning use two very different sets of arguments -- snapshotting specifies a relative name and omits the new generation, while cloning doesn't need "name" and instead provides a generation. Recently (#26528) cloning got an extra "leave_unsealed" tag that makes no sense for snapshotting. Having said that, this PR introduces the sstables::storage::clone() method and modifies both callers and implementations according to the above features of each. As a result, code logic in both methods becomes much simpler and a bunch of bool classes and "_tag" helper structures go away. Improving internal APIs, no need to backport Closes scylladb/scylladb#27871 * github.com:scylladb/scylladb: sstables, storage: Drop unused bool classes and tags sstables/storage: Drop create_links_common() overloads sstable: Simplify storage::snapshot() sstables: Introduce storage::clone()	2025-12-29 17:50:54 +02:00
Avi Kivity	9927c6a3d4	Merge 'Reapply "audit: enable some subset of auditing by default"' from Piotr Smaron This reverts commit a5edbc7d612df237a1dd9d46fd5cecf251ccfd13. <h3>Why re-enabling table audit</h3> Audit has been disabled (scylladb/scylla-enterprise/pull/3094) over many concerns raised against the table implementation, e.g. scylladb/scylla-enterprise/issues/2939 / scylladb/scylla-enterprise/issues/2759 + there's a whole outstanding backlog of issues. One of the concerns was also a possible loss of availability, and since then we migrated audit keyspace from SimpleStrategy RF=1 to NetworkTopologyStrategy RF=3 (scylladb/scylla-enterprise/pull/3399) and stopped failing queries when auditing fails (scylladb/scylla-enterprise/pull/3118 & scylladb/scylla-enterprise/pull/3117), which improves the situation but doesn't address all the concerns. Eventually we want to use syslog as audit's sink, but it's not fully ready just yet, and so we'll restore table audit for now to increase the security, but later switch to syslog. BTW, cloud will enable table audit for AUTH category scylladb/sre-ops-automation/issues/2970 separately from this effort. <h3>Performance considerations</h3> We are assuming that the events for the enabled categories, i.e. DCL, DDL, AUTH & ADMIN, should appear at about the same, low cadence, with AUTH perhaps having the biggest impact of them all under some workloads. The performance penalty of enabling just the AUTH category [has been measured](https://scylladb.atlassian.net/wiki/spaces/RND/pages/148308005/Audit+performance+impact+test) and while authentication throughput and read/write throughput remain stable, the queries' P99 latency may degrade by a couple of % in the most hardcore scenarios. Fixes: https://github.com/scylladb/scylladb/issues/26020 Gradually re-enabling audit feature, no need to backport. 
Closes scylladb/scylladb#27262 * github.com:scylladb/scylladb: doc: audit: set audit as enabled by default Reapply "audit: enable some subset of auditing by default"	2025-12-29 16:41:04 +02:00
Tomasz Grabiec	bbf9ce18ef	Merge 'load_balancer: compute node load based on tablet sizes' from Ferenc Szili Currently, the tablet load balancer performs capacity based balancing by collecting the gross disk capacity of the nodes, and computes balance assuming that all tablet sizes are the same. This change introduces size-based load balancing. The load balancer does not assume identical tablet sizes any more, and computes load based on actual tablet sizes. The size-based load balancer computes the difference between the most and least loaded nodes in the balancing set (nodes in DC, or nodes in a rack in case of `rf-rack-valid-keyspaces`) and stops further balancing if this difference is below the config option `size_based_balance_threshold_percentage`. This config option does not apply to the absolute load, but instead to the percentage of how much the most loaded node is more loaded than the least loaded node: `delta = (most_loaded - least_loaded) / most_loaded` If this delta is smaller than the config threshold, the balancer will consider the nodes balanced. This PR is a part of a series of PRs which are based on top of each other. - First part for tablet size collection via load_stats: #26035 - Second part reconcile load_stats: #26152 - The third part for load_sketch changes: #26153 - The fourth part which performs tablet load balancing based on tablet size: #26254 - The fifth part changes the load balancing simulator: #26438 This is a new feature, backport is not needed. Fixes #26254 Closes scylladb/scylladb#26254 * github.com:scylladb/scylladb: test, load balancing: add test for table balance load_balancer: add cluster feature for size based balancing load_balancer: implement size-based load balancing config: add size based load balancing config params load_stats: use trinfo to decide how to reconcile tablet size load_sketch: use tablet sizes in load computation load_stats: add get_tablet_size_in_transition()	2025-12-29 15:01:38 +01:00
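The stopping rule quoted in the merge above, `delta = (most_loaded - least_loaded) / most_loaded`, can be worked through with a short sketch. The function names are hypothetical, and two details are assumptions rather than facts from the commit: that the threshold option is expressed in percent (so the delta is scaled by 100 before comparing), and that an all-zero load set counts as balanced.

```python
def load_delta(loads):
    """Relative gap between the most and least loaded node in a balancing set."""
    most_loaded = max(loads)
    least_loaded = min(loads)
    if most_loaded == 0:
        # Assumption for this sketch: no load anywhere means nothing to balance.
        return 0.0
    return (most_loaded - least_loaded) / most_loaded


def is_balanced(loads, threshold_percentage):
    """True when the relative gap falls below the threshold (assumed to be in percent)."""
    return load_delta(loads) * 100.0 < threshold_percentage
```

For example, with loads of 100 and 80 the delta is 0.2, so a hypothetical 25% threshold would treat the set as balanced, while loads of 100 and 50 (delta 0.5) would not.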
Pavel Emelyanov	d892140655	Merge 'Reduce allocations when traversing compaction_groups' from Benny Halevy - table, storage_group: add compaction_group_count - And use to reserve vector capacity before adding an item per compaction_group - table: reduce allocations by using for_each_compaction_group rather than compaction_groups() - compaction_groups() may allocate memory, but when called from a synchronous call site, the caller can use for_each_compaction_group instead. * Improvement, no backport needed Closes scylladb/scylladb#27479 * github.com:scylladb/scylladb: table: reduce allocations by using for_each_compaction_group rather than compaction_groups() replica: storage_group: rename compaction_groups to compaction_groups_immediate	2025-12-29 16:26:33 +03:00
Gleb Natapov	4a5292e815	raft topology: Notify that a node was removed only once Raft topology goes over all nodes in a 'left' state and triggers 'remove node' notification in case id/ip mapping is available (meaning the node left recently), but the problem is that, since the mapping is not removed immediately, when multiple nodes are removed in succession a notification for the same node can be sent several times. Fix that by sending notification only if the node still exists in the peers table. It will be removed by the first notification and following notification will not be sent. Closes scylladb/scylladb#27743	2025-12-29 14:22:34 +01:00
Piotr Smaron	fb4d89f789	treewide: fix some spelling errors	2025-12-29 13:53:56 +01:00
Piotr Smaron	ba5c70d5ab	codespell: ignore `iif` and `tread` These are correct: - iif is a Boost header name - `tread carefully` is an actual English phrase	2025-12-29 13:53:56 +01:00
Nadav Har'El	8df9cfcde8	Merge 'Add table size bytes to describe table' from Radosław Cybulski Add table size to DescribeTable's reply in Alternator Fills DescribeTable's reply with missing field TableSizeBytes. - add helper class simple_value_with_expiry, which is like std::optional but the value put has a timeout. - add ignore_errors to estimate_total_sstable_volume function - if set to true the function will catch errors during RPC and ignore them, substituting 0 for missing value. - add a reference to storage_service to executor class (needed to call estimate_total_sstable_volume function). - add fill_table_description and create_table_on_shard0 as non static methods to executor class - calculate TableSizeBytes value for a given table and return it as part of DescribeTable's return value. The value calculated is cached for approximately 6 hours (as per DescribeTable's specification). The algorithm is as follows: - if the requested value is in cache and is still valid it's returned, nothing else happens. - otherwise: - every shard of every node is requested to calculate size of its data - if an error happens, the error is ignored and we assume the given shard has a size of 0 - all such values are summed producing total size - produced value is returned to a caller - on the node the call for a size happened every shard is requested to cache produced value with a 6 hour timeout. - if the next call comes for a different shard on the same node that doesn't yet have a cached value, the shard will request the value to be calculated again. The new value will overwrite the old one on every shard on this node. - if the next call comes to a different node, the process of calculation will happen from start, possibly producing different value. The value will have its own timeout, there's no attempt made to synchronize value between nodes. 
- add an alternator_describe_table_info_timeout_in_seconds parameter, which controls how long DescribeTable's table information is held in cache. Default is 6 hours. - update test to use parameter `alternator_describe_table_info_timeout_in_seconds` - setting it to 0 and forcing flushing memtables to disk allows checking that table size has grown. Fixes #7551 Closes scylladb/scylladb#24634 * github.com:scylladb/scylladb: alternator: fix invalid rebase Update tests Update documentation Add table size to DescribeTable's output Promote fill_table_description and create_table_on_shard0 to methods Modify estimate_total_sstable_volume to opt ignore errors Add alternator_describe_table_info_cache_validity_in_seconds config option Add ref to service::storage_service to executor Add simple_value_with_expiry util class	2025-12-29 14:47:36 +02:00
Benny Halevy	f60033db63	db: system_keyspace: get_group0_history: unfreeze_gently Prevent stall when the group0 history is too long using unfreeze_gently rather than the synchronous unfreeze() function Fixes #27872 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27873	2025-12-29 12:00:54 +02:00
copilot-swe-agent[bot]	d25d295e84	alternator/server: update SSL comment	2025-12-29 09:34:08 +01:00
Radosław Cybulski	df20f178aa	alternator: fix invalid rebase Fix an invalid rebase, that would properly merge code coming from master, except that code would ignore refactor done in the patch.	2025-12-29 08:33:10 +01:00
Radosław Cybulski	a31c8762ca	Update tests	2025-12-29 08:33:09 +01:00
Radosław Cybulski	5e1254eef0	Update documentation	2025-12-29 08:33:08 +01:00
Radosław Cybulski	a86b782d3f	Add table size to DescribeTable's output Add a table size to DescribeTable's output.	2025-12-29 08:33:07 +01:00
Radosław Cybulski	1bd855a650	Promote fill_table_description and create_table_on_shard0 to methods Promote `executor::fill_table_description` and `executor::create_table_on_shard0` to methods (from static functions).	2025-12-29 08:33:06 +01:00
Radosław Cybulski	6a26381f4f	Modify estimate_total_sstable_volume to opt ignore errors Modify `storage_service::estimate_total_sstable_volume` function to optionally ignore errors (instead substitute 0), when `ignore_errors` parameter is set to `yes`.	2025-12-29 08:33:06 +01:00
Radosław Cybulski	a532fc73bc	Add alternator_describe_table_info_cache_validity_in_seconds config option Add an `alternator_describe_table_info_cache_validity_in_seconds` configuration option with a default value of 6 hours.	2025-12-29 08:33:05 +01:00
Radosław Cybulski	e246abec4d	Add ref to service::storage_service to executor Add a reference to `service::storage_service` to executor object.	2025-12-29 08:33:03 +01:00
Radosław Cybulski	dfa600fb8f	Add simple_value_with_expiry util class Add a `simple_value_with_expiry` utility class, which functions like a `std::optional` with an added timeout. When emplacing a value, the user needs to provide a timeout, after which the value expires (in which case the `simple_value_with_expiry` object behaves as if it was never set at all). Add boost tests for the new class.	2025-12-29 08:32:52 +01:00
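The expiry behaviour described in the commit above can be illustrated with a minimal Python sketch. This is not the C++ `simple_value_with_expiry` class from the patch; the class name `ValueWithExpiry`, its method names, and the injectable `clock` parameter are all assumptions made for illustration.

```python
import time


class ValueWithExpiry:
    """Hypothetical optional-like holder whose value expires after a timeout."""

    def __init__(self, clock=time.monotonic):
        # The clock is injectable so the behaviour can be checked without sleeping.
        self._clock = clock
        self._value = None
        self._deadline = None

    def emplace(self, value, timeout_seconds):
        # Store the value together with the moment at which it stops being valid.
        self._value = value
        self._deadline = self._clock() + timeout_seconds

    def get(self):
        # Once the deadline has passed, behave as if the value was never set.
        if self._deadline is None or self._clock() >= self._deadline:
            self._value = None
            self._deadline = None
            return None
        return self._value
```

A caller would `emplace()` a freshly computed value with the configured cache validity and recompute whenever `get()` returns `None`.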
Pavel Emelyanov	2e33234e91	util: Remove lister::rmdir() There's seastar helper that does the same, no need to carry yet another implementation Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27851	2025-12-28 19:46:19 +02:00
Avi Kivity	63e3a22f2e	Merge 'group0_state_machine: don't update in-memory state machine until start' from Piotr Dulikowski Group0 commands consist of one or more mutations and are supposed to be atomic - i.e. the data structures that reflect the group0 tables state are not supposed to be updated while only some mutations of a command are applied, the logic responsible for that is not supposed to observe an inconsistent state of group0 tables. It turns out that this assumption can be broken if a node crashes in the middle of applying a multi-mutation group0 command. Because these mutations are, in general, applied separately, only some mutations might survive a crash and a restart, so the group0 tables might be in an inconsistent state. The current logic of group0_state_machine will attempt to read the group0 tables' state as it was left after restart, so it may observe inconsistent state. This can confuse the node as it may observe a state that it was not supposed to observe, or the state will just outright break some invariants and trigger some sanity checks. One of those was observed in https://github.com/scylladb/scylladb/issues/26945, where a command from the CDC generation publisher fiber was partially applied. The fiber, in addition to publishing generations, it removes old, expired generations as well. Removal is done by removing data that describes the generation from cdc_generations_v3 and by removing the generation's ID from the committed generation list in the topology table. If only the first mutation gets through but not the other one, on reload the node will see a committed CDC generation without data, which will trigger an on_internal_error check. Fix this by delaying the moment when the in memory data structures are first loaded. In `579dcf187a`, a mechanism was introduced which persists the commit index before applying commands that are considered committed. Starting a raft server waits until commands are replayed up to that point. 
The fix is to start the group0_state_machine in a mode which only applies mutations - the aforementioned mechanism will re-apply the commands which will, thanks to the mutation idempotency, bring the group0 to a consistent state. After the group0 is known to be in consistent state (so, after raft::server_impl::start) the in-memory data structures of group0 are loaded for the first time. There is an exception, however: schema tables. Information about schema is actually loaded into memory earlier than the moment when group0 is started. Applying changes to schema is done through the migration manager module which compares the persisted state before and after the schema mutations are applied and acts on that. Refactoring migration manager is out of scope of this PR. However, this is not a problem because the migration manager takes care to apply all of the mutations given in a command in a single commitlog segment, so the initial schema loading code should not see an inconsistent state due to the state being partially applied. The fix is accompanied by a reproducer of scylladb/scylladb#26945. Fixes: scylladb/scylladb#26945 This is not a regression, so no need to backport. Closes scylladb/scylladb#27528 * github.com:scylladb/scylladb: test: cluster: test for recovery after partial group0 command group0_state_machine: remove obsolete comment about group0 consistency group0_state_machine: don't update in-memory state machine until start group0_state_machine: move reloading out of std::visit service: raft: add state machine ref to raft_server_for_group	2025-12-28 13:59:26 +02:00
Pavel Emelyanov	e963a8d603	checked-file: Implement experimental_list_directory() The method in question returns coroutine generator that co_yields directory_entry-s. In case the method is not implemented, seastar creates a fallback generator, that calls existing subscription-based list_directory() and co_yields them. And since checked file doesn't yet have it, fallback generator is used, thus skipping the lower file yielding lister. Not nice. This patch implements the generator lister for checked file, thus making full use of lower file generator lister too. A side note. It's not enough to implement it like return do_io_check([] { return lower_file->experimental_list_directory(); }); like list_directory() does, since io-checking will _not_ happen on directory reading itself, as it's supposed to. This is the problem of the check_file::list_directory() implementation -- it only checks for exception when creating the subscription (and it really never happens), but reading the directory itself happens without io checks. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27850	2025-12-28 13:37:44 +02:00
Yaron Kaikov	1ee89c9682	Revert "scripts: benign fixes flagged by CodeQL/PyLens" This reverts commit `377c3ac072`. This breaks all artifact tests and cloud image build process Closes scylladb/scylladb#27881	2025-12-28 09:49:49 +02:00
Ferenc Szili	6d3c720a08	test, load balancing: add test for table balance This change adds a boost test which validates the resulting table balance of size based load balancing. The threshold was set to a conservative 1.5 overcommit to avoid flakiness.	2025-12-27 11:39:08 +01:00
Ferenc Szili	b7ebd73e53	load_balancer: add cluster feature for size based balancing This patch adds a cluster feature size_based_load_balancing which, until enabled, will force capacity based balancing. This is needed because during rolling upgrades some of the nodes will have incomplete data in load_stats (missing tablet sizes and effective_capacity) which are needed for size based balancing to make good decisions and issue correct migrations.	2025-12-27 11:39:08 +01:00
Ferenc Szili	10eb364821	load_balancer: implement size-based load balancing This change introduces tablet size based load balancing. It is an extension of capacity based balancing with the addition of actual tablet sizes. It computes the difference between the most and least loaded nodes in the DC and stops further balancing if this difference is below the config option size_based_balance_threshold_percentage. This config option does not apply to the absolute load, but instead to the percentage of how much the most loaded node is more loaded than the least loaded node: delta = (most_loaded - least_loaded) / most_loaded If this delta is smaller than the config threshold, the balancer will consider the nodes balanced.	2025-12-27 11:20:20 +01:00
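The balance check in the commit above can be sketched in a few lines of Python. The function names are hypothetical, and two points are assumptions rather than facts from the patch: that each node's load is given as a single number, and that the threshold is expressed in percent and compared against `delta * 100`.

```python
def load_delta(node_loads):
    """Relative gap between the most and least loaded node, per the commit's formula."""
    most_loaded = max(node_loads)
    least_loaded = min(node_loads)
    if most_loaded == 0:
        # Assumption: an entirely empty balancing set counts as perfectly balanced.
        return 0.0
    return (most_loaded - least_loaded) / most_loaded


def is_balanced(node_loads, threshold_percentage):
    """True when the delta is smaller than the threshold (assumed to be in percent)."""
    return load_delta(node_loads) * 100 < threshold_percentage
```

For example, loads of 0.5 and 0.4 give a delta of about 0.2, so under these assumptions the nodes count as balanced for a threshold of 25 but not for a threshold of 10.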
Ferenc Szili	cc9e125f12	config: add size based load balancing config params This change adds: - The config parameter force_capacity_based_balancing which, when enabled, performs capacity based balancing instead of size based. - The config parameter size_based_balance_threshold_percentage which sets the balance threshold for the size based load balancer. - The config parameter minimal_tablet_size_for_balancing which sets the minimal tablet size for the load balancer.	2025-12-27 10:37:38 +01:00
Ferenc Szili	0c9b93905e	load_stats: use trinfo to decide how to reconcile tablet size This patch corrects the way update_load_stats_on_end_migration() decides which tablet transition occurred, in order to reconcile tablet sizes in load_stats. Before, the transition kind was inferred from the value of leaving and pending replicas. This patch changes this to use the value of trinfo.transition. In case of a rebuild, and in case there is only one replica, the new tablet size will be set to 0.	2025-12-27 10:37:38 +01:00
Ferenc Szili	621cb19045	load_sketch: use tablet sizes in load computation This commit changes load_sketch so that it computes node and shard load based on tablet sizes instead of tablet count.	2025-12-27 10:37:23 +01:00
Ferenc Szili	1c9ec9a76d	load_stats: add get_tablet_size_in_transition() This patch adds a method to load_stats which searches for the tablet size during tablet transition. In case of tablet migration, the tablet will be searched on the leaving replica, and during rebuild we will return the average tablet size of the pending replicas.	2025-12-27 10:37:23 +01:00
Pavel Emelyanov	bda1709734	Merge 'test: fix infinite loop in python log browsing code triggered from test_orphaned_sstables_on_startup' from Avi Kivity Recently, test/cluster/test_tablet.py::test_orphaned_sstables_on_startup started spinning in the log browsing code, part of the test library that looks into log files for expected or unexpected patterns. This reproduced somewhat in continuous integration, and very reliably for me locally. The test was introduced in `fa10b0b390`, a year ago. There are two bugs involved: first, that we're looking for crashes in this test, since in fact it is expected to crash. The node expectedly fails with an on_internal_error. Second, the log browsing code contains an infinite loop if the crash backtrace happens to be the last thing in the log. The series fixes both bugs. Fixes #27860. While the bad code exists in release branches, it doesn't trigger there so far, so best to only backport it if it starts manifesting there. Closes scylladb/scylladb#27879 * github.com:scylladb/scylladb: test: pylib: log_browsing: fix infinite loop in find_backtraces() test: pylib/log_browsing, cluster/test_tablets: don't look for expected crashes	2025-12-26 10:45:56 +03:00
Pavel Emelyanov	712cc8b8f1	sstables, storage: Drop unused bool classes and tags Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-26 09:47:27 +03:00
Pavel Emelyanov	9e189da23a	sstables/storage: Drop create_links_common() overloads There's a bunch of tagged create_links_common() overloads that call the most generic one with properly crafted arguments and the link_mode. Callers of those one-liners can craft the args themselves. As a result, there's only one create_links_common() overload and callers explicitly specifying what they want from it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-26 09:47:27 +03:00
Pavel Emelyanov	32cf358f44	sstable: Simplify storage::snapshot() Now there are only two callers left -- sstable::snapshot() and sstable::seal() that wants to auto-backup the sealed sstable. The snapshot arguments are: - relative path, use _base_dir - no new generation provided - no leave-unsealed tag With that, the implementation of filesystem_storage::snapshot() is as simple as - prepare full path relative to _base_dir - touch new directory - call create_links_common() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-26 09:47:27 +03:00
Pavel Emelyanov	8e496a2f2f	sstables: Introduce storage::clone() And call it from sstable::clone() instead of storage::snapshot(). The snapshot arguments are: - target directory is storage::prefix(), that's _dir itself - new generation is always provided, no need for optional - leave_unsealed bool flag With that, the implementation of filesystem_storage::clone() is as simple as call create_links_common() forwarding args and _dir to it. The unification of leave_unsealed branches will come a bit later making this code even shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-26 09:47:27 +03:00
Nadav Har'El	9c50d29a00	test/boost: fix flaky test_inject_future_disabled The test boost/error_injection_test.cc::test_inject_future_disabled checks what happens when a sleep injection is disabled: The test has a 10-millisecond-sleep injection and measures how long it takes. The test expects it to take less than 10 milliseconds - in fact it should take almost zero. But this is not guaranteed - on a slow debug build and an overcommitted server this do-nothing injection can take some time, and in one run (#27798) it took 14 milliseconds - and the test failed. The solution is easy - make the sleep-that-doesn't-happen much longer - e.g., 10 whole seconds. Since this sleep still doesn't happen, we expect the injection to return in less - much less - than 10 seconds. This 10 seconds is so ridiculously high we don't expect the do-nothing injection to take 10 seconds, not even on a ridiculously busy test machine. Fixes #27798 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27874	2025-12-25 20:46:31 +02:00
Avi Kivity	92996ce9fa	test: pylib: log_browsing: fix infinite loop in find_backtraces() The find_backtraces() function uses a very convoluted loop to read the log file. The loop fails to terminate if the last thing in the log file is the backtrace, since the loop termination condition (`not line`) continues to be true. It's not clear why this did not reliably hit before, but it now reliably reproduces for me on both x86 and aarch64. Perhaps timing changed, or perhaps previously we had more text on the log.	2025-12-25 20:22:17 +02:00
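The failure mode in the commit above (a read loop that never ends when the backtrace is the last thing in the log) can be avoided by letting end-of-file terminate the loop unconditionally. The sketch below is not the actual `find_backtraces()` from test/pylib; the function name `collect_backtraces` and the log format (a line containing `Backtrace:` followed by indented frame lines) are assumptions made for illustration.

```python
import io


def collect_backtraces(stream):
    """Group each 'Backtrace:' line with the indented frame lines that follow it."""
    backtraces = []
    current = None
    # Iterating over the stream stops at EOF, so there is no separate
    # termination condition that could stay unsatisfied forever.
    for line in stream:
        text = line.rstrip("\n")
        if "Backtrace:" in text:
            if current is not None:
                backtraces.append(current)
            current = [text]
        elif current is not None and text.startswith(" "):
            current.append(text)
        elif current is not None:
            backtraces.append(current)
            current = None
    # A backtrace that runs up to the end of the log is flushed here.
    if current is not None:
        backtraces.append(current)
    return backtraces


log = io.StringIO("INFO start\nBacktrace:\n  0x1\n  0x2\n")
result = collect_backtraces(log)
```

Here `result` holds a single backtrace even though the log ends in the middle of it.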
Avi Kivity	50a3460441	test: pylib/log_browsing, cluster/test_tablets: don't look for expected crashes test_tablets.test_orphaned_sstables_on_startup verifies that an on_internal_error("Unable to load SSTable...") is generated when an sstable outside a tablet boundary is found on startup. The test indeed finds the error, but then proceeds to hang in find_backtraces(), or fail if find_backtraces() is fixed, since it finds an unexpected (for it) crash. Fix this by not looking for crashes if a new option expected_crash is set. Set it for this test.	2025-12-25 20:22:17 +02:00
Avi Kivity	55c7bc746e	Revert "vector_search_validator: move high availability tests from vector-store.git" This reverts commit `caa0cbe328`. It is either extremely slow or broken. I was never able to get it to run on an r8gd.8xlarge (on the NVMe disk). Even when it passes, it is very slow. Test script: ``` git submodule update --recursive || exit 125 rm -rf build d() { ./tools/toolchain/dbuild -it -- "$@"; } d ./configure.py --mode release || exit 125 d ninja release-build || exit 125 d ./test.py --mode release ``` Ref #27858 Ref #27859 Ref #27860	2025-12-25 12:30:22 +00:00
Botond Dénes	ebb101f8ae	scylla-gdb.py: scylla small-objects: make freelist traversal more robust Traversing the span's freelist is known to generate "Cannot access memory at address ..." errors, which is especially annoying when it results in failed CI. Make this loop more robust: catch gdb.error coming from it and just log a warning that some listed objects in the span may be free ones. Fixes: #27681 Closes scylladb/scylladb#27805	2025-12-25 13:26:09 +03:00
Alex	f769e52877	test: boost: Fix flaky test_large_file_upload_s3 by creating induvidual files for testing During CI runs, multiple instances of the same test may execute concurrently. Although the test uses a temporary directory, the downloaded bucket artifacts were written using an identical filename across all instances. This caused concurrent writers to operate on the same file, leading to file corruption. In some cases, this manifested as test failures and intermittent std::bad_alloc exceptions. Change Description This change ensures that each test instance uses a unique filename for downloaded bucket files. By isolating file writes per test execution, concurrent runs no longer interfere with each other. Fixes: #27824 backport not required Closes scylladb/scylladb#27843	2025-12-25 09:40:13 +02:00
Benny Halevy	51433b838a	table: reduce allocations by using for_each_compaction_group rather than compaction_groups() compaction_groups_immediate() may allocate memory, but when called from a synchronous call site, the caller can use for_each_compaction_group instead to iterate over the compaction groups with no extra allocations. Calling compaction_groups_immediate() is still required from an async context when we want to "sample" the compaction groups so we can safely iterate over them and yield in the inner loop. Also, some performance insensitive call sites using compaction_groups_immediate had been left as they are to keep them simple. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-24 21:19:28 +02:00
Benny Halevy	0e27ee67d2	replica: storage_group: rename compaction_groups to compaction_groups_immediate To better reflect that it returns a materialized vector of compaction_group ptrs. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-24 21:19:26 +02:00
Nadav Har'El	186c91233b	Merge 'scylla-gdb.py: improve scylla fiber and scylla read-stats' from Botond Dénes Improve scylla fiber's ability to traverse through coroutines. Add --direction command-line parameter to scylla-fiber. Fix out-of-date permit collection in scylla read-stats and improve the printout. scylla-gdb.py improvements, no backport needed Closes scylladb/scylladb#27766 * github.com:scylladb/scylladb: scylla-gdb.py: scylla read-stats: include all permit lists scylla-gdb.py: scylla fiber: add --direction command-line param scylla-gdb.py: scylla fiber: add support for traversing through coroutines backward	2025-12-24 17:49:58 +02:00
Botond Dénes	27bf65e77a	db/batchlog_manager: add missing <seastar/coroutine/parallel_for_each.hh> include Build only fails if `--disable-precompiled-header` is passed to `configure.py`. Not sure why. Closes scylladb/scylladb#27721	2025-12-24 16:32:12 +02:00
Botond Dénes	c66275e05c	cql3/statements/batch_statement: make size error message more verbose Mention the type of batch: Logged or Unlogged. The size (warn/fail on too large size) error has different significance depending on the type. Refs: #27605 Closes scylladb/scylladb#27664	2025-12-24 15:27:01 +02:00
Piotr Szymaniak	9c5b4e74c3	doc: Correct reference in dev/audit.md Closes scylladb/scylladb#27832	2025-12-24 15:25:15 +02:00
Botond Dénes	ccc03d0026	test/pylib/runner.py: pytest_configure(): coerce repeat to int Coerce the return value of config.getoption("--repeat") to int to avoid: Traceback (most recent call last): File "/usr/bin/pytest", line 8, in <module> sys.exit(console_main()) ~~~~~~~~~~~~^^ File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 201, in console_main code = main() File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 175, in main ret: ExitCode \| int = config.hook.pytest_cmdline_main(config=config) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^ File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 512, in __call__ return self._hookexec(self.name, self._hookimpls.copy(), kwargs, firstresult) ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec return self._inner_hookexec(hook_name, methods, kwargs, firstresult) ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall raise exception File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall res = hook_impl.function(args) File "/usr/lib/python3.14/site-packages/_pytest/helpconfig.py", line 154, in pytest_cmdline_main config._do_configure() ~~~~~~~~~~~~~~~~~~~~^^ File "/usr/lib/python3.14/site-packages/_pytest/config/__init__.py", line 1118, in _do_configure self.hook.pytest_configure.call_historic(kwargs=dict(config=self)) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/usr/lib/python3.14/site-packages/pluggy/_hooks.py", line 534, in call_historic res = self._hookexec(self.name, self._hookimpls.copy(), kwargs, False) File "/usr/lib/python3.14/site-packages/pluggy/_manager.py", line 120, in _hookexec return self._inner_hookexec(hook_name, methods, kwargs, firstresult) ~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ 
File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 167, in _multicall raise exception File "/usr/lib/python3.14/site-packages/pluggy/_callers.py", line 121, in _multicall res = hook_impl.function(args) File "/home/bdenes/ScyllaDB/scylladb/scylladb/test/pylib/runner.py", line 206, in pytest_configure config.run_ids = tuple(range(1, config.getoption("--repeat") + 1)) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~ TypeError: can only concatenate str (not "int") to str Closes scylladb/scylladb#27649	2025-12-24 15:13:02 +02:00
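The `TypeError` at the end of the traceback above comes from adding `1` to an option value that arrived as a string. A minimal sketch of the coercion follows; the helper name `run_ids_from_option` is hypothetical, and only the `tuple(range(1, ... + 1))` expression is taken from the traceback.

```python
def run_ids_from_option(repeat):
    """Build run ids 1..N from a --repeat value that may be an int or a numeric string."""
    # int() accepts both forms, so the "+ 1" is always integer arithmetic.
    return tuple(range(1, int(repeat) + 1))
```

With the coercion in place, a `--repeat` value delivered as the string `"3"` and one delivered as the integer `3` yield the same run ids.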
Nadav Har'El	8df5189f9c	Merge 'docs: scylla-sstable.rst: extract script API to separate document' from Botond Dénes The script API is 500+ lines long in an already too long and hard to navigate document. Extract it to a separate document, making both documents shorter and easier to navigate. Documentation refactoring, no backport needed. Closes scylladb/scylladb#27609 * github.com:scylladb/scylladb: docs: scylla-sstable-script-api.rst: add introduction and title docs: scylla-sstable.rst: extract script API to separate document docs: scylla-sstable: prepare for script API extract	2025-12-24 15:02:57 +02:00
Botond Dénes	b036a461b7	tools/scylla-sstable: dump-schema: incude UDT description in dump If the table uses UDTs, include the description of these (CREATE TYPE statement) in the schema dump. Without these the schema is not useful. Closes scylladb/scylladb#27559	2025-12-24 14:46:52 +02:00
Botond Dénes	3071ccd54a	Merge 'Storage-agnostic table::snapshot_on_all_shards()' from Pavel Emelyanov The method in question knows that it writes snapshot to local filesystem and uses this actively. This PR relaxes this knowledge and splits the logic into two parts -- one that orchestrates sstables snapshot and collects the necessary metadata, and the code that writes the metadata itself. Closes scylladb/scylladb#27762 * github.com:scylladb/scylladb: table: Move snapshot_file_set to table.cc table: Rename and move snapshot_on_all_shards() method table: Ditch jsondir variable table, sstables: Pass snapshot name to sstable::snapshot() table: Use snapshot_writer in write_manifest() table: Use snapshot_writer in write_schema_as_cql() table: Add snapshot_writer::sync() table: Add snapshot_writer::init() table: Introduce snapshot_writer table: Move final sync and rename seal_snapshot() table: Hide write_schema_as_cql() table: Hide table::seal_snapshot() table: Open-code finalize_snapshot() table: Fix indentation after previuous patch table: Use smp::invoke_on_all() to populate the vector with filenames table: Don't touch dir once more on seal_snapshot() table: Open-code table::take_snapshot() into caller lambda table: Move parts of table::take_snapshot to sstables_manager table: Introduce table::take_snapshot() table: Store the result of smp::submit_to in local variable	2025-12-24 13:46:47 +02:00
Nadav Har'El	4ae45eb367	test/alternator: remove unused imports Remove many unused "import" statements or parts of import statement. All of them were detected by Copilot, but I verified each one manually and prepared this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27676	2025-12-24 13:44:28 +02:00
Nadav Har'El	da00401b7d	test/alternator: rename test with duplicate name The file test/alternator/test_transact.py accidentally had two tests with the same name, test_transact_get_items_projection_expression. This means the first of the two tests was ignored and never run. This patch renames the second of the two to a more appropriate (and unique...) name. I verified that after this change the number of tests in this file grows by one, and that still all tests pass on DynamoDB and fail (as expected by xfail) on Alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27702	2025-12-24 13:43:43 +02:00
Botond Dénes	95d4c73eb1	Merge 'Make object storage config truly updateable' from Pavel Emelyanov The db::config::object_storage_endpoints parameter is live-updateable, but when the update really happens, the new endpoints may fail to propagate to non-zero shards because of the way db::config sharding is implemented. Refs: #7316 Fixes: #26509 Backport to 2025.3 and 2025.4, AFAIK there are set ups with object storage configs for native backup Closes scylladb/scylladb#27689 * github.com:scylladb/scylladb: sstables/storage_manager: Fix configured endpoints observer test/object_store: Add test to validate how endpoint config update works	2025-12-24 13:42:44 +02:00
Botond Dénes	12dcf79c60	Merge 'build: support (and prefer) sccache as the compiler cache' from Avi Kivity Currently, we support ccache as the compiler cache. Since it is transparent, nothing much is needed to support it. This series adds support to sccache[1] and prefers it over ccache when it is installed. sccache brings the following benefits over ccache: 1. Integrated distributed build support similar to distcc, but with automatic toolchain packaging and a scheduler 2. Rust support 3. C++20 modules (upcoming[2]) It is the C++20 modules support that motivates the series. C++20 modules have the potential to reduce build times, but without a compiler cache and distributed build support, they come with too large a penalty. This removes the penalty. The series detects that sccache is installed, selects it if so (and if not overridden by a new option), enables it for C++ and Rust, and disables ccache transparent caching if sccache is selected. Note: this series doesn't add sccache to the frozen toolchain or add dbuild support. That is left for later. [1] https://github.com/mozilla/sccache [2] https://github.com/mozilla/sccache/pull/2516 Toolchain improvement, won't be backported. Closes scylladb/scylladb#27834 * github.com:scylladb/scylladb: build: apply sccache to rust builds too build: prevent double caching by compiler cache build: allow selecting compiler cache, including sccache	2025-12-24 13:40:02 +02:00
Nadav Har'El	74a57d2872	test/cqlpy: remove unused imports Remove many unused "import" statements or parts of import statement. All of them were detected by Copilot, but I verified each one manually and prepared this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27675	2025-12-24 13:31:41 +02:00
Andrzej Jackowski	632ff66897	doc: audit: mention double audit sink in Enabling Audit section Configuration of both table and syslog audit is possible since scylladb/scylladb#26613 was implemented. However, the "Enabling Audit" section of the documentation wasn't updated, which can be misleading. Ref: scylladb/scylladb#26613 Closes scylladb/scylladb#27790	2025-12-24 13:20:03 +02:00
Gleb Natapov	04976875cc	topology coordinator: set session id for streaming at the correct time Commit `d3efb3ab6f` added streaming session for rebuild, but it set the session at request submission time. The session should be set when the request starts executing, so this patch moves it to the correct place. Closes scylladb/scylladb#27757	2025-12-24 13:17:53 +02:00
Yaniv Michael Kaul	377c3ac072	scripts: benign fixes flagged by CodeQL/PyLens Unused imports, unused variables and such. No functional changes, just to get rid of some standard CodeQL warnings. Benign - no need to backport. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#27801	2025-12-24 13:08:24 +02:00
Avi Kivity	d6edad4117	test: pylib: resource_gather: don't take ownership of /sys/fs/cgroup under podman Under podman, we already own /sys/fs/cgroup. Run the chown command only under docker where the container does not map the host user to the container root user. The chown process is sometimes observed to fail with EPERM (see issue). But it's not needed, so avoid it. Fixes #27837. Closes scylladb/scylladb#27842	2025-12-24 10:56:24 +02:00
Marcin Maliszkiewicz	3c1e1f867d	raft: auth: add semaphore to auth_cache::load_all Auth cache loading at startup is racing between auth service and raft code and it doesn't support concurrency causing it to crash. We can't easily remove any of the places as during raft recovery snapshot is not loaded and it relies on loading cache via auth service. Therefore we add semaphore. Fixes https://github.com/scylladb/scylladb/issues/27540 Closes scylladb/scylladb#27573	2025-12-24 10:56:24 +02:00
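The serialization described in the commit above can be illustrated with a small Python model. This is only a sketch under stated assumptions: `run_concurrent_loads` and `load_all` are hypothetical stand-ins, and the real fix lives in the C++ `auth_cache::load_all` code, not in asyncio. With a limit of 1 the two loads never overlap; with a limit of 2 they do.

```python
import asyncio


async def run_concurrent_loads(limit):
    """Start two loads at once and return the peak number running together.

    Hypothetical model: `limit` is the semaphore's initial count.
    """
    sem = asyncio.Semaphore(limit)
    active = 0
    peak = 0

    async def load_all():
        nonlocal active, peak
        async with sem:
            active += 1
            peak = max(peak, active)
            await asyncio.sleep(0)  # yield, as a real load would while doing I/O
            active -= 1

    await asyncio.gather(load_all(), load_all())
    return peak


if __name__ == "__main__":
    print(asyncio.run(run_concurrent_loads(1)))  # loads are serialized
    print(asyncio.run(run_concurrent_loads(2)))  # loads may overlap
```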
Nadav Har'El	f3a4af199f	test/cqlpy/test_materialized_view.py: Fix for Commented-out code This patch was suggested and prepared by copilot, I am writing the commit message because the original one was worthless. In commit `cf138da`, for an unexplained reason, a loop waiting until the expected value appears in a materialized view was replaced by a call to wait_for_view_built(). The old loop code was left behind in a comment, and this commented-out code is now bothering our AI. So let's delete the commented-out code. Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27646	2025-12-24 10:56:23 +02:00
Botond Dénes	1bb897c7ca	Merge 'Unify configuration of object storage endpoints' from Pavel Emelyanov To configure S3 storage, one needs to do ``` object_storage_endpoints: - name: s3.us-east-1.amazonaws.com port: 443 https: true aws_region: us-east-1 ``` and for GCS it's ``` object_storage_endpoints: - name: https://storage.googleapis.com:433 type: gs credentials_file: <gcp account credentials json file> ``` This PR updates the S3 part to look like ``` object_storage_endpoints: - name: https://s3.us-east-1.amazonaws.com:443 aws_region: us-east-1 ``` fixes: #26570 Not-yet released feature, no need to backport. Old configs are not accepted any longer. If it's needed, then this decision needs to be revised. Closes scylladb/scylladb#27360 * github.com:scylladb/scylladb: object_storage: Temporarily handle pure endpoint addresses as endpoints code: Remove dangling mentions of s3::endpoint_config docs: Update docs according to new endpoints config option format object_storage: Create s3 client with "extended" endpoint name test: Add named constants for test_get_object_store_endpoints endpoint names s3/storage: Tune config updating sstable: Shuffle args for s3_client_wrapper	2025-12-24 06:59:02 +02:00
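To see how the unified `name` form relates to the older separate `name`/`port`/`https` fields, the URL can be split with Python's standard library. This is a sketch for illustration only: `split_endpoint` is a hypothetical helper and says nothing about how ScyllaDB itself parses the option.

```python
from urllib.parse import urlsplit


def split_endpoint(name):
    """Split a unified endpoint name into the pieces the old format listed separately."""
    parts = urlsplit(name)
    return {
        "https": parts.scheme == "https",
        "host": parts.hostname,
        "port": parts.port,
    }


if __name__ == "__main__":
    print(split_endpoint("https://s3.us-east-1.amazonaws.com:443"))
```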
Botond Dénes	954f2cbd2f	Merge 'config, transport: add listeners for native protocol fronted by proxy protocol v2' from Avi Kivity For deployments fronted by a reverse proxy (haproxy or privatelink), we want to use proxy protocol v2 so that client information in system.clients is correct and so that the shard-aware selection protocol, which depends on the source port, works correctly. Add proxy-protocol enabled variants of each of the existing native transport listeners. Tests are added to verify this works. I also manually tested with haproxy. New feature, no backport. Closes scylladb/scylladb#27522 * github.com:scylladb/scylladb: test: add proxy protocol tests config, transport: support proxy protocol v2 enhanced connections	2025-12-24 06:58:00 +02:00
Nadav Har'El	e75c75f8cd	test/cqlpy: fix two tests that couldn't fail because of typo As noticed by copilot, two tests in test_guardrail_compact_storage.py could never fail, because they used `pytest.fail` instead of the correct `pytest.fail()` to fail. Unfortunately, Python has a footgun where if it sees a bare function name without parenthesis, instead of complaining it evaluates the function object and then ignores it, and absolutely nothing happens. So let's add the missing `()`. The test still passes, but now it at least has a chance of failing if we have a regression. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27658	2025-12-24 06:49:54 +02:00
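The footgun fixed in the commit above is easy to reproduce without pytest. In this sketch, `fail` is a hypothetical stand-in for `pytest.fail` and the two check functions are illustrative only. The bare name is a valid expression statement, so the first check can never fail.

```python
def fail(msg="failed"):
    """Stand-in for pytest.fail(): raises to mark the test as failed."""
    raise AssertionError(msg)


def check_without_call(value):
    if value != 42:
        fail  # bare name: the function object is evaluated and discarded
    return "passed"


def check_with_call(value):
    if value != 42:
        fail("unexpected value")  # actually raises
    return "passed"


if __name__ == "__main__":
    print(check_without_call(0))  # the broken check still reports success
    try:
        check_with_call(0)
    except AssertionError as exc:
        print("failed as expected:", exc)
```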
Yaron Kaikov	d671ca9f53	fix: remove return from finally block in s3_proxy.py during any jenkins job that trigger `test.py` we get: ``` /jenkins/workspace/releng-testing/byo/byo_build_tests_dtest/scylla/test/pylib/s3_proxy.py:152: SyntaxWarning: 'return' in a 'finally' block ``` The 'return' statement in the finally block was causing a SyntaxWarning. Moving the return outside the finally block ensures proper exception handling while maintaining the intended behavior. Closes scylladb/scylladb#27823	2025-12-24 06:48:03 +02:00
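Why a `return` inside `finally` is worth warning about can be shown in a few lines. This is a generic sketch, not the code from s3_proxy.py: both function names are hypothetical. Recent Python versions emit the SyntaxWarning quoted above when compiling the first function.

```python
def parse_with_return_in_finally(text):
    try:
        return int(text)
    finally:
        # Overrides the value returned above and silently discards
        # any exception raised in the try block.
        return -1


def parse_with_return_after(text):
    result = -1
    try:
        result = int(text)
    finally:
        pass  # cleanup only; no return here
    return result


if __name__ == "__main__":
    print(parse_with_return_in_finally("not a number"))  # error is swallowed
    print(parse_with_return_after("7"))
```

With the `return` moved after the `try`/`finally`, a failing `int()` call propagates its `ValueError` to the caller instead of being hidden.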
Avi Kivity	fc81983d42	test: sstable_validation_test: actually test `ms` version sstable_validation_test tests the `scylla sstable validate` command by passing it intentionally corrupted sstables. It uses an sstable cache to avoid re-creating the same sstables. However, the cache does not consider the sstable version, so if called twice with the same inputs for different versions, it will return an sstable with the original version for both calls. As a result, `ms` sstables were not tested. Fix this bug by adding the sstable version (and the schema for good measure) to the cache key. An additional bug, hidden by the first, was that we corrupted the sstable by overwriting its Index.db component. But `ms` sstables don't have an Index.db component, they have a Partitions.db component. Adjust the corrupting code to take that into account. With these two fixes, test_scylla_sstable_validate_mismatching_partition_large fails on `ms` sstables. Disable it for that version. Since it was previously practically untested, we're not losing any coverage. Fixing this test unblocks further work on making pytest take charge of running the tests. pytest exposed this problem, likely by running it on different runners (and thus reducing the effectiveness of the cache). Fixes #27822. Closes scylladb/scylladb#27825	2025-12-24 06:47:31 +02:00
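The cache-key bug described above can be modelled with a plain dictionary. This is a minimal sketch, assuming a string stands in for a generated sstable; `make_cache`, `buggy`, and `fixed` are hypothetical names and not the test's actual helpers.

```python
def make_cache(key_fn):
    """Return a cached 'sstable' factory whose cache key is computed by key_fn."""
    cache = {}

    def get_sstable(inputs, version):
        key = key_fn(inputs, version)
        if key not in cache:
            # Placeholder for the expensive sstable generation step.
            cache[key] = f"sstable({inputs}, version={version})"
        return cache[key]

    return get_sstable


# Buggy: the version is not part of the key, so the first version wins.
buggy = make_cache(lambda inputs, version: inputs)
# Fixed: the version participates in the key.
fixed = make_cache(lambda inputs, version: (inputs, version))

if __name__ == "__main__":
    buggy("rows", "me")
    print(buggy("rows", "ms"))  # still the `me` sstable
    fixed("rows", "me")
    print(fixed("rows", "ms"))  # a distinct `ms` sstable
```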
Botond Dénes	cf70250a5c	Update seastar submodule * seastar 7ec14e83...f0298e40 (8): > Merge 'coroutine/try_future: call set_current_task() when resuming the coroutine' from Botond Dénes coroutine/try_future: call set_current_task() when resuming the coroutine core: move set_current_task() out-of-line > stop_signal: stop including reactor.hh > cmake: Mark hwloc headers as system includes to suppress warnings > build: explicitly enable vptr sanitizer > httpd: Add API to set tcp keepalive params > Merge 'Make datagram_channel::send() use temporary_buffer-s' from Pavel Emelyanov net: Remove no longer used to_iovec() helpers net,code: Update callers to use new datagram_channel::send() net: Introduce datagram_channel::send(span<temporary_buffer>) method posix-stack: Make UDP socket implementation use wrapped_iovec posix-stack: Introduce wrapped_iovec > code: Move pollable_fd_state::write_all(const char*) from API level 9 > thread: Remove unused sched_group() helper configure.py: added -lubsan to DEBUG sanitizer flags Closes scylladb/scylladb#27511	2025-12-24 06:46:36 +02:00
Nadav Har'El	54f3e69fdc	Fix for Statement has no effect This problem and its fix was suggested by copilot, I'm just writing the cover letter. test/nodetool/test_status.py has the silly statement tokens == "?" which has no effect. Looking around the code suggested to me (and also to Copilot, nice) that the correct intent was assert tokens == "?" and not, say, tokens = "?". Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27659	2025-12-24 06:43:26 +02:00
Piotr Dulikowski	9ed820cbf5	test: cluster: test for recovery after partial group0 command Add a reproducer for scylladb/scylladb#26945. By using error injections, the test triggers a situation where a command that removes an obsolete CDC generation is partially applied, then the node is killed and brought back. Thanks to the fix, restarting the node succeeds and does not trigger any consistency checks in the group0 reload logic.	2025-12-23 20:50:43 +01:00
Piotr Dulikowski	71bc1886ee	group0_state_machine: remove obsolete comment about group0 consistency The comment is outdated. It is concerned about group0 consistency after crash, and that re-applying committed commands may require a raft quorum. First, `579dcf1` was introduced (long ago) which gets rid of the need for quorum as the node persists the commit index before applying the commands - so it knows up to which command it should re-apply on restart. Second, the preceding commits in this PR make use of this mechanism for group0. Remove the comment as the concern was fully addressed. Additionally, remove a mention of the comment in raft_group0_client.cc - although it claims that the comment is placed in `group0_state_machine::apply`, it has been moved to `merge_and_apply` in `96c6e0d` (both comments were originally introduced in `6a00e79`).	2025-12-23 20:44:17 +01:00
Piotr Dulikowski	b24001b5e7	group0_state_machine: don't update in-memory state machine until start Group0 commands consist of one or more mutations and are supposed to be atomic - i.e. the data structures that reflect the group0 tables state are not supposed to be updated while only some mutations of a command are applied, the logic responsible for that is not supposed to observe an inconsistent state of group0 tables. It turns out that this assumption can be broken if a node crashes in the middle of applying a multi-mutation group0 command. Because these mutations are, in general, applied separately, only some mutations might survive a crash and a restart, so the group0 tables might be in an inconsistent state. The current logic of group0_state_machine will attempt to read the group0 tables' state as it was left after restart, so it may observe inconsistent state. This can confuse the node as it may observe a state that it was not supposed to observe, or the state will just outright break some invariants and trigger some sanity checks. One of those was observed in scylladb/scylladb#26945, where a command from the CDC generation publisher fiber was partially applied. The fiber, in addition to publishing generations, it removes old, expired generations as well. Removal is done by removing data that describes the generation from cdc_generations_v3 and by removing the generation's ID from the committed generation list in the topology table. If only the first mutation gets through but not the other one, on reload the node will see a committed CDC generation without data, which will trigger an on_internal_error check. Fix this by delaying the moment when the in memory data structures are first loaded. In `579dcf1`, a mechanism was introduced which persists the commit index before applying commands that are considered committed. Starting a raft server waits until commands are replayed up to that point. 
The fix is to start the group0_state_machine in a mode which only applies mutations - the aforementioned mechanism will re-apply the commands which will, thanks to the mutation idempotency, bring the group0 to a consistent state. After the group0 is known to be in consistent state (so, after raft::server_impl::start) the in-memory data structures of group0 are loaded for the first time. There is an exception, however: schema tables. Information about schema is actually loaded into memory earlier than the moment when group0 is started. Applying changes to schema is done through the migration manager module which compares the persisted state before and after the schema mutations are applied and acts on that. Refactoring migration manager is out of scope of this PR. However, this is not a problem because the migration manager takes care to apply all of the mutations given in a command in a single commitlog segment, so the initial schema loading code should not see an inconsistent state due to the state being partially applied. Fixes: scylladb/scylladb#26945	2025-12-23 20:44:16 +01:00
Piotr Dulikowski	f4efdf18a5	group0_state_machine: move reloading out of std::visit In the next commit, we will adjust the logic so that it only reloads in memory state only when a flag is set. By moving the reload logic to one place in `merge_and_apply`, the next commit will be able to reach its goal by only adding a single `if`.	2025-12-23 20:44:16 +01:00
Piotr Dulikowski	6bdbd91cf7	service: raft: add state machine ref to raft_server_for_group This reference will be used by the code that starts group0. It will manually enable the in-memory state machine only after the group0 server is fully started, which entails replaying the group0 commands that are, locally, seen as committed - in order to repair any inconsistencies that might have arisen due to some commands being applied only partially (e.g. due to a crash).	2025-12-23 20:44:16 +01:00
Michał Hudobski	ce3320a3ff	auth: add system table permissions to VECTOR_SEARCH_INDEXING Due to the recent changes in the vector store service, the service needs to read two of the system tables to function correctly. This was not accounted for when the new permission was added. This patch fixes that by allowing these tables (group0_history and versions) to be read with the VECTOR_SEARCH_INDEXING permission. We also add a test that validates this behavior. Fixes: SCYLLADB-73 Closes scylladb/scylladb#27546	2025-12-23 15:53:07 +02:00
Pawel Pery	caa0cbe328	vector_search_validator: move high availability tests from vector-store.git Initially, tests for high availability were implemented in the vector-store.git repository. High availability is currently implemented in the scylladb.git repository, so this repository is the better place to store them. This commit copies these tests into scylladb.git. The commit copies validator-vector-store/src/high_availability.rs (tests logic) and validator-tests/src/common.rs (utils for tests) into the local crate validator-scylla. common.rs is copied so that reviewers can see the common test code, and because this code will most likely change frequently, which would make it hard to maintain one common version between two repositories. The commit also updates the README for vector_search_validator; it briefly describes the validator modules. The commit updates the reference to the latest vector-store.git master. As a next step, on the vector-store.git side, high_availability.rs would be removed and common.rs moved from validator-tests into validator-vector-store. References: VECTOR-394 Closes scylladb/scylladb#27499	2025-12-23 15:53:07 +02:00
Yaron Kaikov	bad2fe72b6	.github/workflows: Add email validator workflow This workflow validates that all commits in a pull request use email addresses ending in @scylladb.com. For each commit with an author or committer email that doesn't match this pattern, the workflow automatically adds a comment to the pull request with a warning. This serves two purposes: 1. Alert maintainers when external contributors submit code (which is acceptable, but good to be aware of) 2. Help ScyllaDB developers catch cases where they haven't configured their git email correctly When a non-@scylladb.com email is detected, the workflow posts this comment on the pull request: ``` ⚠️ Non-@scylladb.com Email Addresses Detected Found commit(s) with author or committer emails that don't end with @scylladb.com. This indicates either: - An external contributor (acceptable, but maintainer should be aware) - A developer who hasn't configured their git email correctly For ScyllaDB developers: If you're a ScyllaDB employee, please configure your git email globally: git config --global user.email "your.name@scylladb.com" If only your most recent commit is invalid, you can amend it: git commit --amend --reset-author --no-edit git push --force If you have multiple invalid commits, you need to rewrite them all: git rebase -i <base-commit> # Mark each invalid commit as 'edit', then for each: git commit --amend --reset-author --no-edit git rebase --continue # Repeat for each invalid commit git push --force ``` Fixes: https://scylladb.atlassian.net/browse/RELENG-35 Closes scylladb/scylladb#27796	2025-12-23 15:53:06 +02:00
Pavel Emelyanov	ec15a1b602	table: Do not move foreign string when writing snapshot The table::seal_snapshot() accepts a vector of sstables filenames and writes them into manifest file. For that, it iterates over the vector and moves all filenames from it into the streamer object. The problem is that the vector contains foreign pointers on sets with sstrings. Not only sets are foreign, sstrings in it are foreign too. It's not correct to std::move() them to local CPU. The fix is to make streamer object work on string_view-s and populate it with non-owning references to the sstrings from aforementioned sets. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27755	2025-12-23 15:53:06 +02:00
Pavel Emelyanov	ecef158345	api: Use ranges library to process views in get_built_indexes() No functional changes, just make the loop shorter and more self-contained. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27742	2025-12-23 15:53:06 +02:00
Israel Fruchter	53abf93bd8	Update tools/cqlsh submodule * tools/cqlsh 22401228...9e5a91d7 (7): > Add pip retry configuration to handle network timeouts > Clean up unwanted build artifacts and update .gitignore > test_legacy_auth: update to pytest format > Add support for disabling compression via CLI and cqlshrc > Update scylla-driver to 3.29.7 (#144) > Update scylla-driver version to 3.29.6 > Revert "Migrate workflows to Blacksmith" Closes scylladb/scylladb#27567 [avi: build optimized clang 21.1.7 regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-21.1.7-Fedora-43-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-21.1.7-Fedora-43-x86_64.tar.gz ] Closes scylladb/scylladb#27734	2025-12-23 15:53:06 +02:00
Aleksandra Martyniuk	bbe64e0e2a	test: rename duplicate tests There are two tests named test_repair_options_hosts_tablets in test/nodetool/test_cluster_repair.py and two tests named test_repair_keyspace in test/nodetool/test_repair.py. Due to that one of each pair is ignored. Rename the tests so that they are unique. Fixes: https://github.com/scylladb/scylladb/issues/27701. Closes scylladb/scylladb#27720	2025-12-23 15:53:06 +02:00
Anna Stuchlik	7198191aa9	doc: fix the license information on DockerHub This commit removes the OSS-related information from DockerHub. It adds the link to the Source Available license. Fixes https://github.com/scylladb/scylladb/issues/22440 Closes scylladb/scylladb#27706	2025-12-23 15:53:06 +02:00
Calle Wilund	d5f72cd5fc	test::pylib::encryption_provider: Push up setting system_key_directory to all providers Fixes #27694 Unless set by config, the location will default to /etc/scylla, which is not a good place to write things for tests. Push the config properly and the directory (but _not_ creation) to all provider basetype. Closes scylladb/scylladb#27696	2025-12-23 15:53:06 +02:00
Dawid Mędrek	afde5f668a	test: Implement describing Boost tests in JSON format The Boost.Test framework offers a way to describe tests written in it by running them with the option `--list_content`. It can be parametrized by either HRF (Human Readable Format) or DOT (the Graphviz graph format) [1]. Thanks to that, we can learn the test tree structure and collect additional information about the tests (e.g. labels [2]). We currently employ that feature of the framework to collect and run Boost tests in Scylla. Unfortunately, both formats have their shortcomings: * HRF: the format is simple to parse, but it doesn't contain all relevant information, e.g. labels. * DOT: the format is designed for creating graphical visualizations, and it's relatively difficult to parse. To amend those problems, we implement a custom extension of the feature. It produces output in the JSON format and contains more than the most basic information about the tests; at the same time, it's easy to browse and parse. To obtain that output, the user needs to call a Boost.Test executable with the option `--list_json_content`. For example: ``` $ ./path/to/test/exec -- --list_json_content ``` Note that the argument should be prepended with a `--` to indicate that it targets user code, not Boost.Test itself. --- The structure of the new format looks like this (top-level downwards): - File name - Test suite(s) & free test cases - Test cases wrapped in test suites Note that it's different from the output the default Boost.Test formats produce: they organize information within test suites, which can potentially span multiple files [3]. The JSON format makes test files the primary object of interest and test suites from different files are always considered distinct.
Example of the output (after applying some formatting): ``` $ ./build/dev/test/boost/canonical_mutation_test -- --list_json_content [{"file":"test/boost/canonical_mutation_test.cc", "content": { "suites": [], "tests": [ {"name": "test_conversion_back_and_forth", "labels": ""}, {"name": "test_reading_with_different_schemas", "labels": ""} ] }}] ``` --- The implementation may be seen as a bit ugly, and it's effectively a hack. It's based on registering a global fixture [4] and linking that code to every Boost.Test executable. Unfortunately, there doesn't seem to be any better way. That would require more extensive changes in the test files (e.g. enforcing going through the same entry point in all of them). This implementation is a compromise between simplicity and effectiveness. The changes are kept minimal, while the developers writing new tests shouldn't need to remember to do anything special. Everything should work out of the box (at least as long as there's no non-trivial linking involved). Fixes scylladb/scylladb#25415 --- References: [1] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/utf_reference/rt_param_reference/list_content.html [2] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/tests_grouping.html [3] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/test_tree/test_suite.html [4] https://www.boost.org/doc/libs/1_89_0/libs/test/doc/html/boost_test/tests_organization/fixtures/global.html Closes scylladb/scylladb#27527	2025-12-23 15:53:06 +02:00
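As a rough illustration of how a consumer might read the `--list_json_content` output, the sample shown in the commit message above can be parsed with Python's `json` module. This is a sketch under stated assumptions: `list_free_tests` is a hypothetical helper, and it reads only the top-level `tests` list because the sample does not show what an entry of `suites` looks like.

```python
import json

# The sample output quoted in the commit message, reformatted onto several lines.
SAMPLE = """
[{"file": "test/boost/canonical_mutation_test.cc",
  "content": {
    "suites": [],
    "tests": [
      {"name": "test_conversion_back_and_forth", "labels": ""},
      {"name": "test_reading_with_different_schemas", "labels": ""}
    ]
  }}]
"""


def list_free_tests(text):
    """Return (file, test name) pairs for test cases not wrapped in a suite."""
    pairs = []
    for entry in json.loads(text):
        for test in entry["content"]["tests"]:
            pairs.append((entry["file"], test["name"]))
    return pairs


if __name__ == "__main__":
    for file_name, test_name in list_free_tests(SAMPLE):
        print(f"{file_name}::{test_name}")
```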
Asias He	140858fc22	tablet-mon.py: Add repair support Add repair support in the tablet monitor. Fixes #24824 Closes scylladb/scylladb#27400	2025-12-23 15:53:06 +02:00
Pavel Emelyanov	132aa753da	sstables/storage_manager: Fix configured endpoints observer On start the manager creates observer for object_storage_endpoints config parameter. The goal is to refresh the maintained set of endpoint parameters and client upon config change. The observer is created on shard 0 only, and when kicked it calls manager.invoke-on-all to update manager on all shards. However, there's a race here. The thing is that db::config values are implicitly "sharded" under the hood with the help of plain array. When any code tries to read a value from db::config::something, the reading code secretly gets the value from this inner array indexed by the current shard id. Next, when the config is updated, it first assigns new values to [0] element of the hidden array, then calls broadcast_to_all_shards() helper that copies the values from zeroth slot to all the others. But the manager's observer is triggered when the new value is assigned on zero index, and if the invoke-on-all lambda (mentioned above) happens to be faster than broadcast_to_all_shards, the non-zero shards will read old values from db::config's inner array. The fix is to instantiate observer on all shards and update only local shard, whenever this update is triggered. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:43:11 +03:00
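The ordering problem described above can be modelled in single-threaded Python. This is a deliberately simplified sketch, not the Seastar code: `ShardedValue`, `racy_manager`, and `fixed_manager` are hypothetical, the "race" is shown in its worst-case ordering (the shard-0 observer runs before the broadcast), and it assumes each shard's observer fires when that shard's slot is assigned.

```python
class ShardedValue:
    """Config value kept in one slot per shard, with per-shard observers."""

    def __init__(self, shards, value):
        self.slots = [value] * shards
        self.observers = {}  # shard id -> callback

    def observe(self, shard, callback):
        self.observers[shard] = callback

    def _assign(self, shard, value):
        self.slots[shard] = value
        callback = self.observers.get(shard)
        if callback is not None:
            callback()

    def update(self, value):
        self._assign(0, value)  # shard 0 first, observer fires here
        for shard in range(1, len(self.slots)):
            self._assign(shard, value)  # the "broadcast" to other shards


def racy_manager(cfg):
    """Observer on shard 0 only; it refreshes every shard's view at once."""
    seen = [None] * len(cfg.slots)

    def on_change():
        for shard in range(len(cfg.slots)):
            seen[shard] = cfg.slots[shard]  # non-zero slots are still old

    cfg.observe(0, on_change)
    return seen


def fixed_manager(cfg):
    """Observer on every shard; each refreshes only its own view."""
    seen = [None] * len(cfg.slots)
    for shard in range(len(cfg.slots)):
        def on_change(shard=shard):
            seen[shard] = cfg.slots[shard]
        cfg.observe(shard, on_change)
    return seen


if __name__ == "__main__":
    cfg = ShardedValue(3, "old")
    seen = racy_manager(cfg)
    cfg.update("new")
    print(seen)  # shards 1 and 2 kept the old endpoints

    cfg = ShardedValue(3, "old")
    seen = fixed_manager(cfg)
    cfg.update("new")
    print(seen)  # every shard picked up the new endpoints
```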
Pavel Emelyanov	f902eb1632	test/object_store: Add test to validate how endpoint config update works There's a test for backup with non-existing endpoint/bucket/snapshot. It checks that API call to backup sstables properly fails in that case. This patch adds similar test for "unconfigured endpoint", but it adds the endpoint configuration on-the-fly and expects that backup will proceed after config update. Currently the test fails, as config update only affect the config itself, the storage_manager, that's in charge of maintaining endpoint clients, is not really updated. Next patch will fix it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:41:38 +03:00
Pavel Emelyanov	e0cddc8c99	table: Move snapshot_file_set to table.cc It's not used anywhere else now. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:14:36 +03:00
Pavel Emelyanov	e31b72c61f	table: Rename and move snapshot_on_all_shards() method Now it's database::snapshot_table_on_all_shards(). This is symmetric to database::truncate_table_on_all_shards(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:14:36 +03:00
Pavel Emelyanov	48b1ceefaf	table: Ditch jsondir variable Now the table::snapshot_on_all_shards() is storage-independent and can stop maintaining the local path variable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:14:36 +03:00
Pavel Emelyanov	a21aa5bdf6	table, sstables: Pass snapshot name to sstable::snapshot() Currently sstable::snapshot() is called with directory name where to put snapshots into. This patch changes it to accept snapshot name instead. This makes the table-sstable API unaware of the snapshot destination storage type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:14:36 +03:00
Pavel Emelyanov	d0812c951e	table: Use snapshot_writer in write_manifest() The manifest writing code is self-contained in a sense that it needs list of sstable files and output_stream to write it to. The snapshot_writer can provide output_stream for specific component, it can be re-used for manifest writing code, thus making it independent from local filesystem. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:13:47 +03:00
Pavel Emelyanov	fe8923bdc7	table: Use snapshot_writer in write_schema_as_cql() The schema writing code is self-contained in a sense that it needs schema description and output_stream to write it to. Teach the snapshot_writer to provide output_stream and make write_schema_as_cql() be independent from local filesystem. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:06:17 +03:00
Pavel Emelyanov	9fee06d3bc	table: Add snapshot_writer::sync() And move calls to sync_directory() into it too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:06:16 +03:00
Pavel Emelyanov	7a298788c0	table: Add snapshot_writer::init() And move the call to recursive_touch_directory into it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:06:05 +03:00
Pavel Emelyanov	db6a5aa20b	table: Introduce snapshot_writer It's an abstract class that defines how to write data and metadata with table snapshot. Currently it just replaces the storage_options checks done by table::snapshot_on_all_shards(), but it will soon evolve. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 12:00:26 +03:00
Pavel Emelyanov	8e247b06a2	table: Move final sync and rename seal_snapshot() The seal_snapshot() syncs directory at the end. Now that the method is table.cc-local, it doesn't need to be that careful. It looks nicer if it is renamed to write_manifest() and its caller syncs the directory after calling it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 11:57:26 +03:00
Pavel Emelyanov	37746ba814	table: Hide write_schema_as_cql() The method only needs schema description from table. The caller can pre-get it and pass it as argument. This makes it symmetric with seal_snapshot() (that will be renamed soon) and reduces the class table API size. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 11:57:26 +03:00
Pavel Emelyanov	976bcef5d0	table: Hide table::seal_snapshot() The method is static and has nothing to do with table. The snapshot_file_set needs to become public, but it will be moved to table.cc soon. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 11:57:26 +03:00
Pavel Emelyanov	8a4daf3ef1	table: Open-code finalize_snapshot() It makes it easier to modify this code further. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 11:57:26 +03:00
Pavel Emelyanov	6975234d1c	table: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 11:57:26 +03:00
Pavel Emelyanov	4a88f15465	table: Use smp::invoke_on_all() to populate the vector with filenames There's a vector of foreign pointers to sets with sstable filenames that's populated on all shards. The code does the invoke-on-all by hand to grow the vector with push-back-s. However, if resizing the vector in advance, shards will be able to just populate their slots. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 11:57:25 +03:00
Pavel Emelyanov	41ed11cdbe	table: Don't touch dir once more on seal_snapshot() The directory had been created way before that, in the caller. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 11:57:25 +03:00
Pavel Emelyanov	07708acebd	table: Open-code table::take_snapshot() into caller lambda Now that the logic of take_snapshot() is split between two components (table and sstables_manager), it's no longer useful. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 11:57:25 +03:00
Pavel Emelyanov	ef3b651e1b	table: Move parts of table::take_snapshot to sstables_manager Move the loop over vector of sstables that calls sstable->snapshot() into sstables manager. This makes it symmetric with sstables_manager::delete_atomically() and allows for future changes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 11:57:25 +03:00
Pavel Emelyanov	0c45a7df00	table: Introduce table::take_snapshot() The method returns all sstables vector with a guard that prevents this list from being modified. Currently this is the part of another existing table::take_snapshot() method, but the newer, smaller one, is more atomic and self-contained, next patches will benefit from it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 11:57:25 +03:00
Pavel Emelyanov	2e2cd2aa39	table: Store the result of smp::submit_to in local variable Tossing bits around not to make it in the next, larger, patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-23 11:57:25 +03:00
Botond Dénes	58b5d43538	Merge 'test: multi LWT and counters test during tablets resize and migration' from Yauheni Khatsianevich This PR extends BaseLWTTester with optional counter-table configuration and verification, enabling randomized LWT tests over tablets with counters. And introduces new LWT with counters test during tablets resize and migration - Workload: N workers perform CAS updates - Update counter table each time CAS was successful - Enable balancing and increase min_tablet_count to force split, and lower min_tablet_count to merge. - Run tablets migrations loop - Stop workload and verify data consistency Refs: https://github.com/scylladb/qa-tasks/issues/1918 Refs: https://github.com/scylladb/qa-tasks/issues/1988 Refs https://github.com/scylladb/scylladb/issues/18068 Closes scylladb/scylladb#27170 * github.com:scylladb/scylladb: test: new LWT with counters test during tablets migration/resize - Workload: N workers perform CAS updates - Update counter table each time CAS was successful - Enable balancing and increase min_tablet_count to force split, and lower min_tablet_count to merge. - Run tablets migrations loop - Stop workload and verify data consistency test/lwt: add counter-table support to BaseLWTTester	2025-12-23 07:29:35 +02:00
Botond Dénes	bfdd4f7776	Merge 'Synchronize incremental repair and tablet split' from Raphael Raph Carvalho Split prepare can run concurrently with repair. Consider this: 1) split prepare starts 2) incremental repair starts 3) split prepare finishes 4) incremental repair produces unsplit sstable 5) split is not happening on sstable produced by repair 5.1) that sstable is not marked as repaired yet 5.2) might belong to repairing set (has compaction disabled) 6) split executes 7) repairing or repaired set has unsplit sstable If split was acked to coordinator (meaning prepare phase finished), repair must make sure that all sstables produced by it are split. It's not happening today with incremental repair because it disables split on sstables belonging to repairing group. And there's a window where sstables produced by repair belong to that group. To solve the problem, we want the invariant where all sealed sstables will be split. To achieve this, streaming consumers are patched to produce unsealed sstable, and the new variant add_new_sstable_and_update_cache() will take care of splitting the sstable while it's unsealed. If no split is needed, the new sstable will be sealed and attached. This solution was also needed to interact nicely with out of space prevention too. If disk usage is critical, split must not happen on restart, and the invariant aforementioned allows for it, since any unsplit sstable left unsealed will be discarded on restart. The streaming consumer will fail if disk usage is critical too. The reason interposer consumer doesn't fully solve the problem is because incremental repair can start before split, and the sstable being produced when split decision was emitted must be split before attached. So we need a solution which covers both scenarios. Fixes #26041. Fixes #27414. 
Should be backported to 2025.4 that contains incremental repair Closes scylladb/scylladb#26528 * github.com:scylladb/scylladb: test: Add reproducer for split vs intra-node migration race test: Verify split failure on behalf of repair during critical disk utilization test: boost: Add failure_when_adding_new_sstable_test test: Add reproducer for split vs incremental repair race condition compaction: Fail split of new sstable if manager is disabled replica: Don't split in do_add_sstable_and_update_cache() streaming: Leave sstables unsealed until attached to the table replica: Wire add_new_sstables_and_update_cache() into intra-node streaming replica: Wire add_new_sstable_and_update_cache() into file streaming consumer replica: Wire add_new_sstable_and_update_cache() into streaming consumer replica: Document old add_sstable_and_update_cache() variants replica: Introduce add_new_sstables_and_update_cache() replica: Introduce add_new_sstable_and_update_cache() replica: Account for sstables being added before ACKing split replica: Remove repair read lock from maybe_split_new_sstable() compaction: Preserve state of input sstable in maybe_split_new_sstable() Rename maybe_split_sstable() to maybe_split_new_sstable() sstables: Allow storage::snapshot() to leave destination sstable unsealed sstables: Add option to leave sstable unsealed in the stream sink test: Verify unsealed sstable can be compacted sstables: Allow unsealed sstable to be loaded sstables: Restore sstable_writer_config::leave_unsealed	2025-12-23 07:28:56 +02:00
Botond Dénes	bf9640457e	Merge 'test: add crash detection during tests' from Cezar Moise After tests end, an extra check is performed, looking into node logs for crashes, aborts and similar issues. The test directory is also scanned for coredumps. If any of the above are found, the test will fail with an error. The following checks are made: - Any log line matching `Assertion.failed` or containing `AddressSanitizer` is marked as a critical error - Lines matching `Aborting on shard` will only be marked as a critical error if the patterns in `manager.ignore_cores_log_patterns` are not found in that log - If any critical error is found, the log is also scanned for backtraces - Any backtraces found are decoded and saved - If the test is marked with `@pytest.mark.check_nodes_for_errors`, the logs are checked for any `ERROR` lines - Any pattern in `manager.ignore_log_patterns` and `manager.ignore_cores_log_patterns` will cause above check to ignore that line - The `expected_error` value that many methods, like `manager.decommission_node`, have will be automatically appended to `manager.ignore_log_patterns` refs: https://github.com/scylladb/qa-tasks/issues/1804 --- [Examples](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/testReport/): Following examples are run on a separate branch where changes have been made to enable these failures. 
`test_unfinished_writes_during_shutdown` - Errors are found in logs and are not ignored ``` failed on teardown with "Failed: Server 2096: found 1 error(s) (log: scylla-2096.log) ERROR 2025-12-15 14:20:06,563 [shard 0: gms] raft_topology - raft_topology_cmd barrier_and_drain failed with: std::runtime_error (raft topology: command::barrier_and_drain, the version has changed, version 11, current_version 12, the topology change coordinator had probably migrated to another node) Server 2101: found 4 error(s) (log: scylla-2101.log) ERROR 2025-12-15 14:20:04,674 [shard 0:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive ERROR 2025-12-15 14:20:04,674 [shard 1:strm] repair - repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: Repair 1 out of 4 ranges, keyspace=system_distributed, table=view_build_status, range=(minimum token,maximum token), peers=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], live_peers=[b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e], status=failed: mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - raft_topology_cmd stream_ranges failed with: std::runtime_error (["shard 0: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, 
nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))", "shard 1: std::runtime_error (repair[c434c0c0-68da-472c-ba3e-ed80960ce0d5]: 1 out of 4 ranges failed, keyspace=system_distributed, tables=[\"view_build_status\", \"cdc_generation_timestamps\", \"service_levels\", \"cdc_streams_descriptions_v2\"], repair_reason=bootstrap, nodes_down_during_repair={27c027a6-603d-49d0-8766-1b085d8c7d29}, aborted_by_user=false, failed_because=std::runtime_error (Repair mandatory neighbor=27c027a6-603d-49d0-8766-1b085d8c7d29 is not alive, keyspace=system_distributed, mandatory_neighbors=[27c027a6-603d-49d0-8766-1b085d8c7d29, b549cb36-fae8-490b-a19e-86d42e7aa07a, f7049967-81ff-4296-9be7-9d6a4d33a29e]))"]) ERROR 2025-12-15 14:20:06,812 [shard 0:main] init - Startup failed: std::runtime_error (Bootstrap failed. See earlier errors (Rolled back: Failed stream ranges: std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94))) Server 2094: found 1 error(s) (log: scylla-2094.log) ERROR 2025-12-15 14:20:04,675 [shard 0: gms] raft_topology - send_raft_topology_cmd(stream_ranges) failed with exception (node state is bootstrapping): std::runtime_error (failed status returned from 9dd942aa-acec-4105-9719-9bda403e8e94)" ``` `test_kill_coordinator_during_op` - aborts caused by injection - `ignore_cores_log_patterns` is not set - while there are errors in logs and `ignore_log_patterns` is not set, they are ignored automatically due to the `expected_error` parameter, such as in `await manager.decommission_node(server_id=other_nodes[-1].server_id, expected_error="Decommission failed. 
See earlier errors")` ``` failed on teardown with "Failed: Server 1105: found 1 critical error(s), 1 backtrace(s) (log: scylla-1105.log) Aborting on shard 0, in scheduling group gossip. 1 backtrace(s) saved in scylla-1105-backtraces.txt Server 1106: found 1 critical error(s), 1 backtrace(s) (log: scylla-1106.log) Aborting on shard 0, in scheduling group gossip. 1 backtrace(s) saved in scylla-1106-backtraces.txt Server 1113: found 1 critical error(s), 1 backtrace(s) (log: scylla-1113.log) Aborting on shard 0, in scheduling group gossip. 1 backtrace(s) saved in scylla-1113-backtraces.txt Server 1148: found 1 critical error(s), 1 backtrace(s) (log: scylla-1148.log) Aborting on shard 0, in scheduling group gossip. 1 backtrace(s) saved in scylla-1148-backtraces.txt" ``` Decoded backtrace can be found in [failed_test_logs](https://jenkins.scylladb.com/job/scylla-staging/job/cezar/job/byo_build_tests_dtest/46/artifact/testlog/x86_64/dev/failed_test/test_kill_coordinator_during_op.dev.1) Closes scylladb/scylladb#26177 github.com:scylladb/scylladb: test: add logging to crash_coordinator_before_stream injection test: add crash detection during tests test.py: add pid to ServerInfo	2025-12-23 07:27:58 +02:00
Pavel Emelyanov	cd2568ad00	test: Merge and parametrize test_backup_to_non_existent_something tests There are three tests in cluster/object_store suite that check how backup fails in case either of its parameters doesn't really exist. All three greatly duplicate each other, it makes sense to merge them into one larger parametrized test. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27695	2025-12-23 07:02:18 +02:00
Avi Kivity	7586c5ccbd	Merge 'system.clients: add `client_options` map column' from Vladislav Zolotarov This pull request introduces a new caching mechanism for client options in the Alternator and transport layers, refactors how client metadata is stored and accessed, and extends the `system.clients` virtual table to surface richer client information. The changes improve efficiency by deduplicating commonly used strings (like driver names/versions and client options), and ensure that client data is handled in a way that's safe for cross-shard access. Additionally, the test suite and virtual table schema are updated to reflect the new client options data. Caching and client metadata refactoring: * The largest and most repeatable items in the connection state before this PR were a `driver_name` and a `driver_version` which were stored as an `sstring` object which means that the corresponding memory consumption was 16 bytes per each such value at least (the smallest size of the `seastar`'s `sstring` object) per-connection. In reality the driver name is usually longer than 15 characters, e.g. "ScyllaDB Python Driver" is 23 characters and this is not the longest driver name there is. In such cases the actual memory usage of a corresponding `sstring` object jumps to 8 + 4 + 1 + (string length, 23 in our example) + 1. So, for "ScyllaDB Python Driver" it would be 37 bytes (in reality it would be a bit more due to natural alignment of other allocations since the `contents` size is not well aligned (13 bytes), but let's ignore this for now). * These bytes add up quickly as there are more connections and, sometimes we are talking about millions of connections per-shard. * Using a smart pointer (`lw_shared_ptr`) referencing a corresponding cached value will effectively reduce the per-connection memory usage to be 8 bytes (a size of a pointer on 64-bit CPU platform) for each such value. While storing a corresponding `sstring` value only once. 
* This would reduce the "variable" (per-connection) memory usage by at least 50%. And in case of "ScyllaDB Python Driver" driver version - by 78%! * And all this for a price of a single `loading_shared_values` object per-shard (implements a hash table) and a minor overhead for each value stored in it. * Introduced a new cache type (`client_options_cache_type`) for deduplicating and sharing client option strings, and refactored `client_data`, `client_state`, and related classes to use `foreign_ptr<std::unique_ptr<client_data>>` and cached entry types for fields like driver name, driver version, and client options. (`client_data.hh`, `service/client_state.hh`, `alternator/server.hh`, `alternator/controller.hh`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fR33-R36) [[2]](diffhunk://#diff-664a3b19e905481bdf8eb3843fc4d34691067bb97ab11cfd6e652e74aac51d9fL40-R56) [[3]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL105-R107) [[4]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[5]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL91-R92) [[6]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[7]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[8]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[9]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[10]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[11]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48) * Updated the methods for setting and getting driver name, driver version, and client options in `client_state` to be asynchronous and use the new 
cache. (`service/client_state.hh`, `service/client_state.cc`) [[1]](diffhunk://#diff-daadce1a2de3667511e59558f3a8f077b5ee30a14bcc6a99d588db90d0fcd2bdL154-R182) [[2]](diffhunk://#diff-99634aae22e2573f38b4e2f050ed2ac4f8173ff27f0ae8b3609d1f0cc1aeb775R347-R362) Virtual table and API enhancements: * Extended the `system.clients` virtual table schema and implementation to include a new `client_options` column (a map of option key/value pairs), and updated the table population logic to use the new cached types and foreign pointers. (`db/virtual_tables.cc`) [[1]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1R752) [[2]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L769-R770) [[3]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L809-R816) [[4]](diffhunk://#diff-05f7bff3edb39fb8759c90b445e860189f2f30e04717ed58bae42716082af3d1L828-R879) API and interface changes: * Changed the signatures of `get_client_data` methods throughout the codebase to return vectors of `foreign_ptr<std::unique_ptr<client_data>>` instead of plain `client_data` objects, to ensure safe cross-shard access. 
(`alternator/controller.hh`, `alternator/controller.cc`, `alternator/server.hh`, `alternator/server.cc`, `transport/controller.hh`, `transport/protocol_server.hh`) [[1]](diffhunk://#diff-31730ba8e7374f784a88dc27c1512291cf73b7f24e08768f7466a3c8cfcc7a1aL96-R96) [[2]](diffhunk://#diff-19a97c0247cc08155ee49b277e43859ca32d6ef8cbff0ed7368ec5fa19e0a11eL172-R172) [[3]](diffhunk://#diff-5fce246edf5abffb2351bd02e2eb1e9850880f7a00607ccaa90c3eee7ef57c6bL110-R111) [[4]](diffhunk://#diff-a7e2cda866c03a75afcf3b087de1c1dcd2e7aa996214db67f9a11ed6451e596dL988-R995) [[5]](diffhunk://#diff-eea7e2db5d799a25e717a72ac8ce5842bd4adb72b694d38d8f47166d9cd926faL356-R356) [[6]](diffhunk://#diff-d0b4ec3a144bbc5dc993866cf0b940850a457ff6156064f7e2b4b10ad0a95fefL80-R80) [[7]](diffhunk://#diff-4293b94c444d9bd5ecd17ce7eda8c00685d35ecf6e07f844efc91a91bbe85be1L46-R48) Testing and validation: * Updated the Python test for the `system.clients` table to verify the new `client_options` column and its contents, ensuring that driver name and version are present in the options map. (`test/cqlpy/test_virtual_tables.py`) [[1]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R79) [[2]](diffhunk://#diff-6dd8bd4a6a82cd642252a29dc70726f89a46ceefb991c3e63fc67e283f323f03R88-R90) Closes scylladb/scylladb#25746 * github.com:scylladb/scylladb: transport/server: declare a new "CLIENT_OPTIONS" option as supported service/client_state and alternator/server: use cached values for driver_name and driver_version fields system.clients: add a client_options column controller: update get_client_data to use foreign_ptr for client_data	2025-12-22 20:02:40 +02:00
Emil Maskovsky	d60b908a8e	test/raft: improve reporting in the randomized_nemesis_test digest functions The Boost ASSERTs in the digest functions of the randomized_nemesis_test were not working well inside the state machine digest functions, leading to unhelpful boost::execution_exception errors that terminated the apply fiber, and didn't provide any helpful information. Replaced by explicit checks with on_fatal_internal_error calls that provide more context about the failure. Also added validation of the digest value after appending or removing an element, which makes it possible to determine which operation produced the wrong value. This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282, but adds improved error reporting. Refs: scylladb/scylladb#27307 Refs: scylladb/scylladb#17030 Closes scylladb/scylladb#27791	2025-12-22 20:02:40 +02:00
Nikos Dragazis	20ff2fcc18	docs: Amend limitations for keyspace RF changes The doc about DDL statements claims that an `ALTER KEYSPACE` will fail in the presence of an ongoing global topology operation. This limitation was specifically referring to RF changes, which Scylla implements as global topology requests (`keyspace_rf_change`), and it was true when it was first introduced (`1b913dd880`) because there was no global topology request queue at that time, so only one ongoing global request was allowed in the cluster. This limitation was lifted with the introduction of the global topology request queue (`6489308ebc`), and it was re-introduced very recently (`2e7ba1f8ce`) in a slightly different form; it now applies only to RF changes (not to any request type) and only those that affect the same keyspace. Neither of these two changes was ever reflected in the doc. Synchronize the doc with the current state. Fixes #27776. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#27786	2025-12-22 20:02:40 +02:00
Andrei Chekun	6ffdada0ea	test.py: modify JUnit report for easier rerun on CI This allows adding a custom XML attribute to the JUnit report. In this case the attribute holds the path to the function, which can be used to run it with the pytest command. Parametrized tests will have the path to the function excluding the parameter. Closes scylladb/scylladb#27707	2025-12-22 20:02:40 +02:00
Anna Stuchlik	4c247a5d08	doc: document support for i8g and i8ge instances Fixes https://github.com/scylladb/scylladb/issues/27703 Closes scylladb/scylladb#27754	2025-12-22 20:02:40 +02:00
copilot-swe-agent[bot]	288d4b49e9	Skip backtrace in lsa-timing logs for preemptible reclaim Preemptible reclaim is only done from the background reclaimer, so backtrace is not useful. It's also normal that it takes a long time. Skip the backtrace when reclaim is preemptible to reduce log noise. Fixes the issue where background reclaim was printing unnecessary backtraces in lsa-timing logs when operations took longer than the stall detection threshold. Closes: #27692 Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>	2025-12-22 20:02:40 +02:00
Pavel Emelyanov	e304d912b4	Merge 'db/view/view_building_worker: follow-ups' from Michał Jadwiszczak This patch consists of a few smaller follow-ups to the view building worker: - catch general exception in staging task registrator - remove unnecessary CV broadcast - don't pollute function context with conditionally compiled variable - avoid creating a copy of tasks map - fix some typos Refs https://github.com/scylladb/scylladb/issues/25929 Refs https://github.com/scylladb/scylladb/pull/26897 This PR doesn't fix any bugs but recently we're backporting some PRs to 2025.4, so let's also backport this one to avoid painful conflicts. Closes scylladb/scylladb#26558 * github.com:scylladb/scylladb: docs/dev/view-building-coordinator: fix typos db/view/view_building_worker: remove unnecessary empty lines db/view/view_building_worker: fix typo db/view/view_building_worker: avoid creating a copy of tasks map db/view/view_building_worker: wrap conditionally compiled code in a scope db/view/view_building_worker: remove unnecessary CV broadcast db/view/view_building_worker: catch general exception in staging task registrator	2025-12-22 20:02:40 +02:00
Botond Dénes	846a6e700b	Merge 'get_snapshot_details: process also staging directory' from Benny Halevy Currently, we determine the live vs. total snapshot size by listing all files in the snapshot directory, and for each name, look it up in the base table directory and see if it exists there, and if so, whether it's the same file as in the snapshot by comparing the dev id and inode number from the fstat data. However, we do not look up the names in the staging directory, so staging sstables would skew the results as they will falsely contribute to the live size, since they wouldn't be found in the base directory. This change processes both the staging directory and base table directory and keeps the file capacity in a map, indexed by the file's inode number, allowing us to easily detect hard links and be resilient against concurrent move of files from the staging sub-directory back into the base table directory. Fixes #27635 * Minor issue, no backport required Closes scylladb/scylladb#27636 * github.com:scylladb/scylladb: table: get_snapshot_details: add FIXME comments table: get_snapshot_details: lookup entries also in the staging directory table: get_snapshot_details: optimize using the entry number_of_links table: get_snapshot_details: continue loop for manifest and schema entries table: get_snapshot_details: use directory_lister	2025-12-22 20:02:40 +02:00
Botond Dénes	af5e73def9	Merge 'test/cqlpy: remove unused variables' from Nadav Har'El These patches fix a bunch of variables defined in test/cqlpy tests, but not used. Besides wasting a few bytes on disk, these unused variables can add confusion for readers who see them and might think they have some use which they are missing. All these unused variables were found by Copilot's "code quality" scanner, but I considered each of them, and fixed them manually. Closes scylladb/scylladb#27667 * github.com:scylladb/scylladb: test/cqlpy: remove unused variables test/cqlpy: use unique partition in test	2025-12-22 20:02:39 +02:00
Avi Kivity	8e462d06be	build: apply sccache to rust builds too sccache works for rust as well as for C++; use it for rust builds as well.	2025-12-22 15:36:15 +02:00
Avi Kivity	9ac82d63e9	build: prevent double caching by compiler cache To avoid both sccache and ccache being used simultaneously, detect if ccache is in $PATH and use whatever compiler would have been called instead.	2025-12-22 15:36:14 +02:00
Anna Stuchlik	9793a45288	doc: add a Vector Search page under Features This commit adds a page with an overview of Vector Search under the Features section. It includes a link to the VS documentation in ScyllaDB Cloud, as the feature is only available in ScyllaDB Cloud. The purpose of the page is to raise awareness of the feature. Fixes https://scylladb.atlassian.net/browse/VECTOR-215 Closes scylladb/scylladb#27787	2025-12-22 15:29:45 +02:00
Avi Kivity	afb96b6387	build: allow selecting compiler cache, including sccache Add a new configuration option for selecting the compiler cache. Prefer sccache if found, since it supports rust as well as C++, has better support for distributed compilation, and is slated to receive module support soon. cmake is also supported.	2025-12-22 15:27:15 +02:00
Alex	033579ad6f	db: api: service: Fix ClientConnectorError in test_client_routes The bug was caused by capturing local variables by reference in lambdas passed to with_retry(), which is a coroutine. When the coroutine suspends, the lambda frame exits and the referenced locals are destroyed, leading to use-after-lifetime issues. This change fixes the problem by ensuring safe ownership across suspension points and also refactors how route_keys and route_entries are passed from the caller. Previously they were passed as const lvalue references, which cannot be moved and therefore ended up being repeatedly copied across function calls and lambda invocations. The new approach avoids unnecessary copies and makes the lifetime semantics explicit and safe. Fixes: 27792 no backport needed private link is only in master branch Closes scylladb/scylladb#27795	2025-12-22 14:52:47 +02:00
Yaniv Michael Kaul	c1da552fa4	test/pylib/scylla_cluster.py:get_scylla_2025_1_executable() - retry curl download of 2025.1 For some reason, we might fail. Retry 10 times, and fail with an error code instead of 404 or whatnot. Benign, I hope - no need to backport. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#27746	2025-12-22 14:45:06 +02:00
Piotr Smaron	cb3b96b8f4	raft: correct lease->least typo in a comment Funny, when researching if our raft implementation relies on the 'lease' mechanism, I noticed this typo. Closes scylladb/scylladb#27803	2025-12-22 14:39:55 +02:00
Avi Kivity	b105ad8379	build: drop -fexperimental-assignment-tracking clang option `-fexperimental-assignment-tracking` was added in `fdd8b03d4b` to make coroutine debugging work. However, since then, it became unnecessary, perhaps due to `87c0adb2fe`, or perhaps to a toolchain fix. Drop it, so we can benefit from assignment tracking (whatever it is), and to improve compatibility with sccache, which rejects this option. I verified that the test added in `fdd8b03d4b` fails without the option and passes with this patch; in other words we're not introducing a regression here. Closes scylladb/scylladb#27763	2025-12-22 14:33:48 +02:00
Michael Litvak	9f8aea21e3	docs: update RF-rack restrictions Update the documentation about restrictions to tablets keyspaces related to RF-rack. * MV/SI require the keyspace to be RF-rack-valid * topology operations are restricted if a keyspace has views to preserve RF-rack-validity	2025-12-22 09:21:07 +01:00
Michael Litvak	75b5285cdf	cql3: don't apply RF-rack restrictions on vector indexes When creating an index we validate that the keyspace is RF-rack-valid and print a warning that the keyspace must remain RF-rack-valid. This should apply only to indexes that are based on materialized views for which there are consistency concerns when the keyspace is not RF-rack-valid. vector indexes are not based on materialized views, hence these restrictions should not apply to them.	2025-12-22 09:21:07 +01:00
Michael Litvak	06343b58a2	cql3: add warning when creating mv/index with tablets about rf-rack Creating a MV or index in a tablets-based keyspace now forces additional restrictions on the keyspace. The keyspace must be RF-rack-valid and it must remain RF-rack-valid while the view exists. Add a CQL warning about these restrictions.	2025-12-22 09:21:06 +01:00
Michael Litvak	df801d16da	service/tablet_allocator: always allow tablet merge of tables with views allow tablet merge of tables with views even if the rf_rack_valid_keyspaces option is not set, because now keyspaces that have views are enforced to always be rf-rack-valid, regardless of the option value.	2025-12-22 09:14:30 +01:00
Michael Litvak	07d85af433	locator: extend rf-rack validation for rack lists Extend the RF-rack validation in `assert_rf_rack_valid_keyspace` to validate rack-list-based replication as well. Previously, validation was done only for numeric replication. If the replication is based on a rack list, we validate that all racks that are required for replication are present in the topology rack map. If some rack is needed for replication but is missing, or it doesn't have normal token owner nodes, the validation fails with an error.	2025-12-22 09:14:30 +01:00
Michael Litvak	d40d06c7ad	test: test rf-rack validity when creating keyspace during node ops add tests that attempt to create a keyspace during different stages of node join or remove, and verify that the rf-rack condition can't be broken - either creating the keyspace should fail or the node operation should fail, depending on the stage.	2025-12-22 09:14:30 +01:00
Michael Litvak	a738905a4b	locator: fix rf-rack validation during node join/remove If a keyspace is created while a node is joining or being removed, it could break the rf-rack invariant. For example: 1. We have 3 nodes in 3 racks, no keyspaces 2. A new node starts to join in a new rack - passes validation because there are no keyspaces 3. Create a keyspace with rf=3 - passes validation because the joining node is not a normal token owner yet 4. The new node becomes a normal token owner 5. The rf-rack invariant is broken. We have rf=3 and 4 racks To fix this, we change the rf-rack check to consider a node as a token owner if it's either a normal token owner or it has bootstrap tokens and is about to become a normal token owner. Now the condition can't be broken. Consider keyspace creation at different stages of adding a node in our example: * Before the node is assigned bootstrap tokens: the node is not considered. We can create a keyspace with rf=3 as if the node doesn't exist, and then node join will fail in the group0 operation that assigns bootstrap tokens, because during this operation we check rf-rack validity. * Assigning bootstrap tokens is a single group0 operation that is serialized with keyspace creation. During this operation we check that adding the node as a token owner will maintain rf-rack validity for all keyspaces. * After the node is assigned bootstrap tokens and until it becomes a normal token owner: it is considered as a transitioning token owner by the rf-rack check and the rack is considered a transitioning rack. We can't count the rack as a normal rack because the node join may still fail and rollback. Trying to create a keyspace with either rf=3 or rf=4 will fail because we can end up with either 3 or 4 racks. 
Similarly, when removing a node, we validate that removing the node will maintain rf-rack validity in the same group0 operation that changes the node state to removing/decommissioning, after which the node becomes a leaving endpoint, and it's not considered a normal token owner anymore for the rf-rack check.	2025-12-22 09:14:30 +01:00
Michael Litvak	9940dcefa7	test: test topology restrictions for views with tablets Add tests that verify the restrictions on topology operations when there are keyspaces with tablets and materialized views. For such keyspaces, RF=Racks must be enforced while they have materialized views, therefore adding a node in a new rack or removing a node that would eliminate a rack should be rejected.	2025-12-22 09:14:30 +01:00
Michael Litvak	870aad7f71	test: add test_topology_ops_with_rf_rack_valid add new tests for testing that RF-rack validity is maintained when doing topology operations that may break them, such as adding nodes in new racks or removing nodes.	2025-12-22 09:14:30 +01:00
Michael Litvak	8f1a566be8	topology coordinator: restrict node join/remove to preserve RF-rack validity when a new node joins or an existing node is removed / decommissioned, check if the operation would violate the RF-rack-validity of some keyspace. if so - reject the operation in order to preserve RF-rack-validity. Fixes scylladb/scylladb#23345 Fixes scylladb/scylladb#26820	2025-12-22 09:14:30 +01:00
Michael Litvak	aa8db3b8da	topology coordinator: add validation to node remove add validation to node remove / decommission, similar to node validation when a node joins. when starting node remove or decommission, the validation function checks if the operation is valid and can proceed. if not, it's aborted with an error message. we change the return type of validate_joining_node so that it will be similar and consistent with the new validate_removing_node.	2025-12-22 09:14:29 +01:00
Michael Litvak	9e1f78d162	locator: extend rf-rack validation functions Extend the locator function assert_rf_rack_valid_keyspace to accept arbitrary topology dc-rack maps and nodes instead of using the current token metadata. This allows us to add a new variant of the function that checks rf-rack validity given a topology change that we want to apply. we will use it to check that rf-rack validity will be maintained before applying the topology change. The possible topology changes for the check are node add and node remove / decommission. These operations can change the number of normal racks - if a new node is added to a new rack, or the last node is removed from a rack.	2025-12-22 09:14:29 +01:00
Michael Litvak	8df61f6d99	view: change validate_view_keyspace to allow MVs if RF=Racks The function validate_view_keyspace checks if a keyspace is eligible for having materialized views, and it is used for validation when creating a MV or a MV-based index. Previously, it was required that the rf_rack_valid_keyspaces option is set in order for tablets-based keyspaces to be considered eligible, and the RF-rack condition was enforced when the option is set. Instead of this, we change the validation to allow MVs in a keyspace if the RF-rack condition is satisfied for the keyspace - regardless of the config option. We remove the config validation for views on startup that validates the option `rf_rack_valid_keyspaces` is set if there are any views with tablets, since this is not required anymore. We can do this without worrying about upgrades because this change will be effective from 2025.4 where MVs with tablets are first out of experimental phase. We update the test for MV and index restrictions in tablets keyspaces according to the new requirements. * Create MV/index: previously the test checked that it's allowed only if the config option `rf_rack_valid_keyspaces` is set. This is changed now so it's always allowed to create MV/index if the keyspace is RF-rack-valid. Update the test to verify that we can create MV/index when the keyspace is RF-rack-valid, even if the rf_rack option is not set, and verify that it fails when the keyspace is RF-rack-invalid. * Alter: Add a new test to verify that while a keyspace has views, it can't be altered to become RF-rack-invalid.	2025-12-22 09:14:29 +01:00
Michael Litvak	de1bb84fca	db: enforce rf-rack-validity for keyspaces with views Extend the RF-rack-validity enforcement to keyspaces that have views, regardless of the option `rf_rack_valid_keyspaces`. Previously, RF-rack-validity was enforced when `rf_rack_valid_keyspaces` was set for all keyspaces. Now we want to allow creating MVs in tablet keyspaces that are RF-rack-valid and enforce the RF-rack-validity even if the config option is not set.	2025-12-22 09:13:49 +01:00
Michael Litvak	75b269229d	replica/db: add enforce_rf_rack_validity_for_keyspace helper Add the helper function enforce_rf_rack_validity_for_keyspace that returns true if RF-rack-validity should be enforced for a keyspace, and use it wherever we need to check this instead of checking the config option directly. This is useful because this condition is used in multiple places, and having it defined in a single helper function will make it easier to see and change the enforcement conditions.	2025-12-22 09:13:49 +01:00
Michael Litvak	8b0b0c4d80	db: remove enforce parameter from check_rf_rack_validity simple refactoring: the enforce parameter is always given the value of the `rf_rack_valid_keyspaces` option. remove the parameter and use the option value directly from the db config. this will be useful for a later change to the enforcement conditions.	2025-12-22 09:13:49 +01:00
Michael Litvak	61ae653693	test: adjust test to not break rf-rack validity the test test_unfinished_writes_during_shutdown starts 3 nodes in 3 racks and creates a keyspace with RF=3, then adds a new node in a 4th rack. this breaks rf-rack validity for the keyspace. we change it instead to add the new node in an existing rack. it doesn't matter for the test - the test only wants to add a new node to trigger some topology change.	2025-12-22 09:13:48 +01:00
Karol Nowacki	addac8b3f7	vector_search: test: Fix flaky DNS resolution test The `vector_store_client_test_dns_resolving_repeated` test had race conditions causing it to be flaky. Two main issues were identified: 1. Race between initial refresh and manual trigger: The test assumes a specific resolution sequence, but timing variations between the initial DNS refresh (on client creation) and the first manual trigger (in the test loop) can cause unexpected delayed scheduling. 2. Extra triggers from resolve_hostname fiber: During the client refresh phase, the background DNS fiber clears the client list. If resolve_hostname executes in the window after clearing but before the update completes, pending triggers are processed, incrementing the resolution count unexpectedly. At count 6, the mock resolver returns a valid address (count % 3 == 0), causing the test to fail. The fix relaxes test assertions to verify retry behavior and client clearing on DNS address loss, rather than enforcing exact resolution counts. Fixes: #27074 Closes scylladb/scylladb#27685	2025-12-21 20:02:16 +02:00
Vlad Zolotarov	ea95cdaaec	transport/server: declare a new "CLIENT_OPTIONS" option as supported Declare support for a 'CLIENT_OPTIONS' startup key. This key is meant to be used by drivers for sending client-specific configurations like request timeouts values, retry policy configuration, etc. The value of this key can be any string in general (according to the CQL binary protocol), however, it's expected to be some structured format, e.g. JSON. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2025-12-20 12:26:22 -05:00
Vlad Zolotarov	28cbaef110	service/client_state and alternator/server: use cached values for driver_name and driver_version fields Optimize memory usage by changing the types of driver_name and driver_version to be a reference to a cached value instead of an sstring. These fields very often have the same value among different connections hence it makes sense to cache these values and use references to them instead of duplicating such strings in each connection state. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2025-12-20 12:26:22 -05:00
Vlad Zolotarov	85adf6bdb1	system.clients: add a client_options column This new column is going to contain all OPTIONS sent in the STARTUP frame of the corresponding CQL session. The new column has a `frozen<map<text, text>>` type, and we are also optimizing the amount of required memory for storing corresponding keys and values by caching them on each shard level. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2025-12-20 12:26:15 -05:00
Vlad Zolotarov	3a54bab193	controller: update get_client_data to use foreign_ptr for client_data get_client_data() is used to assemble `client_data` objects from each connection on each CPU in the context of generation of the `system.clients` virtual table data. After collected, `client_data` objects were std::moved and arranged into a different structure to match the table's sorting requirements. This didn't allow having not-cross-shard-movable objects as fields in the `client_data`, e.g. lw_shared_ptr objects. Since we are planning to add such fields to `client_data` in following patches this patch is solving the limitation above by making get_client_data() return `foreign_ptr<std::unique_ptr<client_data>>` objects instead of naked `client_data` ones. Signed-off-by: Vlad Zolotarov <vladz@scylladb.com>	2025-12-19 11:01:41 -05:00
Anna Stuchlik	f65db4e8eb	doc: remove the links to the Download Center This commit removes the remaining links to the Download Center on the website. We no longer use it for installation, and we don't want users to infer that something like that still exists. Fixes https://github.com/scylladb/scylladb/issues/27753 Closes scylladb/scylladb#27756	2025-12-19 12:53:40 +01:00
Botond Dénes	df2ac0f257	Merge 'test: dtest: schema_management_test.py: migrate from dtest' from Dario Mirovic This PR migrates schema management tests from dtest to this repository. One reason is that there is an ongoing effort to migrate tests from dtest to here. Test `TestLargePartitionAlterSchema.test_large_partition_with_drop_column` failed with timeout error once. The main suspect so far are infra related problems, like infra congestion. The [logs from the test execution](https://jenkins.scylladb.com/job/scylla-master/job/dtest-release/1062/testReport/junit/schema_management_test/TestLargePartitionAlterSchema/Run_Dtest_Parallel_Cloud_Machines___Dtest___full_split001___test_large_partition_with_drop_column/), linked in the issue [test_large_partition_with_drop_column failed on TimeoutError #26932](https://github.com/scylladb/scylladb/issues/26932) show the following: - `populate` works as intended - it starts, then during populate/insert drop column happened, then an exception is raised and intentionally ignored in the test, so no `Finish populate DB` for 50 x 1490 records - expected - drop column works as intended - interrupts `populate` and proceeds to flush - flush probably works as intended - logs are consistent with what we expect and what I got in local test runs - `read` is the only thing that visibly got stuck, all the way until timeout happened, 5 minutes after the start Migrating the test to this repo will also give us test start and end times on CI machines, in the sql report database. It has start and end timestamp for each test executed. We will be able to see how long does it usually take when the test is successful. It can not be seen from the logs, because logs are not kept for successful tests. Another thing this PR does is adding a log message at the end of `database::flush_all_tables`. This will let us know if a thread got stuck inside or finished successfully. This addresses the probably part of the flush analysis step described above. 
If the issue reoccurs, we will have more information. The test `test_large_partition_with_add_column` has not been executing for ~5 years. It was never migrated to pytest. The name was left as `large_partition_with_add_column_test`, and was skipped. Now it is enabled and updated. Both `test_large_partition_with_add_column` and `test_large_partition_with_drop_column` are improved. Small performance improvements: - Regex compilation extracted from the stress function to the module level, to avoid recompilation. - Do not materialize list in `stress_object` for loop. Use a generator expression. The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column` and `test_large_partition_with_drop_column`. These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago. The scenario in which the problem could have happened has to involve: - a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition. - appending writes to the partition (not overwrites) - scans of the partition - schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped column lands in the middle (in alphabetical order) of the old column set. 
The way the test is set up is: - fixed number of writes per populate call - fixed number of reads This has the following implications: - if the machine executing the test is fast, all the writes are done before the 10 seconds sleep - there are too many reads - most of them get executed after the test logic is done This patch solves these issues in the following way: - populate lazily generates write data, and stops when instructed by `stop_populating` event - read, which is done sequentially, stops when instructed by `stop_reading` event - number of max operations is increased significantly, but the operations are stopped 1 second after node flush; this makes sure there are enough operations during the test, but also that the test does not take unnecessary time Test execution time has been reduced severalfold. On dev machine the time the tests take is reduced from 110 seconds to 34 seconds. scylla-dtest PR that removes migrated tests: [schema_management_test.py: remove tests already ported to scylladb repo #6427](https://github.com/scylladb/scylla-dtest/pull/6427) Fixes #26932 This is a migration of existing tests to this repository. No need for backport. Closes scylladb/scylladb#27106 * github.com:scylladb/scylladb: test: dtest: schema_management_test.py: speed up `TestLargePartitionAlterSchema` tests test: dtest: schema_management_test.py: fix large partition add column test test: dtest: schema_management_test.py: add `TestSchemaManagement.prepare` test: dtest: schema_management_test.py: test enhancements test: dtest: schema_management_test.py: make the tests work test: dtest: migrate setup and tools from dtest test: dtest: copy unmodified schema_management_test.py replica: database: flush_all_tables log on completion	2025-12-19 12:30:00 +02:00
Botond Dénes	093e97a539	Merge 'test: increase num of requests in driver_service_level tests' from Andrzej Jackowski `_verify_tasks_processed_metrics()` is used to check that the correct service level is used to process requests. It takes two service levels as arguments and executes numerous requests. After that, the number of tasks processed by one of the service levels is expected to rise by at least the number of executed requests. In contrast, the second service level is expected to process fewer tasks than the number of requests. Unfortunately, background noise may cause some tasks to be executed on the service level that is not supposed to process requests. This patch increases the number of executed requests to eliminate the chance of noise causing test failures. Additionally, this commit extends logging to make future investigation easier. Fixes: https://github.com/scylladb/scylladb/issues/27715 No backport, fix for test on master. Closes scylladb/scylladb#27735 * github.com:scylladb/scylladb: test: remove unused `get_processed_tasks_for_group` test: increase num of requests in driver_service_level tests	2025-12-19 10:54:14 +02:00
Emil Maskovsky	fa6e5d0754	test/random_failures: fix handling of banned notification After `39cec4a` node join may fail with either "init - Startup failed" notification or occasionally because it was banned, depending on timing. The change updates the test to handle both cases. Fixes: scylladb/scylladb#27697 No backport: This failure is only present in master. Closes scylladb/scylladb#27768	2025-12-19 09:55:31 +02:00
Emil Maskovsky	08518b2c12	test/raft: fix `test_joining_old_node_fails` flakiness When a node without the required feature attempts to join a Raft-based cluster with the feature enabled, there is a race between the join rejection response ("Feature check failed") and the ban notification ("received notification of being banned"). Depending on timing, either message may appear in the joining node's log. This starts to happen after `39cec4a` (which introduced informing the nodes about being banned). Updated the test to accept both error messages as valid, making the test robust against this race condition, which is more likely in debug mode or under slow execution. Fixes: scylladb/scylladb#27603 No backport: This failure is only present in master. Closes scylladb/scylladb#27760	2025-12-19 09:44:09 +02:00
Emil Maskovsky	2a75b1374e	test/raft: fix race condition in failure_detector_test The test had a sporadic failure due to a broken promise exception. The issue was in `test_pinger::ping()` which captured the promise by move into the subscription lambda, causing the promise to be destroyed when the lambda was destroyed during coroutine unwinding. Simplify `test_pinger::ping()` by replacing manual abort_source/promise logic with `seastar::sleep_abortable()`. This removes the risk of promise lifetime/race issues and makes the code simpler and more robust. Fixes: scylladb/scylladb#27136 Backport to active branches: This fixes a CI test issue, so it is beneficial to backport the fix. As this is a test-only fix, it is a low risk change. Closes scylladb/scylladb#27737	2025-12-19 09:42:19 +02:00
Łukasz Paszkowski	2cb9bb8f3a	test_user_writes_rejection: Disable speculative retries This test starts a 3-node cluster and creates a large blob file so that one node reaches critical disk utilization, triggering write rejections on that node. The test then writes data with CL=QUORUM and validates that the data: - did not reach the critically utilized node - did reach the remaining two nodes By default, tables use speculative retries to determine when coordinators may query additional replicas. Since the validation uses CL=ONE, it is possible that an additional request is sent to satisfy the consistency level. As a result: - the first check may fail if the additional request is sent to a node that already contains data, making it appear as if data reached the critically utilized node - the second check may fail if the additional request is sent to the critically utilized node, making it appear as if data did not reach the healthy node The patch fixes the flakiness by disabling the speculative retries. Fixes https://github.com/scylladb/scylladb/issues/27212 Closes scylladb/scylladb#27488	2025-12-19 09:39:09 +02:00
Botond Dénes	c899c117c7	scylla-gdb.py: scylla read-stats: include all permit lists The current code which collects permit stats is out-of-date (by a few years), as it only iterates through _permit_list. There are 4 additional lists that permits can be part of now (all intrusive). Include all of these in the stat collection. As a bonus, also print the semaphore pointer in the printout, so the user can hand-examine it, should they wish to.	2025-12-18 18:26:07 +02:00
Botond Dénes	db9f9e1b7e	scylla-gdb.py: scylla fiber: add --direction command-line param Can be "forward", "backward" or "both" (default). Allows traversing the fiber in just one direction. Useful when scylla fiber fails to traverse through a task and the user has to locate the next one in the chain manually. When resuming from this next item, the user might want to skip the already seen part of the fiber, to save time on the invocation.	2025-12-18 18:26:07 +02:00
Botond Dénes	4c6e508811	scylla-gdb.py: scylla fiber: add support for traversing through coroutines backward Traversing through coroutines forward (finding task waiting on this coroutine) is already supported. This patch adds support for traversing through coroutines backwards (finding task waited on by coroutine). Coroutines need special handling: the future<> object is most likely allocated on the coroutine frame, so we have to search through that to find it. When doing so the first two pointers on the frame have to be skipped: these are pointers to .resume and .destroy respectively and will halt the search algorithm if seen.	2025-12-18 18:26:06 +02:00
Dario Mirovic	f1d63d014c	test: dtest: schema_management_test.py: speed up `TestLargePartitionAlterSchema` tests The tests in `TestLargePartitionAlterSchema` are `test_large_partition_with_add_column` and `test_large_partition_with_drop_column`. These tests need to replicate the following conditions that led to a bug before a fix from around 5 years ago. The scenario in which the problem could have happened has to involve: - a large partition with many rows, large enough for preemption (every 0.5ms) to happen during the scan of the partition. - appending writes to the partition (not overwrites) - scans of the partition - schema alter of that table. The issue is exposed only by adding or dropping a column, such that the added/dropped column lands in the middle (in alphabetical order) of the old column set. The way the test is set up is: - fixed number of writes per populate call - fixed number of reads This has the following implications: - if the machine executing the test is fast, all the writes are done before the 10 seconds sleep - there are too many reads - most of them get executed after the test logic is done This patch solves these issues in the following way: - populate lazily generates write data, and stops when instructed by `stop_populating` event - read, which is done sequentially, stops when instructed by `stop_reading` event - number of max operations is increased significantly, but the operations are stopped 1 second after node flush; this makes sure there are enough operations during the test, but also that the test does not take unnecessary time Test execution time has been reduced severalfold. On dev machine the time the tests take is reduced from 110 seconds to 34 seconds. 
The patch also introduces a few small improvements: - `cs_run` renamed to `run_stress` for clarity - Stopped checking if cluster is `ScyllaCluster`, since it is the only one we use - `case_map` removed from `test_alter_table_in_parallel_to_read_and_write`, used `mixed` param directly - Added explanation comment on why we do `data[i].append(None)` - Replaced `alter_table` inner function with its body, for simplicity - Removed unnecessary `ck_rows` variable in `populate` - Removed unnecessary `isinstance(self.cluster, ScyllaCluster)` - Adjusted `ThreadPoolExecutor` size in several places where 5 workers are not needed - Replaced functional programming style expressions for `new_versions` and `columns_list` with comprehension/generator statement python style code, improving readability Refs #26932 fix	2025-12-18 17:07:27 +01:00
Michael Litvak	33f7bc28da	docs: document restrictions of colocated tables Currently some things are not supported for colocated tables: it's not possible to repair a colocated table, and due to this it's also not possible to use the tombstone_gc=repair mode on a colocated table. Extend the documentation to explain what colocated tables are and document these restrictions. Fixes scylladb/scylladb#27261 Closes scylladb/scylladb#27516	2025-12-18 15:38:29 +01:00
Cezar Moise	0ef8ca4c57	test: add logging to crash_coordinator_before_stream injection In order to have the test ignore crashes caused by the injection, it needs to log its occurrence.	2025-12-18 16:28:13 +02:00
Cezar Moise	95d0782f89	test: add crash detection during tests After tests end, an extra check is performed, looking into node logs. By default, it only searches for critical errors and scans for coredumps. If the test has the fixture `check_nodes_for_errors`, it will search for all errors. Both checks can be ignored by setting `ignore_cores_log_patterns` and `ignore_log_patterns`. If any of the above are found, the test will fail with an error.	2025-12-18 16:28:13 +02:00
Dario Mirovic	f831ca5ab5	test: dtest: schema_management_test.py: fix large partition add column test `large_partition_with_add_column_test` and `large_partition_with_drop_column_test` were added on August 17th, 2020 in scylladb/scylla-dtest#1569. Only `large_partition_with_drop_column_test` was migrated to pytest, and renamed to `test_large_partition_with_drop_column` on March 31st, 2021 in scylladb/scylla-dtest#2051. Since then this test has not been running. This patch fixes it - the test is updated and renamed and the testing environment now properly picks it up. Refs #26932	2025-12-18 12:54:43 +01:00
Dario Mirovic	1fe0509a9b	test: dtest: schema_management_test.py: add `TestSchemaManagement.prepare` Extract repeated cluster initialization code in `TestSchemaManagement` into a separate `prepare` method. It holds all the common code for cluster preparation, with just the necessary parameters. Refs #26932	2025-12-18 12:54:43 +01:00
Dario Mirovic	e7d76fd8f3	test: dtest: schema_management_test.py: test enhancements Extract regex compilation from the stress functions to the module level, to avoid unnecessary regex compilation repetition. Add descriptions to the stress functions. Do not materialize list in `stress_object` for loop. Use a generator expression. Make `_set_stress_val` an object method. Refs #26932	2025-12-18 12:54:43 +01:00
Dario Mirovic	700853740d	test: dtest: schema_management_test.py: make the tests work Remove unused function markers. Add wait_other_notice=True to cluster start method in TestSchemaHistory.prepare function to make the test stable. Enable the test in suite.yaml for dev and debug modes. Fixes #26932	2025-12-18 12:54:43 +01:00
Dario Mirovic	3c5dd5e5ae	test: dtest: migrate setup and tools from dtest Migrate several functionalities from dtest. These will be used by the schema_management_test.py tests when they are enabled. Refs #26932	2025-12-18 12:54:43 +01:00
Dario Mirovic	5971b2ad97	test: dtest: copy unmodified schema_management_test.py Copy schema_management_test.py from scylla-dtest to test/cluster/dtest/schema_management_test.py. Add license header. Disable it for debug, dev, and release mode. Refs #26932	2025-12-18 12:54:42 +01:00
Dario Mirovic	f89315d02f	replica: database: flush_all_tables log on completion In database::flush_all_tables add log on completion. This slightly improves the readability of logs when debugging an issue. Refs #26932	2025-12-18 12:54:42 +01:00
Patryk Jędrzejczak	d5c205194b	Merge 'topology: Make removenode use left_token_ring state for global barrier' from Emil Maskovsky Make the removenode operation go through the `left_token_ring` state, similar to decommission. This ensures that when removenode completes, all nodes in the cluster are aware of the topology change through a global token metadata barrier. Previously, removenode would skip the `left_token_ring` state and go directly from `write_both_read_new` to `left` state. This meant that when the operation completed, some nodes might not yet know about the topology change, potentially causing issues with subsequent data plane requests. Key changes: - Both decommission and removenode now transition to `left_token_ring` state in the `write_both_read_new` handler - In `left_token_ring` state, only decommissioning nodes receive the shutdown RPC (removed nodes are already dead) - Updated documentation to reflect that both operations use this state This change improves consistency guarantees for removenode operations by ensuring cluster-wide awareness before completion. The change is protected by "REMOVENODE_WITH_LEFT_TOKEN_RING" feature flag to also support mixed clusters during e.g. upgrade. Fixes: scylladb/scylladb#25530 No backport: This fixes an issue found in tests. It can theoretically happen in production too, but wasn't reported in any customer issue, so a backport is not needed. Closes scylladb/scylladb#26931 * https://github.com/scylladb/scylladb: topology: make removenode use left_token_ring state for global barrier topology: allow removing nodes not having tokens features: add feature flag for removenode via left token ring	2025-12-18 09:34:38 +01:00
Andrzej Jackowski	6ad10b141a	test: remove unused `get_processed_tasks_for_group` The function `get_processed_tasks_for_group` was defined twice in `test_raft_service_levels.py`. This change removes the unused definition to avoid confusion and clean up the code.	2025-12-17 20:45:53 +01:00
Andrzej Jackowski	8cf8e6c87d	test: increase num of requests in driver_service_level tests `_verify_tasks_processed_metrics()` is used to check that the correct service level is used to process requests. It takes two service levels as arguments and executes numerous requests. After that, the number of tasks processed by one of the service levels is expected to rise by at least the number of executed requests. In contrast, the second service level is expected to process fewer tasks than the number of requests. Unfortunately, background noise may cause some tasks to be executed on the service level that is not supposed to process requests. This patch increases the number of executed requests to eliminate the chance of noise causing test failures. Additionally, this commit extends logging to make future investigation easier. Fixes: scylladb/scylladb#27715	2025-12-17 20:45:48 +01:00
Michael Litvak	3a06c32749	schema_registry: fix learning a schema with cdc schema When learning a schema that has a linked cdc schema, we need to learn also the cdc schema, and at the end the schema should point to the learned cdc schema. This is needed because the linked cdc schema is used for generating cdc mutations, and when we process the mutations later it is assumed in some places that the mutation's schema has a schema registry entry. We fix a scenario where we could end up with a schema that points to a cdc schema that doesn't have a schema registry entry. This could happen for example if the schema is loaded before it is learned, so when we learn it we see that it already has an entry. In that case, we need to set the cdc schema to the learned cdc schema as well, because it could have been loaded previously with a cdc schema that was not learned. Fixes scylladb/scylladb#27610 Closes scylladb/scylladb#27704	2025-12-17 20:01:00 +02:00
Michał Jadwiszczak	74ab5addd3	test/cluster/test_view_building_coordinator: fix flakiness in test_file_streaming The test generates a staging sstable on a node and verifies whether the view is correctly populated. However, view updates generated by a staging sstable (`view_update_generator::generate_and_propagate_view_updates()`) aren't awaited by the sstable consumer. It's possible that the view building coordinator may see the task as finished (so the staging sstable was processed) but not all view updates were written yet. This patch fixes the flakiness by waiting until `scylla_database_view_update_backlog` drops down to 0 on all shards. Fixes scylladb/scylladb#26683 Closes scylladb/scylladb#27389	2025-12-17 17:29:15 +01:00
Michael Litvak	55f4a2b754	migration_listener: fix deadlock in nested notifications When calling a migration notification from the context of a notification callback, this could lead to a deadlock with unregistering a listener: A: the parent notification is called. it calls thread_for_each, where it acquires a read lock on the vector of listeners, and calls the callback function for each listener while holding the lock. B: a listener is unregistered. it calls `remove` and tries to acquire a write lock on the vector of listeners. it waits because the lock is held. A: the callback function calls another notification and calls thread_for_each which tries to acquire the read lock again. but it waits since there is a waiter. Currently we have such concrete scenario when creating a table, where the callback of `before_create_column_family` in the tablet allocator calls `before_allocate_tablet_map`, and this could deadlock with node shutdown where we unregister listeners. Fix this by not acquiring the read lock again in the nested notification. There is no need because the read lock is already held by the parent notification while the child notification is running. We add a function `thread_for_each_nested` that is similar to `thread_for_each` except it assumes the read lock is already held and doesn't acquire it, and it should be used for nested notifications instead of `thread_for_each`. Fixes scylladb/scylladb#27364 Closes scylladb/scylladb#27637	2025-12-17 14:00:28 +01:00
Emil Maskovsky	1642c686c2	topology: make removenode use left_token_ring state for global barrier Make the removenode operation go through the `left_token_ring` state, similar to decommission. This ensures that when removenode completes, all nodes in the cluster are aware of the topology change through a global token metadata barrier. Previously, removenode would skip the `left_token_ring` state and go directly from `write_both_read_new` to `left` state. This meant that when the operation completed, some nodes might not yet know about the topology change, potentially causing issues with subsequent data plane requests. Key changes: - Both decommission and removenode now transition to `left_token_ring` state in the `write_both_read_new` handler - In `left_token_ring` state, only decommissioning nodes receive the shutdown RPC (removed nodes may already be dead) - Updated documentation to reflect that both operations use this state This change improves consistency guarantees for removenode operations by ensuring cluster-wide awareness before completion. Fixes: scylladb/scylladb#25530	2025-12-17 13:31:11 +01:00
Emil Maskovsky	9431826c52	topology: allow removing nodes not having tokens For the changes to go through the left_token_ring state when REMOVENODE_WITH_LEFT_TOKEN_RING feature is enabled, we need to allow removing nodes to not have any tokens (similarly to decommissioning nodes, which use the same sequence of states). This means the tests also need to change to allow for this new behavior - it can temporarily happen that a removing node has no tokens but is still part of Raft group 0 (so there may be a temporary mismatch between the token ring and group 0 membership). Therefore, the `check_token_ring_and_group0_consistency` function is replaced by `wait_for_token_ring_and_group0_consistency`, which waits up to 30 seconds for consistency to be reached.	2025-12-17 13:31:11 +01:00
Emil Maskovsky	ba6fabfc88	features: add feature flag for removenode via left token ring To improve the behavior of the removenode operation, we want to issue a global topology barrier after the removenode has been applied. However, this requires changing the topology state machine to add a new state (left_token_ring) to the removenode flow, which is not supported by older nodes. To allow rolling upgrades, we add a feature flag REMOVENODE_WITH_LEFT_TOKEN_RING that controls whether the new removenode flow is used.	2025-12-17 13:31:11 +01:00
Avi Kivity	cfd91545f9	test: add proxy protocol tests Test that the new configuration options work and that we can connect to them. Use direct connections with an inline implementation of the proxy protocol and the CQL native protocol, since we want to maintain direct control over the source port number (for shard-aware ports). Also test we land on the expected shard.	2025-12-17 14:18:04 +02:00
Avi Kivity	1382b47d45	config, transport: support proxy protocol v2 enhanced connections We have four native transport ports: two for plain/TLS, and two more for shard-aware (plain/TLS as well). Add four more that expect the proxy protocol v2 header. This allows nodes behind a reverse proxy to record the correct source address and port in system.clients, and the shard-aware port to see the correct source port selection made by the client.	2025-12-17 14:18:04 +02:00
Pavel Emelyanov	a6618f225c	object_storage_endpoint_param: Make it formattable for real Currently the formatter converts it to json and then tries to emit into the output context with the "...{{}}" format string. The intent was to have the "...{<json text>}" output. However, the double curly brace in format string means "print a curly brace", so the output of the above formatting is "...{}", literally. Fix by keeping a single curly brace. The "<json text>" thing will have its own surrounding curly braces. Fixes #27718 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27687	2025-12-17 11:48:39 +01:00
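
The escaping rule behind this fix can be reproduced outside the codebase. Below is a minimal sketch in Python, whose `str.format` uses the same doubled-brace escape as the formatter described above; the `endpoint` prefix and the `json_text` payload are made up for illustration and are not taken from the patch.

```python
import json

# Hypothetical payload standing in for the endpoint parameters.
json_text = json.dumps({"port": 443})

# Doubled braces are an escape: they print literal braces and consume no argument.
broken = "endpoint{{}}".format(json_text)

# A single pair of braces is a replacement field; the JSON text brings its own braces.
fixed = "endpoint{}".format(json_text)

print(broken)
print(fixed)
```

In the first call the argument is silently dropped, which matches the "...{}" output the commit message describes.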
Dani Tweig	0bfd07a268	github: synching scylladb.git new milestones with Jira releases This action is triggered when a new milestone is created in scylladb.git It will call the main logic which will create the same milestone as Jira releases in the SCYLLADB and CUSTOMER Jira projects. Fixes: PM-100 Closes scylladb/scylladb#27717	2025-12-17 10:05:06 +01:00
Tomasz Grabiec	c077283352	Merge 'service: support conversion of tablet keyspaces to rack-list using ALTER KEYSPACE' from Aleksandra Martyniuk If a keyspace has a numeric replication factor in a DC and rf < #racks, then the replicas of tablets in this keyspace can be distributed among all racks in the DC (different for each tablet). With rack list, we need all tablet replicas to be placed on the same racks. Hence, the conversion requires tablet co-location. After this series, the conversion can be done using the ALTER KEYSPACE statement. The statement that does this conversion in any DC is not allowed to change the rf in any DC. So, if we have dc1 and dc2 with 3 racks each and a keyspace ks then with a single ALTER KEYSPACE we can do: - {dc1 : 2} -> {dc1 : [r1, r2]}; - {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: [r2,r3]}; - {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2} - {dc1 : 2} -> {dc1 : 2, dc2 : [r1]} But we cannot do: - {dc1 : 2} -> {dc1 : [r1, r2, r3]}; - {dc1 : 1, dc2 : [r1, r2]} -> {dc1 : [r1], dc2 : [r1]}. To do the co-locations, the rf change request is paused. Tablet load balancer examines the paused rf change requests and schedules necessary tablet migrations. During the process of co-location, no other cross-rack migration is allowed. Load balancer checks whether any paused rf change request is ready to be resumed. If so, it puts the request back to global topology request queue. While an rf change request for a keyspace is running, any other rf change of this keyspace will fail. Fixes: #26398. 
New feature, no backport Closes scylladb/scylladb#27279 * github.com:scylladb/scylladb: test: add est_rack_list_conversion_with_two_replicas_in_rack test: test creating tablet_rack_list_colocation_plan test: add test_numeric_rf_to_rack_list_conversion test tasks: service: add global_topology_request_virtual_task cql3: statements: allow altering from numeric rf to rack list service: topology_coordinator: pause keyspace_rf_change request service: implement make_rack_list_colocation_plan service: add tablet_rack_list_colocation_plan cql3: reject concurrent alter of the same keyspace test: check paused rf change requests persistence db: service: add paused_rf_change_requests to system.topology service: pass topology and system_keyspace to load_balancer ctor service: tablet_allocator: extract load updates service: tablet_allocator: extract ensure_node tasks, system_keyspace: Introduce get_topology_request_entry_opt() node_ops: Drop get_pending_ids() node_ops: Drop redundant get_status_helper()	2025-12-17 10:05:06 +01:00
Anna Stuchlik	7061384a27	doc: replace "Scylla" with "ScyllaDB" in the Alternator docs Fixes https://github.com/scylladb/scylladb/issues/27708 Closes scylladb/scylladb#27709	2025-12-17 10:05:06 +01:00
Tomasz Grabiec	7bc59e93b2	Fix lambda-coroutine fiasco in hint_endpoint_manager.cc Found by copilot. No issue was observed yet. Fixes #27520 Closes scylladb/scylladb#27477	2025-12-16 20:16:41 +03:00
Łukasz Paszkowski	a61c221902	test/pylib/suite/python.py: Handle extra_cmdline_options correctly runner.py defines a command-line option `--extra-scylla-cmdline-options` with the default type=str. However, the function `merge_cmdline_options`, which consumes this value to merge command-line options from multiple sources, expects a list of strings. This mismatch results in the following exception: ``` raise ValueError(f'invalid argument name {name}, all args {args}') ValueError: invalid argument name o, all args --logger-log-level repair=debug --default-log-level=error ``` when a test is run with pytest using: `--extra-scylla-cmdline-options='--logger-log-level repair=debug --default-log-level=error'` Fix this by handling the option consistently and calling `.split()`. Also change the default value from an empty list to an empty string to avoid confusion both in runner.py and test.py. Closes scylladb/scylladb#27523	2025-12-16 20:14:43 +03:00
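
The `invalid argument name o` error above is what one would expect when a string is consumed as if it were a list of options. A minimal sketch of the mismatch and of the `.split()` fix, using only the option string quoted in the commit message; the variable names are illustrative and do not come from runner.py:

```python
# The option value exactly as passed on the pytest command line.
extra = "--logger-log-level repair=debug --default-log-level=error"

# Iterating over a string yields single characters, not option tokens.
as_chars = list(extra)[:3]

# Splitting on whitespace yields the list of strings the consumer expects.
as_tokens = extra.split()

# An empty-string default splits into an empty list, so no special case is needed.
empty_default = "".split()

print(as_chars)
print(as_tokens)
```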
Benny Halevy	798714183e	table: get_snapshot_details: add FIXME comments Ref https://github.com/scylladb/seastar/pull/3163 We can optimize the stat calls we use here by using open_directory to open the snapshot, base, and staging directory once, and using statat calls for the relative name instead of the full blown file_stat that needs to traverse the whole path prefix for every call (the dirents are likely to be cached, but still why waste cpu cycles on that over and over again). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 18:45:56 +02:00
Benny Halevy	f5ca3657e2	table: get_snapshot_details: lookup entries also in the staging directory Since the sstables in the snapshot may still be in the staging dir. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 18:42:05 +02:00
Benny Halevy	dc00461adf	table: get_snapshot_details: optimize using the entry number_of_links If the number_of_links equals 1, we can be sure that the file exists only in the snapshot directory so there is no need to look it up in the data directory. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 18:42:05 +02:00
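
The link-count reasoning can be tried with plain POSIX hard links. The sketch below is in Python rather than the project's C++, and the file names and the `only_in_snapshot` helper are hypothetical; it assumes a filesystem that supports hard links.

```python
import os
import tempfile

def only_in_snapshot(path):
    """Return True when the file has no other hard links (link count is 1)."""
    return os.stat(path).st_nlink == 1

with tempfile.TemporaryDirectory() as root:
    data = os.path.join(root, "data.db")
    snap = os.path.join(root, "snapshot.db")
    with open(data, "w") as f:
        f.write("x")

    # A snapshot entry is modelled here as a hard link to the live file.
    os.link(data, snap)
    shared = only_in_snapshot(snap)

    # Once the live file is removed, the snapshot holds the only link.
    os.unlink(data)
    alone = only_in_snapshot(snap)
```

When the count is 1, a second lookup in the data directory cannot find the same file, which is why it can be skipped.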
Benny Halevy	be6d87648c	table: get_snapshot_details: continue loop for manifest and schema entries Now that we're using a simple loop in the coroutine just continue the loop for files we want to ignore. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 18:42:05 +02:00
Benny Halevy	004c08f525	table: get_snapshot_details: use directory_lister It is more efficient to use the coroutine generator to list the directory. Brewing changes in seastar would make the generator buffered as well as adding an extended generation that would return the file stat data for each entry, that would become useful in the next patch that optimizes the algorithm by considering the entry's link count. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 18:42:05 +02:00
David Garcia	386ec0af4e	docs: fix multiversion build Installs sphinx-scylladb-markdown = "^0.1.4" to fix the multiversion build. Details in https://github.com/scylladb/sphinx-scylladb-theme/pull/1552 fix: update poetry Closes scylladb/scylladb#27619	2025-12-16 19:41:03 +03:00
Pavel Emelyanov	c4496dd63c	Merge 'test/cqlpy: rename tests with duplicate name' from Nadav Har'El When translating Cassandra's unit tests, in a couple of places I accidentally used the same name for two tests, resulting in the first of each pair never running. Let's fix the name of the second of each pair to be the real name it had in the original Cassandra test. Closes scylladb/scylladb#27644 * github.com:scylladb/scylladb: test/cqlpy: rename test with duplicate name test/cqlpy: rename test with duplicate name	2025-12-16 19:32:20 +03:00
Nadav Har'El	84df5cfaf8	test/alternator: delete unnecessary "pass" Fixing something that never bothered anyone but our automated "code quality" tool: there's an unnecessary call to "pass" in one of our tests. Just remove it. Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27645	2025-12-16 19:29:23 +03:00
Botond Dénes	f06db096bd	test/boost/reader_concurrency_semaphore_test: un-flake memory limit engages test This test was observed to fail multiple times recently in promotion, because there were successful reads. The failure only reproduces on arm64, it doesn't reproduce on x86. The suspected reason is that the data set is too close to the edge, where all reads fail due to too high memory consumption. Reduce the number of sstables used by this test to 54 (from 64). Fixes: #27248 Closes scylladb/scylladb#27650	2025-12-16 19:24:49 +03:00
Pavel Emelyanov	31f90c089c	Merge 'test/alternator: remove unused variable assignments and statements' from Nadav Har'El Copilot found in test/alternator a bunch of places where we unnecessarily assign a variable that we don't use, or had a duplicated statement which doesn't do anything. This patch fixes all of them. AI still doesn't know how to prepare a patch that looks anything close to reasonable, so I did this part manually, and also carefully investigated each and every change (this took a lot of human time). These patches don't change anything in the functionality of any of the tests. It's all cosmetic. Closes scylladb/scylladb#27655 * github.com:scylladb/scylladb: test/alternator: remove unnecessary duplicate statement test/alternator: remove unused variable assignments	2025-12-16 19:23:34 +03:00
Nadav Har'El	c58739de6a	Fix for Unused local variable Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27665	2025-12-16 19:20:53 +03:00
Benny Halevy	9e18cfbe17	sstable: add _mutate_sem to serialize link/move with components rewrite We currently have races, like between moving an sstable from staging using change_state, or when taking a snapshot, to e.g. rewrite_statistics that replaces one of the sstable component files when called, for example, from update_repaired_at by incremental repair. Use a semaphore as a mutex to serialize those functions. Note that there is no need for rwlock since the operations are rare and read-only operations like snapshot don't need to run in parallel. Fixes #25919 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 17:06:45 +02:00
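
The "semaphore as a mutex" idea can be sketched with a one-unit semaphore. This is a Python `asyncio` analogue, not the Seastar code from the patch; the operation names `move` and `rewrite` and the `mutate_sem` variable are placeholders chosen to echo the commit message.

```python
import asyncio

async def main():
    # A semaphore with a single unit behaves as a mutex.
    mutate_sem = asyncio.Semaphore(1)
    log = []

    async def op(name):
        async with mutate_sem:
            log.append(f"{name}:start")
            # Yield to the event loop, giving the other operation a chance to run.
            await asyncio.sleep(0)
            log.append(f"{name}:end")

    await asyncio.gather(op("move"), op("rewrite"))
    return log

log = asyncio.run(main())
print(log)
```

Because each operation holds the only unit for its whole critical section, the start and end entries of one operation are never interleaved with those of the other.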
Piotr Dulikowski	7900aa5319	Merge 'server: fix scheduling group update timing in system.clients' from Alex Dathskovsky Previously, the scheduling_group column was updated during the switch_tenant function, which meant the update occurred only after the tenant change operation completed, updating rows one by one. With this change, the scheduling_group column is now updated before the switch_tenant logic runs, ensuring that the table reflects the correct scheduling groups for all rows as early as possible. fixes: #26060 fixes: #27295 backport: not required; this is a minor bug fix. Internal logic worked, but the user couldn't see the change when reading the system.clients table Closes scylladb/scylladb#26404 * github.com:scylladb/scylladb: test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft. The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode. Therefore, the test was moved from cqlpy to the cluster-based framework. This commit both adds the test in cluster/ and removes the old version in cqlpy/. server: Refactor update_control_connection_scheduling_group functionality This refactoring moves the logic that retrieves the scheduling group for driver_service_level_name out of switch_tenant. This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP). The lookup function is fully synchronized, non-coroutine, and returns immediately. For that reason, it’s better to perform this lookup outside of the switch_tenant function. server: Refactor scheduling group update functionality. This change generalizes the scheduling-group update functionality and removes some copy-paste code, improving overall readability and maintainability. To achieve this, capturing lambdas were introduced. 
As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”). server: Fix switch_tenant problem. When running on a V2 server, service-level data comes from the service level cache. Because of this, we can use a synchronized function to get the scheduling group. Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case. This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage. server: Add update_service_level_scheduling_group_v1 functions to create a placeholder for functionality that will introduce v2 implementation. The new functionality will allow usage of service level cache	2025-12-16 15:39:49 +01:00
Aleksandra Martyniuk	9d20f0a3d2	test: add est_rack_list_conversion_with_two_replicas_in_rack	2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk	0476e8d272	test: test creating tablet_rack_list_colocation_plan	2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk	e48789cf6c	test: add test_numeric_rf_to_rack_list_conversion test	2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk	9039dfa4a5	tasks: service: add global_topology_request_virtual_task Add a service::topo::global_topology_request_virtual_task, which covers the replication factor changes. Currently, the global_topology_request_virtual_task can be aborted only if it is paused. The progress of the rf change isn't counted.	2025-12-16 13:31:22 +01:00
Aleksandra Martyniuk	1884e655d6	cql3: statements: allow altering from numeric rf to rack list Allow altering from numeric replication factor to rack list. Ensure that a single ALTER KEYSPACE statement doesn't try to both convert to rack list and change rf.	2025-12-16 13:29:08 +01:00
Aleksandra Martyniuk	640c491388	service: topology_coordinator: pause keyspace_rf_change request To do the conversion from numeric rf to rack list, we need to co-locate tablets of the keyspace, so that all of them have replicas on the same racks. Pause the keyspace_rf_change global topology request, so that the co-location could be done before the ALTER KEYSPACE changes are applied. The pause is needed if in any dc rf changes from numeric to rack list and the co-location is necessary. In this case we don't finish the request. Instead, we add the request to the paused request vector. No migrations are started.	2025-12-16 13:29:08 +01:00
Aleksandra Martyniuk	cd83d1d4dc	service: implement make_rack_list_colocation_plan The make_rack_list_colocation_plan consists of two phases. In the first phase (realized with find_required_rack_list_colocations), we find the pairs of (replica to be co-located, destination dc and rack). We skip the pairs related to the tablets that are in transition or for which the load balancer migration is already planned. We group the pairs by destination dc and rack. Thanks to that in the second phase we can calculate the least loaded nodes and shards only once for each rack. In the second phase, we calculate the load of the nodes in a cluster based on current transition and previously scheduled migrations. We utilize the map created in the first phase and choose the least loaded targets in each rack. We skip the tablets for which the co-location was already scheduled. find_required_rack_list_colocations isn't a method of load_balancer, because in the following changes it is going to be reused by topology coordinator to determine whether the rf change should be paused.	2025-12-16 13:29:05 +01:00
Aleksandra Martyniuk	bbe0b01b14	service: add tablet_rack_list_colocation_plan Add tablet_rack_list_colocation_plan. Keep it in migration_plan. The plan includes a request that is ready to resume. There can be more than one such request at the time, but we consider them one by one for clarity of code. Rack list co-locations will be kept together with normal load balancer migrations. Consider normal load balancer migrations before rack list co-locations. During rack list co-location, allow load balancer migrations to happen only within a single rack. Do not create the merge co-location plan if there is ongoing rack list co-location (if there are any rf changes paused). Generate rf change resume based on the plan. Add _request_to_resume back to global requests queue. make_rack_list_colocation_plan will be implemented in the following change.	2025-12-16 13:27:50 +01:00
Aleksandra Martyniuk	2e7ba1f8ce	cql3: reject concurrent alter of the same keyspace Reject ALTER KEYSPACE request if there is unfinished (queued, pending, or paused) alter request of the same keyspace. This is required as in the following changes, global request queue will contain rf change requests meant to be resumed.	2025-12-16 13:27:48 +01:00
Aleksandra Martyniuk	b3a0e4c2dc	test: check paused rf change requests persistence	2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk	08e5f35527	db: service: add paused_rf_change_requests to system.topology In the following changes, we allow to alter from numeric rf to rack list. Before the alter, two tablets of the same keyspace can have replicas on different racks. To switch to rack list, we need to co-locate the replicas. It will be achieved by pausing the keyspace_rf_change and scheduling migrations. We need to persist the ids of requests that are paused. A new column - paused_rf_change_requests is added to system.topology table. In this commit no data is kept in the new column.	2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk	d66a36058b	service: pass topology and system_keyspace to load_balancer ctor Pass a pointer to service::topology and db::system_keyspace to load balancer. It will be used in the following patches to create rack_list_colocation plan.	2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk	6681c0f33f	service: tablet_allocator: extract load updates Extract consider_scheduled_load and consider_planned_load so that they can be reused.	2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk	13e9ee3f6f	service: tablet_allocator: extract ensure_node Extract ensure_node method so that it can be reused.	2025-12-16 13:25:38 +01:00
Tomasz Grabiec	71e6ef90f4	tasks, system_keyspace: Introduce get_topology_request_entry_opt() It's a cleanup. Better to return std::nullopt than faking an entry with an id when require_entry == false.	2025-12-16 13:25:34 +01:00
Tomasz Grabiec	902803babd	node_ops: Drop get_pending_ids() Every pending request should also have an entry in system.topology_requests so it's redundant. And problematic, because we cannot build a full request entry from just an id alone, so if we would return those requests, they would have blank information, and logic which needs more information would not work.	2025-12-16 13:25:11 +01:00
Dawid Mędrek	df0830044d	cql3: Extend DESC INDEX by view properties We're extending the logic of DESCRIBE INDEX to include properties of the underlying materialized view. Tests are provided to ensure the implementation works as intended.	2025-12-16 11:43:38 +01:00
Dawid Mędrek	d0a42852c0	cql3: Forbid using CLUSTERING ORDER BY when creating index This is a temporary solution as handling this property may require a bit more attention or at least a bit more focus. For now, let's forbid using it so it's clear it won't get applied. A simple test is provided to cover it. We document the restriction.	2025-12-16 11:43:38 +01:00
Dawid Mędrek	6541362a0b	cql3: Extend CREATE INDEX by MV properties After the previous patch that extended the grammar and provided basic functionalities to accommodate properties of materialized views in indexes, this commit takes another step and actually applies them to the underlying view when it's being created. We're providing validation tests for each property, with the single exception of CLUSTERING ORDER BY. That one will be handled separately in an upcoming commit. We also update the user documentation.	2025-12-16 11:43:38 +01:00
Dawid Mędrek	e1f1071bc5	cql3/statements/create_index_statement: Allow for view options We're allowing CREATE INDEX to accept the same set of properties as materialized views do. Our goal is to give the user an ability to configure the underlying materialized view of an index directly, when creating it. This commit doesn't do anything except for extending the grammar and passing the right pieces of information to the right destinations. There's no validation and the options have no effect yet. That will be done in the following patch.	2025-12-16 11:43:37 +01:00
Dawid Mędrek	2cc01eeffb	cql3/statements/create_index_statement: Rename member	2025-12-16 11:43:37 +01:00
Dawid Mędrek	b3099e2f2c	cql3/statements/index_prop_defs: Re-introduce index_prop_defs The type represents a mix of both index-specific and view properties. Since we cannot easily distinguish which properties belong to which entity, let's use this abstraction and filter them from the C++ level. This is a prerequisite for extending the capabilities of CREATE INDEX by allowing it to configure the underlying materialized view.	2025-12-16 11:43:37 +01:00
Dawid Mędrek	6dfebc0b6e	cql3/statements/property_definitions: Add extract_property() The method will be useful in an upcoming commit where we'll want to split a type inherited from `property_definitions` into two separate entities.	2025-12-16 11:43:37 +01:00
Dawid Mędrek	62a8dd1b7f	cql3/statements/index_prop_defs.cc: Add namespace	2025-12-16 11:43:37 +01:00
Dawid Mędrek	dcf2c71204	cql3/statements/index_prop_defs.hh: Rename type We rename the type `index_prop_defs` to `index_specific_prop_defs`. The rationale for the change is to distinguish between properties related directly to an index and properties related to the underlying view (if applicable). The type `index_prop_defs` will be re-introduced in an upcoming commit where it'll encompass both index-related and view-related properties. This is a prerequisite for it.	2025-12-16 11:43:37 +01:00
Dawid Mędrek	e9108677f7	cql3/statements/view_prop_defs.cc: Move validation logic into file We're moving the rest of the validation logic that can be moved from `cql3/statements/{create,alter}_view_statement.cc` to the new file.	2025-12-16 11:43:35 +01:00
Dawid Mędrek	f5f7aeaa0a	cql3/statements: Introduce view_prop_defs.{hh,cc} We're introducing a new type wrapping properties that can be used with materialized views. Doing that, we achieve the following things: (1) We can keep validation logic in one place. (2) We differentiate between properties of a regular table and properties of a materialized view. (3) It provides better modularization and allows for reusing the code. (4) It gets rid of inconsistencies in the existing code, e.g. CREATE MV using one type for properties, while ALTER MV another. The actual end goal of this commit is to be able to reuse at least part of the validation logic of MVs in CREATE INDEX and, when it gets added, ALTER INDEX: we want to endow those statements with an ability to modify the underlying materialized view without having to modify it directly. This patch does NOT implement the whole validation logic yet. It will be done in a following commit. Refs scylladb/scylladb#16454	2025-12-16 11:43:03 +01:00
Tomasz Grabiec	4ed17c9e88	node_ops: Drop redundant get_status_helper()	2025-12-16 11:09:11 +01:00
Patryk Jędrzejczak	73db5c94de	Merge 'db: api: service: introduce system.client_routes table and related API endpoints' from Andrzej Jackowski `system.client_routes` is a system table that sets the target address and ports for each `host_id`, for one or more connections (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`. This patch series contains: - Introduction of `CLIENT_ROUTES` feature flag. - Implementation of raft-based `system.client_routes` table - Implementation of `v2/client-routes` POST/DELETE/GET endpoints - Implementation of new `CLIENT_ROUTES_CHANGE` event that is sent to drivers when `system.client_routes` is changed - New tests that verify the aforementioned features Ref: scylladb/scylla-enterprise#5699 For now, no automatic backport. However, the changes are planned to be released on `2025.4` either as a backport or a private build. Closes scylladb/scylladb#27323 * https://github.com/scylladb/scylladb: docs: describe CLIENT_ROUTES_CHANGE extension test: add test for CLIENT_ROUTES event service: transport: add CLIENT_ROUTES_CHANGE event test: add cluster tests for client routes test: add API tests for client_routes endpoints test: add `timeout` parameter to `delete` in RESTClient test: allow json_body in send api: implement client_routes endpoints api: add client_routes.json service: main: add client_routes_service db: add system.client_routes table gms: add CLIENT_ROUTES feature	2025-12-16 10:38:27 +01:00
Botond Dénes	85f05fbe1b	Revert "Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk" This reverts commit `866c96f536`, reversing changes made to `367633270a`. This change caused all longevities to fail, with a crash in parsing scylla-metadata. The investigation is still ongoing, with no quick fix in sight yet. Fixes: #27496 Closes scylladb/scylladb#27518	2025-12-16 11:34:40 +02:00
Botond Dénes	83f46fa7f5	doc: add video link for TTL Closes: #26210	2025-12-16 10:43:03 +02:00
Anna Stuchlik	ea6f2a21c6	doc: remove references to ScyllaDB versions 4.3 and 4.4 We should never refer to the no longer supported OSS versions. This is a leftover - other mentions were removed long time ago. Fixes https://github.com/scylladb/scylladb/issues/19569 Closes scylladb/scylladb#27656	2025-12-16 06:58:53 +02:00
Yaniv Kaul	30c4bc3f96	Fix for `__iter__` method returns a non-iterator To fix this issue, the std_list_iterator class defined within std_list.__iter__ should implement the full iterator protocol by defining an __iter__() method that returns self. This change ensures any instance of std_list_iterator can be used as an iterator in Python for loops and other iteration contexts, as required. The fix is to add a small method definition inside the std_list_iterator class, ideally after the __init__ or in a logical place with the other dunder methods. Only the code inside the std_list class's __iter__ function (lines around the definition of the inner class and its methods) needs to be edited. Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com> Closes scylladb/scylladb#27642	2025-12-16 06:57:19 +02:00
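
The iterator-protocol point above can be shown with a small stand-in. The `StdList` class below is a hypothetical simplification that wraps a plain Python list; it is not the gdb helper from the patch, and only the shape of the fix (an `__iter__` that returns `self` on the inner iterator class) follows the commit message.

```python
class StdList:
    """Hypothetical stand-in for a container wrapper with an inner iterator class."""

    def __init__(self, items):
        self._items = list(items)

    class _Iterator:
        def __init__(self, items):
            self._items = items
            self._pos = 0

        def __iter__(self):
            # The fix: an iterator must return itself from __iter__.
            return self

        def __next__(self):
            if self._pos >= len(self._items):
                raise StopIteration
            value = self._items[self._pos]
            self._pos += 1
            return value

    def __iter__(self):
        return StdList._Iterator(self._items)


it = iter(StdList([1, 2, 3]))
same_object = iter(it) is it
collected = list(it)
```

Without the inner `__iter__`, calling `iter(it)` or looping over `it` directly would raise `TypeError`, even though looping over the container itself would still work.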
Piotr Smaron	77fa936edc	doc: audit: update to present how to enable both syslog and table Support for both sinks was introduced in https://github.com/scylladb/scylladb/pull/26613, but that change missed the docs updates, so here they are. Closes scylladb/scylladb#27607	2025-12-16 06:56:39 +02:00
Piotr Smaron	0ec485845b	Clarify documentation build instructions Closes scylladb/scylladb#27606	2025-12-16 06:56:00 +02:00
Botond Dénes	dace39fd6c	Merge 'Make commitlog replay handle files with corrupt file header (non-zero) as data loss, not startup failure' from Calle Wilund Fixes #26744 If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category. However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up. The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed test due to scylla not starting up, instead of losing data as expected. Closes scylladb/scylladb#27556 * github.com:scylladb/scylladb: test::cluster::dtest::tools::files: Remove file commitlog_replay: Handle fully corrupt files same as partial corruption. test::pylib::suite::base: Split options.name test specifier only once	2025-12-16 06:55:42 +02:00
Calle Wilund	5f8f724d78	repair: Don't use off-strategy as repair destination with tablet tables Fixes #17384 Bypasses enabling off-strategy storage/placement for repair streams when table repaired is using tablets. Instead, the resulting sstable(s) will be placed in the "normal" set of sstables, and bypass a post-repair off-strategy compaction. v2: Bypass off-strat for whatever reason iff dest is tablets. Closes scylladb/scylladb#27500	2025-12-16 06:54:07 +02:00
Michał Chojnowski	df93ea626b	test/scylla_gdb: use `gcore` instead of `signal SIGSEGV` to generate a coredump on failure The test fails in CI sometimes, and we want a coredump from a failure to debug that. We made the test send a `signal SIGSEGV` to Scylla on failure, but apparently that doesn't work as intended on our CI hosts. (The CI runner seemingly can't find any coredump afterwards). We can use gdb's `gcore` command to produce a coredump in a more predictable way. Refs scylladb/scylladb#22501 Closes scylladb/scylladb#27498	2025-12-16 06:53:43 +02:00
Botond Dénes	74347625f9	Merge 'test/alternator: add reproducers for more issues' from Nadav Har'El This series adds xfailing reproducers for two issues, #8070 and #27037. #27037 is about the case where, even with alternator_streams_increased_compatibility set to true, if an attribute is set to the same value it had but using a different JSON representation, an Alternator Streams event is unduly produced. #8070 is about the ability to write malformed values into the database and then fail during read, instead of failing, as expected, during the write. This issue was known for years, but we never really had a reproducer for it - it's not possible to reproduce it using clean boto3 code and we need to build a request manually. The first two patches are two small cleanups (including a fix for #27372) that I did while preparing the real tests - which are in the final two patches. Closes scylladb/scylladb#27376 * github.com:scylladb/scylladb: test/alternator: add reproducer for bug with storing invalid values test/alternator: reproducer for issue 27375 utils/rjson: fix error messages from rjson::parse() test/alternator: extract get_signed_request() to util.py	2025-12-16 06:53:14 +02:00
Andrzej Jackowski	f1fc5cc808	docs: describe CLIENT_ROUTES_CHANGE extension Ref: scylladb/scylla-enterprise#5699	2025-12-15 18:19:37 +01:00
Andrzej Jackowski	61bbea51ad	test: add test for CLIENT_ROUTES event Ref: scylladb/scylla-enterprise#5699	2025-12-15 18:19:37 +01:00
Andrzej Jackowski	c2b1b10ca0	service: transport: add CLIENT_ROUTES_CHANGE event Introduce the CLIENT_ROUTES_CHANGE event to let drivers refresh connections when `system.client_routes` is modified. Some deployments (e.g., Private Link) require specific address/port mappings that can change without topology changes and drivers need to adapt promptly to avoid connectivity issues. This new EVENT type carries a change indicator plus the affected `connection_ids` and `host_ids`. The only change value is `UPDATE_NODES`, meaning one or more client routes were inserted, updated, or deleted. Drivers subscribe using the existing events mechanism, so no additional `cql_protocol_extension` key is required. Ref: scylladb/scylla-enterprise#5699	2025-12-15 18:19:37 +01:00
Andrzej Jackowski	ec87b92ba1	test: add cluster tests for client routes Ref: scylladb/scylla-enterprise#5699	2025-12-15 18:17:15 +01:00
Andrzej Jackowski	9c9371511f	test: add API tests for client_routes endpoints Ref: scylladb/scylla-enterprise#5699	2025-12-15 17:46:14 +01:00
Andrzej Jackowski	2e80997630	test: add `timeout` parameter to `delete` in RESTClient The parameter was missing and is needed to implement a test case later in this patch series.	2025-12-15 17:44:48 +01:00
Andrzej Jackowski	1143acaf5b	test: allow json_body in send Needed to test `/v2/connection_metadata` endpoints that receive JSON input. Ref: scylladb/scylla-enterprise#5699	2025-12-15 17:44:48 +01:00
Andrzej Jackowski	e153cc434f	api: implement client_routes endpoints Ref: scylladb/scylla-enterprise#5699	2025-12-15 17:36:47 +01:00
Nadav Har'El	4e106b9820	test/cqlpy: remove unused variables Copilot detected a few cases of cqlpy tests setting a variable which they don't use. In all the cases in this patch, we can just remove the variable. Although the AI found all these unused variables, I verified each case carefully before changing it in this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 18:11:04 +02:00
Nadav Har'El	64d9c370ee	test/alternator: remove unnecessary duplicate statement copilot noticed that test/alternator/test_scan.py had a duplicate statement (call to full_scan()). It doesn't break the test, but also adds nothing but confusion - so let's just remove it. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 18:07:45 +02:00
Nadav Har'El	a3959fe3db	test/alternator: remove unused variable assignments Copilot noticed that in many of the Alternator tests, we have some unnecessary assignments. For example, in a few places, we use the idiom: with pytest.raises(...): ret = ... The "ret=" part is unnecessary, as this test expects the statement to fail (hence the raises()), and ret is never assigned. The assignment was only there because we copied this statement from another place in the test, which does expect the statement to pass and wants to validate the returned value. So we should just drop the "ret=" from these tests. Another common occurrence is that we used the idiom response = table.do_something() without checking the response and with no intention to check it (either we know it will work, or we just want to check it doesn't throw). So we can drop the "response=" here too. All of the unused variables in this patch were discovered by Copilot, but I reviewed each of them carefully myself and prepared this patch. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 18:07:05 +02:00
Nadav Har'El	4fa4f40712	test/cqlpy: use unique partition in test It is traditional to use a unique (or random) partition key in cqlpy tests, to allow multiple tests to share the same table and make the test suite a bit faster. One of the tests, test_multi_column_relation_desc, set up a unique key "k", but then forgot to use it and used partition key 0 instead. Fix the test to use this k. This problem was spotted by Copilot, who saw the unused variable k. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 17:08:51 +02:00
Yauheni Khatsianevich	07867a9a0d	test: new LWT with counters test during tablets migration/resize - Workload: N workers perform CAS updates - Update counter table each time CAS was successful - Enable balancing and increase min_tablet_count to force split, and lower min_tablet_count to merge. - Run tablets migrations loop - Stop workload and verify data consistency	2025-12-15 14:32:30 +01:00
Patryk Jędrzejczak	844545bb74	Merge 'treewide: fix cases of improper re-throwing of `std::exception_ptr`' from Emil Maskovsky Fix multiple cases where the captured `std::exception_ptr` has been re-thrown via simple `throw eptr;`, which results in losing the original exception type and details. Resolved at various places found by clang-tidy: 1. db::schema_applier When applying schema changes, the previous implementation attempted to handle exceptions by catching and rethrowing them, but did so incorrectly: using `throw ex` with a `std::exception_ptr` loses the original exception type and details. However, in this case, explicit exception handling is unnecessary. The only reason for catching was to ensure `ap.destroy()` is called before propagating the exception. This can be more cleanly and safely achieved using Seastar's `.finally()` continuation, which guarantees cleanup regardless of success or failure. 2. directories The `std::exception_ptr()` has been captured for logging and then again re-thrown incorrectly via `throw ex;`. We could use `std::rethrow_exception()` here instead, but it seems to be simpler to just use regular `throw;` to rethrow the original exception, and only use the `std::current_exception()` for logging (which is a pattern used in other places as well). 3. storage_service Here the exception has been re-thrown incorrectly in a coroutine. There it is best to use the `co_await coroutine::return_exception_ptr` to propagate exception more efficiently in a coroutine-friendly manner. Fixes: SCYLLADB-94 Refs: scylladb/scylladb#27501 No backport: This fixes an error logging issue, that isn't a production problem by itself (only found in test), therefore not backporting to older branches. 
Closes scylladb/scylladb#27613 * https://github.com/scylladb/scylladb: db: schema_applier: improve exception-safe cleanup directories: fix exception rethrowing storage_service: use coroutine-friendly exception propagation in join_node_response_handler	2025-12-15 13:56:45 +01:00
Nadav Har'El	ccacea621f	test/cqlpy: fix flaky test test_view_in_system_tables The cqlpy test test_materialized_view.py::test_view_in_system_tables checks that the system table "system.built_views" can inform us that a view has been built. This test was flaky, starting to fail quite often recently, and this patch fixes the problem in the test. For historic reasons this test began by calling a utility function wait_for_view_built() - which uses a different system table, system_distributed.view_build_status, to wait until the view was built. The test then immediately tries to verify that also system.built_views lists this view. But there is no real reason why we could assume - or want to assume - that these two tables are updated in this order, or how much time passed between the two tables being changed. The authors of this test already acknowledged there is a problem - they included a hack purporting to be a "read barrier" that claimed to solve this exact problem - but it seems it doesn't, or at least no longer does after recent changes to the view builder's implementation. The solution is simple - just remove the call to wait_for_view_built() and the "hack" after it. We should just wait in a loop (until a timeout) for the system table that we really wanted to check - system.built_views. It's as simple as that. No need for any other assumptions or hacks. Fixes #27296 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27626	2025-12-15 15:29:08 +03:00
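The "wait in a loop (until a timeout)" approach from the commit above can be sketched generically. `wait_until` and the predicate are hypothetical helpers written for this illustration, not the utilities used in test/cqlpy; in the real test the predicate would query `system.built_views`.

```python
import time


def wait_until(predicate, timeout=30.0, interval=0.1):
    """Poll predicate() until it returns True or the timeout expires.

    Returns True if the condition was met, False on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


# Simulated condition: becomes true on the third poll. A real test would
# instead check whether the view is listed in the system table of interest.
polls = {"count": 0}


def view_is_listed():
    polls["count"] += 1
    return polls["count"] >= 3


met = wait_until(view_is_listed, timeout=5.0, interval=0.01)
```

Polling the table the test actually cares about removes any assumption about the order in which other tables are updated.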
Nadav Har'El	f287484f4d	test/cqlpy: rename test with duplicate name When translating Cassandra's test validation/operations/CreateTest.java I accidentally used the same name for two tests, resulting in the first of them never being run. Let's fix the name of the second of the two to be the real name it had in the original Cassandra test. After this patch pytest reports 16 tests in this file, instead of 15 before this patch. The previously-ignored test was correct, and it now passes in both Scylla and Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 14:19:24 +02:00
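Why a duplicated test name hides the first test can be shown with plain Python, independent of pytest: a second `def` with the same name rebinds the module attribute, so only one function remains to be collected. The module name and test bodies below are made up for this sketch.

```python
import types

# Two test functions that accidentally share a name.
source = '''
def test_create():
    return "first"

def test_create():
    return "second"
'''

# Execute the source in a fresh module namespace, as an import would.
module = types.ModuleType("dup_example")
exec(source, module.__dict__)

# Only one attribute survives: the second definition replaced the first.
collected = [name for name in vars(module) if name.startswith("test")]
result = module.test_create()
```

Renaming the second function, as the commit does, makes both definitions visible again.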
Benny Halevy	f4a4671ad6	table: seal_snapshot: avoid oversized allocation when dumping manifest.json Currently, we first print the json contents into a stringstream buffer and then we write it as a whole to the manifest.json file output stream. This is not scalable and may cause large allocation for large enough number of files. Fixes #24216 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27542	2025-12-15 15:19:24 +03:00
Dawid Mędrek	8691d287be	cql3/statements/create_view_statement.cc: Move validation of ID The end goal we have in mind in this commit is to extract the validation logic of the options used for creating and altering an MV to a separate place and be able to call from different places in the code. It will be useful when extending the capabilities of the CREATE INDEX statement. In this patch, we move the part of validation responsible for checking the ID option to keep it close to the other parts of validation of the options in their "raw" form.	2025-12-15 13:18:48 +01:00
Dawid Mędrek	11c109c623	schema/schema.hh: Do not include index_prop_defs.hh One of the upcoming commits will lead to a cyclic dependency of headers because `schema.hh` includes `index_prop_defs.hh`. To prevent that, we remove the include and replace it with a manually added alias. This is not a perfect solution, but doing it properly would require comprehensive changes. We can do that in a separate task.	2025-12-15 13:18:48 +01:00
Andrzej Jackowski	70a0418102	api: add client_routes.json Add the JSON definitions for the POST, GET, and DELETE endpoints used to modify client routes. These endpoints are intended for Cloud to update the `system.client_routes` table. The API is implemented in `/v2/` because the endpoints process arrays of objects. Handling of such structures was improved between Swagger 1.2 and 2.0 versions. There are already similar `get_metrics_config` and `set_metrics_config` endpoints that operate on similar structures and they are also in /v2/. The introduced JSON files start with `, ` but it's intended because the files are concatenated to the existing (metrics) JSON files, and they need to represent valid JSON after the concatenation. Ref: scylladb/scylla-enterprise#5699	2025-12-15 13:13:46 +01:00
Andrzej Jackowski	6fcc1ecf94	service: main: add client_routes_service Introduce `client_routes_service` for managing `system.client_routes` table. Ref: scylladb/scylla-enterprise#5699	2025-12-15 13:13:40 +01:00
Andrzej Jackowski	8dde70d04c	db: add system.client_routes table Introduce `system.client_routes`, a system table that sets the target address and ports for each `host_id`, for one or more connections (e.g., Private Link) represented by `connection_id`. Cloud will write the table via REST, and drivers will read it via CQL to override values obtained from `system.local` and `system.peers`. The table is Raft-managed to provide consistent replication across nodes. Schema overview: each row is identified by `(connection_id, host_id)` and describes where clients should connect: `address` and one or more of `port`, `tls_port`, `alternator_port`, `alternator_https_port`. `host_id` is a UUID (just as in ScyllaDB) but `connection_id` can be any string to accept formats of all cloud providers. `address` is also a regular string because it can represent either an IP address or a domain. Ports are optional in the sense that at least one of the four must be provided. Ref: scylladb/scylla-enterprise#5699	2025-12-15 13:08:05 +01:00
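The row shape and the "at least one of the four ports" rule described in the commit above can be modelled as a small sketch. `ClientRoute` is a hypothetical Python class written for illustration only; it is not the ScyllaDB implementation, and the validation shown is an assumption about how the stated rule could be checked.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ClientRoute:
    """Illustrative model of one system.client_routes row (hypothetical)."""

    connection_id: str      # any string, to fit all cloud providers' formats
    host_id: uuid.UUID
    address: str            # IP address or domain name
    port: Optional[int] = None
    tls_port: Optional[int] = None
    alternator_port: Optional[int] = None
    alternator_https_port: Optional[int] = None

    def __post_init__(self):
        ports = (self.port, self.tls_port,
                 self.alternator_port, self.alternator_https_port)
        # Ports are individually optional, but at least one must be set.
        if all(p is None for p in ports):
            raise ValueError("at least one port must be provided")

    @property
    def key(self):
        # Each row is identified by (connection_id, host_id).
        return (self.connection_id, self.host_id)


host = uuid.UUID("12345678-1234-5678-1234-567812345678")
route = ClientRoute("pl-conn-1", host, "node1.example.com", tls_port=9142)

try:
    ClientRoute("pl-conn-1", host, "node1.example.com")
    rejected = False
except ValueError:
    rejected = True
```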
Andrzej Jackowski	2e7070d3b7	gms: add CLIENT_ROUTES feature The feature will be used later in this patch series: - To avoid unnecessary operations when the feature is not enabled - To guard new API endpoints from being used before the cluster is ready to use them. - To implement update tests (by disabling/enabling the feature) Ref: scylladb/scylla-enterprise#5699	2025-12-15 13:08:04 +01:00
Pavel Emelyanov	3f7ee3ce5d	Merge 'batchlog: make replay (flush) faster' from Botond Dénes The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied. When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table. Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed). Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range. When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC. The second approach, represented by this PR, is to not rely on tombstone GC to reduce the tombstone amount. 
Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions. This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR: CREATE TABLE system.batchlog_v2 ( version int, stage int, shard int, written_at timestamp, id uuid, data blob, PRIMARY KEY ((version, stage, shard), written_at, id)); The new schema organization has the following goals: 1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC. 2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely. 3) Spread out the data among Scylla shards, via the batchlog shard column. 4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed. Fixes: https://github.com/scylladb/scylladb/issues/23358 This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`. 
Closes scylladb/scylladb#26671 * github.com:scylladb/scylladb: db/config: change batchlog_replay_cleanup_after_replays default to 1 test/boost/batchlog_manager_test: add test for batchlog cleanup replica/mutation_dump: always set position weight for clustering positions service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/ test/lib: introduce error_injection.hh utils/error_injection: add debug log to disable() and disable_all() test/lib/cql_test_env: forward config to batchlog test/lib/cql_test_env: add batch type to execute_batch() test/lib/cql_assertions: add with_size(predicate) overload test/lib/cql_assertions: add source location to fail messages test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row() test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload db/batchlog_manager: config: s/write_timeout/reply_timeot/ db,service: switch to system.batchlog_v2 db/system_keyspace: introduce system.batchlog_v2 service,db: extract generation of batchlog delete mutation service,db: extract get_batchlog_mutation_for() from storage-proxy db/batchlog_manager: only consider propagation delay with tombstone-gc=repair db/batchlog_manager: don't drop entire batch if one mutations' table was dropped data_dictionary: table: add get_truncation_time() db/batchlog_manager: batch(): replace map_reduce() with simple loop db/batchlog_manager: finish coroutinizing replay_all_failed_batches db/batchlog_manager: improve replayAllFailedBatches logs	2025-12-15 15:05:19 +03:00
Nadav Har'El	a9442e6d56	test/cqlpy: rename test with duplicate name When translating Cassandra's test validation/operations/DeleteTest.java I accidentally used the same name for two tests, resulting in the first of them never being run. Let's fix the name of the second of the two to be the real name it had in the original Cassandra test. After this patch pytest reports 52 tests in this file, instead of 51 before this patch. The previously-ignored test was correct, and it now passes in both Scylla and Cassandra. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-15 12:02:59 +02:00
Dawid Mędrek	1e14c08eee	locator/token_metadata: Remove get_host_id() The function is declared, but it's not defined or used anywhere. Closes scylladb/scylladb#27374	2025-12-15 10:36:52 +01:00
Michael Litvak	b9ec1180f5	alternator: require rf_rack_valid_keyspaces when creating index When creating an alternator table with tablets, if it has an index, LSI or GSI, require the config option rf_rack_valid_keyspaces to be enabled. The option is required for materialized views in tablets keyspaces to function properly and avoid consistency issues that could happen due to cross-rack migrations and pairing switches when RF-rack validity is not enforced. Currently the option is validated when creating a materialized view via the CQL interface, but it's missing from the alternator interface. Since alternator indexes are based on materialized views, the same check should be added there as well. Fixes scylladb/scylladb#27612 Closes scylladb/scylladb#27622	2025-12-15 10:36:57 +02:00
Michał Hudobski	12483d8c3c	vector_search: throw an error when we restrict primary in vector search We currently allow restrictions on single column primary key, but we ignore the restriction and return all results. This can confuse the users. We change it so such a restriction will throw an error and add a test to validate it. Fixes: VECTOR-331 Closes scylladb/scylladb#27143	2025-12-15 09:45:56 +02:00
Jenkins Promoter	d5641398f5	Update pgo profiles - aarch64	2025-12-15 05:16:31 +02:00
Alex	d21faab9dc	test: cqlpy: Remove test_switch_tenants and add test in cluster testing. The test needs to run twice, in two separate Scylla runs, using two different modes: gossip and raft. The cluster framework supports this setup, while cqlpy only runs against Scylla instances in raft mode. Therefore, the test was moved from cqlpy to the cluster-based framework. This commit both adds the test in cluster/ and removes the old version in cqlpy/.	2025-12-14 18:46:06 +02:00
Alex	30f6a40ae6	server: Refactor update_control_connection_scheduling_group functionality This refactoring moves the logic that retrieves the scheduling group for driver_service_level_name out of switch_tenant. This change is possible because the scheduling group for the driver is retrieved from a map (LOOKUP). The lookup function is fully synchronized, non-coroutine, and returns immediately. For that reason, it’s better to perform this lookup outside of the switch_tenant function.	2025-12-14 18:46:06 +02:00
Alex	5579489c4c	server: Refactor scheduling group update functionality. This change generalizes the scheduling-group update functionality and removes some copy-paste code, improving overall readability and maintainability. To achieve this, capturing lambdas were introduced. As a result, self-deducing this was added to those lambdas to avoid coroutine-related issues (“coroutine fiasco”).	2025-12-14 18:46:05 +02:00
Alex	17c9d640fe	server: Fix switch_tenant problem. When running on a V2 server, service-level data comes from the service level cache. Because of this, we can use a synchronized function to get the scheduling group. Since we are transitioning to a Raft-based architecture where all servers will be V2, we can safely implement this fix specifically for that case. This change adds get_cached_user_scheduling_group functionality and moves its usage out of switch_tenant function in update_scheduling_group_v2 usage.	2025-12-14 16:27:40 +02:00
Alex	f98af582a7	server: Add update_service_level_scheduling_group_v1 functions to create a placeholder for functionality that will introduce the v2 implementation. The new functionality will allow usage of the service level cache	2025-12-14 16:09:18 +02:00
Nadav Har'El	c06e63daed	Merge 'auth: start using SHA 512 hashing originated from musl with added yielding' from Andrzej Jackowski This patch series contains the following changes: - Incorporation of `crypt_sha512.c` from musl to our codebase - Conversion of `crypt_sha512.c` to C++ and coroutinization - Coroutinization of `auth::passwords::check` - Enabling use of `__crypt_sha512` originated from `crypt_sha512.c` for computing SHA 512 passwords of length <=255 - Addition of yielding in the aforementioned hashing implementation. The alien thread was a solution for reactor stalls caused by indivisible password‑hashing tasks (https://github.com/scylladb/scylladb/issues/24524). However, because there is only one alien thread, overall hashing throughput was reduced (see, e.g., https://github.com/scylladb/scylla-enterprise/issues/5711). To address this, the alien‑thread solution is reverted, and a hashing implementation with yielding is introduced in this patch series. Before this patch series, ScyllaDB used SHA-512 hashing provided by the `crypt_r` function, which in our case meant using the implementation from the `libxcrypt` library. Adding yielding to this `libxcrypt` implementation is problematic, both due to licensing (LGPL) and because the implementation is split into many functions across multiple files. In contrast, the SHA-512 implementation from `musl libc` has a more permissive license and is concise, which makes it easier to incorporate into the ScyllaDB codebase. The performance of this solution was compared with the previous implementation that used one alien thread and the implementation after the alien thread was reverted. 
The results (median) of `perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters are as follows: - Alien thread: 41.5 new connections/s per shard - Reverted alien thread: 244.1 new connections/s per shard - This commit (yielding in hashing): 198.4 new connections/s per shard The roughly 20% performance deterioration compared to the old implementation without the alien thread comes from the fact that the new hashing algorithm implemented in `utils/crypt_sha512.cc` performs an expensive self-verification and stack cleanup. On the other hand, with smp=10 the current implementation achieves roughly 5x higher throughput than the alien thread. In addition, due to yielding added in this commit, the algorithm is expected to provide similar protection from stalls as the alien thread did. In a test that in parallel started a cassandra-stress workload and created thousands of new connections using python-driver, the values of `scylla_reactor_stalls_count` metric were as follows: - Alien thread: 109 stalls/shard total - Reverted alien thread: 13186 stalls/shard total - This commit (yielding in hashing): 149 stalls/shard total Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms` values were: - Alien thread: 1087 ms/shard total - Reverted alien thread: 72839 ms/shard total - This commit (yielding in hashing): 1623 ms/shard total To summarize, yielding during hashing computations achieves similar throughput to the old solution without the alien thread but also prevents stalls similarly to the alien thread. Fixes: scylladb/scylladb#26859 Refs: scylladb/scylla-enterprise#5711 No automatic backport. After this PR is completed, the alien thread should be rather reverted from older branches (2025.2-2025.4 because on 2025.1 it's already removed). Backporting of the other commits needs further discussion. 
Closes scylladb/scylladb#26860 * github.com:scylladb/scylladb: test/boost: add too_long_password to auth_passwords_test test/boost: add same_hashes_as_crypt_r to auth_passwords_test auth: utils: add yielding to crypt_sha512 auth: change return type of passwords::check to future auth: remove code duplication in verify_scheme test/boost: coroutinize auth_passwords_test utils: coroutinize crypt_sha512 utils: make crypt_sha512.cc to compile utils: license: import crypt_sha512.c from musl to the project Revert "auth: move passwords::check call to alien thread"	2025-12-14 14:01:01 +02:00
David Garcia	c1c3b2c5bb	docs: fix local build Prevents early exits in metrics docs generation from breaking the local build. Fixes #27497 Closes scylladb/scylladb#27615	2025-12-14 11:48:48 +02:00
Raphael S. Carvalho	a0a7941eb1	test: Add reproducer for split vs intra-node migration race This is a problem caught after removing split from add_sstable_and_update_cache(), which was used by intra node migration when loading new sstables into the destination shard. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:18 -03:00
Raphael S. Carvalho	e3b9abdb30	test: Verify split failure on behalf of repair during critical disk utilization Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:18 -03:00
Raphael S. Carvalho	bc772b791d	test: boost: Add failure_when_adding_new_sstable_test Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:18 -03:00
Raphael S. Carvalho	77a4f95eb8	test: Add reproducer for split vs incremental repair race condition Refs #26041. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 17:01:16 -03:00
Raphael S. Carvalho	992bfb9f63	compaction: Fail split of new sstable if manager is disabled If manager has been disabled due to out of space prevention, it's important to throw an exception rather than silently not splitting the new sstable. Not splitting an sstable when needed can cause a correctness issue when finalizing split later. It's better to fail the writer (e.g. repair one), which will be retried, than to make the caller think everything succeeded. The new replica::table::add_new_sstable_and_update_cache() will now unlink the new sstable on failure, so the table dir will not be left with sstables not loaded. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:51 -03:00
Raphael S. Carvalho	ee3a743dc4	replica: Don't split in do_add_sstable_and_update_cache() Now, only sstable loader on boot and refresh from upload uses this procedure. The idea is that maybe_split_new_sstable() will throw when compaction cannot run due to e.g. out of space prevention. It could fail repair writer, but we don't want it to fail boot. As for refresh from upload, it's not supposed to work when tablet map at the time of backup is not the same when restoring. Even before this, refresh would fail if split already executed, split would only happen if split was still ongoing. We need token range stability for local restore. The safe variant will always be load and stream. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:51 -03:00
Raphael S. Carvalho	48d243f32f	streaming: Leave sstables unsealed until attached to the table We want the invariant that after ACK, all sealed sstables will be split. This guarantees that on restart, no unsplit sstables will be found sealed. The paths that generate unsplit sstables are streaming and file streaming consumers. It includes intra-node streaming, which is local but can clone an unsplit sstable into destination. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:51 -03:00
Raphael S. Carvalho	d9d58780e2	replica: Wire add_new_sstables_and_update_cache() into intra-node streaming Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:51 -03:00
Raphael S. Carvalho	ddb27488fa	replica: Wire add_new_sstable_and_update_cache() into file streaming consumer Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:51 -03:00
Raphael S. Carvalho	10225ee434	replica: Wire add_new_sstable_and_update_cache() into streaming consumer After the wiring, failure to attach the new sstable in the streaming consumer will unlink the sstable automatically. Fixes #27414. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:51 -03:00
Raphael S. Carvalho	a72025bbf6	replica: Document old add_sstable_and_update_cache() variants Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:51 -03:00
Raphael S. Carvalho	3f8363300a	replica: Introduce add_new_sstables_and_update_cache() Piggyback on new add_new_sstable_and_update_cache(), replacing the previous add_sstables_and_update_cache(). Will be used by intra-node migration since we want it to be safe when loading the cloned sstables. An unsplit sstable can be cloned into destination which already ACKed split, so we need this variant which splits sstable if needed, while it's unsealed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	63d1d6c39b	replica: Introduce add_new_sstable_and_update_cache() Failure to load sstable in streaming can leave sealed sstables on disk since they're not unlinked on failure. This can result in several problems: 1) Data resurrection: since the sstable may contain deleted data 2) Split issue: since the finalization requires all sstables to be split 3) Disk usage issue: since the sstables hold space and streaming retries can keep accumulating these files. This new procedure will be later wired into streaming consumers, in order to fix those problems. Another benefit of the interface is that if there's split when adding the new sstable, the output sstables will be returned to the caller, allowing them to register the actual loaded sstables into e.g. the view builder. Refs #27414. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	27d460758f	replica: Account for sstables being added before ACKing split We want the invariant that after ACK, all sealed sstables will be split. If check-and-attach is not atomic, this sequence is possible: 1) no split decision set. 2) Unsplit sstable is checked, no need to split, sealed. 3) split decision is set and ACKed 4) unsplit sstable is attached Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	794e03856a	replica: Remove repair read lock from maybe_split_new_sstable() The lock is intended to serialize some maintenance compactions, such as major, with repair. But maybe_split_new_sstable() is restricted solely to new sstables that aren't part of the sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	2dae0a7380	compaction: Preserve state of input sstable in maybe_split_new_sstable() This is crucial with MVs, since the splitting must preserve the state of the original sstable. We want the sstable to be in staging dir, so it's excluded when calculating the diff for performing pushes to view replicas. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	1fdc410e24	Rename maybe_split_sstable() to maybe_split_new_sstable() Since the function must only be used on new sstables, it should be renamed to something describing that its usage should be restricted. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	1a077a80f1	sstables: Allow storage::snapshot() to leave destination sstable unsealed Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	c5e840e460	sstables: Add option to leave sstable unsealed in the stream sink That will be needed for file streaming to leave output sstable unsealed. We want the invariant where all sealed sstables are split after split was ACKed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	c10486a5e9	test: Verify unsealed sstable can be compacted This is crucial for splitting before sealing the sstable produced by repair. This way, unsplit sstables won't be left on disk sealed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	ab82428228	sstables: Allow unsealed sstable to be loaded File streaming will have to load an unsealed sstable, so we need to be able to parse components from temporary TOC instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Raphael S. Carvalho	b1be4ba2fc	sstables: Restore sstable_writer_config::leave_unsealed This option was retired in commit `0959739216`, but it will be again needed in order to implement split before sealing. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-12-12 16:59:50 -03:00
Emil Maskovsky	5e7456936e	db: schema_applier: improve exception-safe cleanup When applying schema changes, the previous implementation attempted to handle exceptions by catching and rethrowing them, but did so incorrectly: using `throw ex` with a `std::exception_ptr` loses the original exception type and details. The correct approach is to use `std::rethrow_exception()`. However, in this case, explicit exception handling is unnecessary. The only reason for catching was to ensure `ap.destroy()` is called before propagating the exception. This can be more cleanly and safely achieved using Seastar's `.finally()` continuation, which guarantees cleanup regardless of success or failure. This change removes the manual try/catch/rethrow and uses `.finally()` to ensure proper cleanup, letting exceptions propagate naturally and preserving their type and information. Fixes: SCYLLADB-94 Refs: scylladb/scylladb#27501	2025-12-12 18:18:31 +01:00
Emil Maskovsky	e6f5f2537e	directories: fix exception rethrowing Fix location identified by clang-tidy where `std::exception_ptr` was incorrectly rethrown using `throw ep;`. The correct approach is to use `std::rethrow_exception(ep)`, which preserves the original exception type and stack trace. But this can be even further simplified by logging the current exception with `std::current_exception()` and rethrowing using `throw;` instead of capturing and rethrowing a `std::exception_ptr`. This matches the idiomatic pattern used elsewhere in the codebase and improves clarity. This change ensures proper exception propagation and avoids type slicing or loss of diagnostic information.	2025-12-12 18:10:20 +01:00
Emil Maskovsky	76aacc00f2	storage_service: use coroutine-friendly exception propagation in join_node_response_handler Improve exception handling in join_node_response_handler by using `co_await coroutine::return_exception_ptr` to propagate exceptions. This replaces the incorrect direct throw of `std::exception_ptr` and ensures proper coroutine-friendly exception propagation.	2025-12-12 17:59:54 +01:00
Cezar Moise	7c8ab3d3d3	test.py: add pid to ServerInfo Adding pid info to servers allows matching coredumps with servers. Other improvements: - When replacing just some fields of ServerInfo, use `_replace` instead of building a new object. This way it is agnostic to changes to the object. - When building ServerInfo from a list, the types defined for its fields are not enforced, so ServerInfo(*list) works fine and does not need to be changed if fields are added or removed.	2025-12-12 15:11:03 +02:00
Botond Dénes	cb7f2e4953	docs: scylla-sstable-script-api.rst: add introduction and title	2025-12-12 13:50:12 +02:00
Botond Dénes	dd5b6770c8	docs: scylla-sstable.rst: extract script API to separate document The script API is 500+ lines long in an already too long and hard to navigate document. Extract it to a separate document, making both documents shorter and easier to navigate.	2025-12-12 13:44:32 +02:00
Botond Dénes	7e7e378a4b	Merge 'Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"' from null Reverts commit `8192f45e84`. The merge exposed a critical bug where truncate operations during table drop with auto-snapshot fail, causing Raft applier fiber to stop with unhandled exceptions. This leads to schema inconsistencies across nodes and test failures with "Keyspace does not exist" errors. Root Cause Commit `19b6207f` modified `truncate_table_on_all_shards` to set `use_sstable_identifier = true`: ```cpp // Before (working) co_await table::snapshot_on_all_shards(sharded_db, table_shards, name); // After (broken) auto opts = db::snapshot_options{.use_sstable_identifier = true}; co_await table::snapshot_on_all_shards(sharded_db, table_shards, name, opts); ``` This triggers exceptions during snapshot that propagate through Raft state machine, causing: - Raft applier stops: `raft::state_machine_error` at `raft/server.cc:1369` - Schema changes fail to propagate - Nodes report non-existent keyspaces for valid schemas Changes Reverts 15 files (200 deletions, 74 insertions): - Removes `use_sstable_identifier` from truncate/snapshot code paths - Reverts `snapshot_options` struct back to simple `skip_flush` boolean - Removes REST API and nodetool `--use-sstable-identifier` parameter - Removes feature tests from `test/boost/database_test.cc` No backport required - the original feature was merged to master only and never released. 
<!-- START COPILOT ORIGINAL PROMPT --> <details> <summary>Original prompt</summary> ---- This section details on the original issue you should resolve <issue_title>test_table_drop_with_auto_snapshot failed with InvalidRequest</issue_title> <issue_description>Seen in: https://jenkins.scylladb.com/job/scylla-master/job/next/9968//testReport Logs: [download](https://downloads.scylladb.com/unstable/scylla/master/testLogs/2025-12-08T15:05:42Z/) Error message: ``` cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist" ``` Stacktrace: ``` manager = <test.pylib.manager_client.ManagerClient object at 0xffff734e6c10> @pytest.mark.asyncio async def test_table_drop_with_auto_snapshot(manager: ManagerClient): logger.info("Bootstrapping cluster") cfg = { 'auto_snapshot': True } servers = await manager.servers_add(3, config = cfg) cql = manager.get_cql() # Increases the chance of tablet migration concurrent with schema change await inject_error_on(manager, "tablet_allocator_shuffle", servers) for i in range(3): await cql.run_async("DROP KEYSPACE IF EXISTS test;") await cql.run_async("CREATE KEYSPACE IF NOT EXISTS test WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 8 };") await cql.run_async("CREATE TABLE IF NOT EXISTS test.tbl_sample_kv (id int, value text, PRIMARY KEY (id));") > await cql.run_async("INSERT INTO test.tbl_sample_kv (id, value) VALUES (1, 'ala');") E cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace test does not exist" test/cluster/test_tablets2.py:173: InvalidRequest ``` </issue_description> <agent_instructions>this issue was exposed by commit `8192f45e84`, please send a pull request reverting that merge commit and mark it as fixing this github issue.</agent_instructions> <comments> <comment_new><author>@yaronkaikov</author><body> @denesb is this something in your team area? 
if not, please feel free to delegate it or un-assign yourself :-)</body></comment_new> <comment_new><author>@nyh</author><body> This is very strange. Clearly the keyspace `test` does exist at this point, because we created it two lines above and also we ran `CREATE TABLE .. test.tbl_sample_kv` which would have failed if the keyspace `test` didn't exist - so it must exist, no? In the past, we had a bug where running `CREATE KEYSPACE IF NOT EXISTS` forgot to set the "schema modified" event in the response so it failed to wait for schema agreement, but 1. we fixed this bug (https://github.com/scylladb/scylladb/pull/18819 by @nuivall ) and 2. this bug didn't happen in this case, where CREATE TABLE indeed had work to do. But I just realized something... Our fix in https://github.com/scylladb/scylladb/pull/18819 only applies to CREATE KEYSPACE / TABLE / VIEW / TYPE statements. It wasn't applied to `DROP KEYSPACE` - and it should have been.... But I don't have a good theory how a bug like https://github.com/scylladb/scylladb/pull/18819 can explain this specific test failure. Different schema operations are already linearized, so if a `CREATE TABLE test.tbl_sample_kv` succeeded, I don't see how there could possibly be any earlier `DROP KEYSPACE test` that suddenly springs to life. Unless we have a serious bug in our raft-based schema operations.</body></comment_new> <comment_new><author>@nyh</author><body> Another bug we could have in theory is that the Python driver's async `cql.run_async` might have a bug where it is not waiting for the schema agreement despite being told to wait. If it doesn't wait for schema agreement, this can easily explain this bug: 1. the CREATE KEYSPACE, CREATE TABLE both are sent to node A, but 2. the last INSERT INTO is sent to node B which is not yet aware of this new keyspace and table, and fails. Copilot claims that execute_async() does have this bug! 
> For schema-altering statements, schema agreement (meaning all nodes agree on the new schema) is important before running follow-up operations, but this is enforced only by synchronous helpers like Session.execute(), not the asynchronous version. > If you use execute_async() for schema operations, you are responsible for checking schema agreement yourself, using [Session.check_schema_agreement()](https://docs.datastax.com/en/developer/python-driver/latest/api/cassandra/cluster/#cassandra.cluster.Session.check_schema_agreement) or (in newer code) ResponseFuture.check_schema_agreement. > According to [a discussion on the DataStax support forum](https://support.datastax.com/s/article/Does-the-Python-Driver-for-Cassandra-Wait-for-Schema-Agreement-after-a-Schema-Change?language=en_US) and the [driver’s source code](`7f12a5e1c6/cassandra/cluster.py (L487)`), schema agreement is not ch... </details> <!-- START COPILOT CODING AGENT SUFFIX --> - Fixes scylladb/scylladb#27501 <!-- START COPILOT CODING AGENT TIPS --> --- Closes scylladb/scylladb#27604 * github.com:scylladb/scylladb: Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy" Initial plan	2025-12-12 13:20:49 +02:00
Botond Dénes	3d73a9781e	docs: scylla-sstable: prepare for script API extract We are about to extract the script API to a separate document. In preparation convert soon-to-be cross-document references, so they keep working after the extraction.	2025-12-12 13:15:48 +02:00
Piotr Smaron	982339e73f	doc: audit: set audit as enabled by default	2025-12-12 09:18:54 +01:00
Piotr Smaron	d1a04b3913	Reapply "audit: enable some subset of auditing by default" This reverts commit a5edbc7d612df237a1dd9d46fd5cecf251ccfd13. Fixes: https://github.com/scylladb/scylladb/issues/26020	2025-12-12 09:18:54 +01:00
copilot-swe-agent[bot]	77ee7f3417	Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy" This reverts commit `8192f45e84`. The merge exposed a bug where truncate (via drop) fails and causes Raft errors, leading to schema inconsistencies across nodes. This results in test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist' errors. The specific problematic change was in commit `19b6207f` which modified truncate_table_on_all_shards to set use_sstable_identifier = true. This causes exceptions during truncate that are not properly handled, leading to Raft applier fiber stopping and nodes losing schema synchronization.	2025-12-12 03:55:13 +00:00
copilot-swe-agent[bot]	0ff89a58be	Initial plan	2025-12-12 03:48:12 +00:00
Yaron Kaikov	f7ffa395a8	workflows: trigger CI automatically when conflicts label is removed Add pull_request_target event with unlabeled type to trigger-scylla-ci workflow. This allows automatic CI triggering when the 'conflicts' label is removed from a PR, in addition to the existing manual trigger via comment. The workflow now runs when: - A user posts a comment with '@scylladbbot trigger-ci' (existing) - The 'conflicts' label is removed from a PR (new) Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-84 Closes scylladb/scylladb#27521	2025-12-11 16:48:06 +02:00
Piotr Smaron	3fa3b920de	Update CODEOWNERS to remove redundant entries Removing myself as I have no maintainer's permissions to review the code Closes scylladb/scylladb#27576	2025-12-11 16:47:08 +02:00
Botond Dénes	e7ca52ee79	Merge 'api: storage_service/tablets/repair: disable incremental repair by default' from Benny Halevy Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414 Backport to 2025.4 where `611918056a` was introduced Closes scylladb/scylladb#27530 * github.com:scylladb/scylladb: api: storage_service/tablets/repair: disable incremental repair by default docs: nodetool-commands: cluster: repair: fix incremental-mode example	2025-12-11 15:23:09 +02:00
Botond Dénes	730eca5dac	Merge 'Remove noexcept from storage_group and table functions to allow exception propagation' from null Fixed a critical bug where `storage_group::for_each_compaction_group()` was incorrectly marked `noexcept`, causing `std::terminate` when actions threw exceptions (e.g., `utils::memory_limit_reached` during memory-constrained reader creation). Changes made: 1. Removed `noexcept` from `storage_group::for_each_compaction_group()` declaration and implementation 2. Removed `noexcept` from `storage_group::compaction_groups()` overloads (they call for_each_compaction_group) 3. Removed `noexcept` from `storage_group::live_disk_space_used()` and `memtable_count()` (they call compaction_groups()) 4. Kept `noexcept` on `storage_group::flush()` - it's a coroutine that automatically captures exceptions and returns them as exceptional futures 5. Removed `noexcept` from `table_load_stats()` functions in base class, table, and storage group managers Rationale: As noted by reviewers, there's no reason to kill the server if these functions throw. For coroutines returning futures, `noexcept` is appropriate because Seastar automatically captures exceptions and returns them as exceptional futures. For other functions, proper exception handling allows the system to recover gracefully instead of terminating. Fixes #27475 Closes scylladb/scylladb#27476 * github.com:scylladb/scylladb: replica: Remove unnecessary noexcept replica: Remove noexcept from compaction_groups() functions replica: Remove noexcept from storage_group::for_each_compaction_group	2025-12-11 15:17:35 +02:00
Benny Halevy	c8cff94a5a	api: storage_service/tablets/repair: disable incremental repair by default Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-11 14:25:21 +02:00
Benny Halevy	5fae4cdf80	docs: nodetool-commands: cluster: repair: fix incremental-mode example There is no 'regular' incremental mode anymore. The example seems to have meant 'disabled'. Fixes #27587 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-11 14:25:11 +02:00
Marcin Maliszkiewicz	8bbcaacba1	auth: always catch by const reference This is best practice. Closes scylladb/scylladb#27525	2025-12-11 12:42:30 +01:00
Yaron Kaikov	3dfa5ebd7f	Add JIRA issue validation to backport PR fixes check Extend the Fixes validation pattern to also accept JIRA issue references (format: [A-Z]+-\d+) in addition to GitHub issue references. This allows backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'. Fixes: https://github.com/scylladb/scylladb/issues/27571 Closes scylladb/scylladb#27572	2025-12-11 12:23:16 +02:00
Avi Kivity	24264e24bb	Revert "repair: Add tablet repair progress report support" This reverts commit `faad0167d7`. It causes a regression in test_two_tablets_concurrent_repair_and_migration_repair_writer_level in debug mode (with ~5%-10% probability). Fixes #27510. Closes scylladb/scylladb#27560	2025-12-11 12:18:11 +02:00
Nadav Har'El	0c64e3be9a	Merge 'Unify and fix rjson string and string_view conversions' from Marcin Maliszkiewicz This patch-set consolidates and corrects rjson string conversion handling. It removes unnecessary string copies, ensures proper length usage and replaces ad-hoc conversions with consistent helper functions. Overall, the changes make rjson string handling safer, faster, and more uniform across the codebase. Backport: no, it's a refactor Closes scylladb/scylladb#27394 * github.com:scylladb/scylladb: fix rjson::value to bytes conversion with missing GetStringLength call alternator: change type from string to string_view in should_add_capacity fix rjson::value to string_view conversion with missing GetStringLength call use rjson::to_string_view when rjson::value gets converted using GetStringLength use rjson::to_sstring and rjson::to_string for various string conversions utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds utils: move rjson::to_string_view func to string related place utils: add to_sstring and to_string rjson helper	2025-12-11 12:05:41 +02:00
Nadav Har'El	b3b0860e7c	test/alternator: add reproducer for bug with storing invalid values This patch adds a reproducer for a long-known bug, #8070, where Alternator can store invalid values which are just blindly stored as JSON, and we will only see the failure when reading the item back - and either the client will fail to parse it, or sometimes even Alternator's own code (e.g., FilterExpression) will fail to parse it. The right behavior is to fail the write - not the read. The included test checks writing different kinds of invalid values using PutItem, UpdateItem, and BatchWriteItem. The new tests pass on DynamoDB, but fail on Alternator so marked as "xfail". Refs #8070. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-11 11:58:22 +02:00
Nadav Har'El	db15c212a6	test/alternator: reproducer for issue 27375 This patch adds a reproducer for issue #27375, where even with alternator_streams_increased_compatibility set to true, if an attribute is set to the same value it had but using a different JSON representation - an Alternator Streams event is unduly produced. For example, if a map {'dog': 1, 'cat': 2} is changed to {'cat': 2, 'dog': 1}, this non-change should not be reported. The new test added in this patch passes on DynamoDB (an event is not generated) but fails on Alternator (an event is generated), so the new test is marked with xfail. Refs #27375. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-11 11:34:19 +02:00
Nadav Har'El	3595941020	utils/rjson: fix error messages from rjson::parse() rjson::parse() when parsing JSON stored in a chunked_content (a vector of temporary buffers) failed to initialize its byte counter to 0, resulting in garbage positions in error messages like: Parsing JSON failed: Missing a name for object member. at 1452254 These error messages were most noticeable in Alternator, which parses JSON requests using a chunked_content, and reports these errors back to the user. The fix is trivial: add the missing initialization of the counter. The patch also adds a regression test for this bug - it sends JSON that is corrupt at position 1, and expects to see "at 1" and not some large random number. Fixes #27372 Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-11 11:17:01 +02:00
Nadav Har'El	102516a787	test/alternator: extract get_signed_request() to util.py get_signed_request() started in test_manual_requests.py as a way to sign a manually-created DynamoDB-API request - for sending requests that boto3 can't. Over time, we started to use this function in additional test files, and it's about time to move it to util.py - which is more natural to import from multiple files. This patch also adds a new function, manual_request(), which combines get_signed_request() and actually sending the request via requests.post(). New tests should prefer it, because it's easier to use. We'll use the new function in tests that we add in the next patches. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-12-11 11:16:42 +02:00
Marcin Maliszkiewicz	d5b63df46e	transport: remove redundant futurize_invoke from counted data sink and source Closes scylladb/scylladb#27526	2025-12-11 10:32:16 +03:00
Dario Mirovic	f545ed37bc	test: dtest: audit_test.py: fix audit error log detection `test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py` has an insert statement that is expected to fail. Dtest environment uses `FlakyRetryPolicy`, which has `max_retries = 5`. 1 initial fail and 5 retry fails means we expect 6 error audit logs. The test failed because `create keyspace ks` failed once, then succeeded on retry. It allowed the test to proceed properly, but the last part of the test that expects exactly 6 failed queries actually had 7. The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed queries, counting only the query expected to fail. If other queries fail with successful retry, it's fine. If other queries fail without successful retry, the test will fail, as it should in such situations. They are not related to this expected failed insert statement. Fixes #27322 Closes scylladb/scylladb#27378	2025-12-11 10:17:07 +03:00
Benny Halevy	5f13880a91	utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout When waiting for the condition variable times out we call on_internal_error, but unfortunately, the backtrace it generates is obfuscated by `coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`. To make the log more useful, print the error injection name and the caller's source_location in the timeout error message. Fixes #27531 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27532	2025-12-10 23:25:54 +01:00
Calle Wilund	8c4ac457af	test::cluster::dtest::tools::files: Remove file This contained only one routine; `corrupt_file`, which is highly problematic, and not used. If you want to "corrupt" a file, it should be done controlled, not at random.	2025-12-10 15:37:04 +01:00
Calle Wilund	e48170ca8e	commitlog_replay: Handle fully corrupt files same as partial corruption. Fixes #26744 If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category. However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up. The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed the test due to scylla not starting up, instead of losing data as expected. Changed test to consistently cause this exact error instead.	2025-12-10 15:37:04 +01:00
Andrzej Jackowski	11ad32c85e	test/boost: add too_long_password to auth_passwords_test The test documents the current behavior of hashing algorithms that fail if the passphrase has 512 bytes or more. Moreover, it documents the behavior of the current bcrypt implementation that compares only the first 72 bytes of the password. Although we don't typically use bcrypt for password hashing, it is possible to insert such a hash using `CREATE ROLE ... WITH HASHED PASSWORD ...`. Refs: scylladb/scylladb#26842	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	4c8c9cd548	test/boost: add same_hashes_as_crypt_r to auth_passwords_test The test verifies that the old and new implementation of SHA-512 hashing returns exactly the same values. Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	98f431dd81	auth: utils: add yielding to crypt_sha512 This change allows yielding during hashing computations to prevent stalls. The performance of this solution was compared with the previous implementation that used one alien thread and the implementation after the alien thread was reverted. The results (median) of `perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters are as follows: - Alien thread: 41.5 new connections/s per shard - Reverted alien thread: 244.1 new connections/s per shard - This commit (yielding in hashing): 198.4 new connections/s per shard The alien thread is limited by a single-core hashing throughput, which is roughly 400-500 hashes/s in the test environment. Therefore, with smp=10, the throughput is below 50 hashes/s, and the difference between the alien thread and other solutions further increases with higher smp. The roughly 20% performance deterioration compared to the old implementation without the alien thread comes from the fact that the new hashing algorithm implemented in `utils/crypt_sha512.cc` performs an expensive self-verification and stack cleanup. On the other hand, with smp=10 the current implementation achieves roughly 5x higher throughput than the alien thread. In addition, due to yielding added in this commit, the algorithm is expected to provide similar protection from stalls as the alien thread did. 
In a test that in parallel started a cassandra-stress workload and created thousands of new connections using python-driver, the values of `scylla_reactor_stalls_count` metric were as follows: - Alien thread: 109 stalls/shard total - Reverted alien thread: 13186 stalls/shard total - This commit (yielding in hashing): 149 stalls/shard total Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms` values were: - Alien thread: 1087 ms/shard total - Reverted alien thread: 72839 ms/shard total - This commit (yielding in hashing): 1623 ms/shard total To summarize, yielding during hashing computations achieves similar throughput to the old solution without the alien thread but also prevents stalls similarly to the alien thread. Fixes: scylladb/scylladb#26859 Refs: scylladb/scylla-enterprise#5711	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	4ffdb0721f	auth: change return type of passwords::check to future Introduce a new `passwords::hash_with_salt_async` and change the return type of `passwords::check` to `future<bool>`. This enables yielding during password computations later in this patch series. The old method, `hash_with_salt`, is marked as deprecated because new code should use the new `hash_with_salt_async` function. We are not removing `hash_with_salt` now to reduce the regression risk of changing the hashing implementation—at least the methods that change persistent hashes (CREATE, ALTER) will continue to use the old hashing method. However, in the future, `hash_with_salt` should be entirely removed. Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	775906d749	auth: remove code duplication in verify_scheme Refactoring: create a new function `verify_hashing_output` to reuse code in `hash_with_salt` and `verify_scheme`. The change is introduced to facilitate verification of hashing output when the implementation is extended later in this patch series. Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	11eca621b0	test/boost: coroutinize auth_passwords_test This commit prepares `auth_passwords_test` for using coroutines, because later in this patch series `auth::passwords::check` and other similar functions will return Seastar futures. Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	d7818b56df	utils: coroutinize crypt_sha512 Change `sha512crypt` and `__crypt_sha512` to coroutines to allow yielding during hash computations later in this patch series. Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	033fed5734	utils: make crypt_sha512.cc to compile The purpose of this change is to allow the usage of Seastar futures in crypt_sha512 later in this patch series. Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	c6c30b7d0a	utils: license: import crypt_sha512.c from musl to the project This patch imports the `crypt_sha512.c` file from the musl library. We need it to incorporate yielding in the `crypt_r` function to avoid reactor stalls during long hashing computations. Before this patch series, ScyllaDB used SHA-512 hashing provided by the `crypt_r` function, which in our case meant using the implementation from the `libxcrypt` library. Adding yielding to this `libxcrypt` implementation is problematic, both due to licensing (LGPL) and because the implementation is split into many functions across multiple files. In contrast, the SHA-512 implementation from `musl libc` has a more permissive license and is concise, which makes it easier to incorporate into the ScyllaDB codebase. Both `crypt_sha512.c` and musl license are obtained from git.musl-libc.org: - https://git.musl-libc.org/cgit/musl/tree/src/crypt/crypt_sha512.c - https://git.musl-libc.org/cgit/musl/tree/COPYRIGHT Import commit: commit 1b76ff0767d01df72f692806ee5adee13c67ef88 Author: Alex Rønne Petersen <alex@alexrp.com> Date: Sun Oct 12 05:35:19 2025 +0200 s390x: shuffle register usage in __tls_get_offset to avoid r0 as address Refs: scylladb/scylladb#26859	2025-12-10 15:36:18 +01:00
Andrzej Jackowski	5afcec4a3d	Revert "auth: move passwords::check call to alien thread" The alien thread was a solution for reactor stalls caused by indivisible password‑hashing tasks (scylladb/scylladb#24524). However, because there is only one alien thread, overall hashing throughput was reduced (see, e.g., scylladb/scylla-enterprise#5711). To address this, the alien‑thread solution is reverted, and a hashing implementation with yielding will be introduced later in this patch series. This reverts commit `9574513ec1`.	2025-12-10 15:36:09 +01:00
Calle Wilund	9b5f3d12a3	test::pylib::suite::base: Split options.name test specifier only once For some arcane reason, we split the optional test pattern given to test.py twice across '::' to get the file + case specifiers later given to pytest etc. This means that for a test with a class group (such as some migrated dtests), we cannot really specify the exact test to run (pattern <file>::<class>::test). Simply splitting only on the first '::' fixes this. Should not affect any other tests.	2025-12-10 15:35:12 +01:00
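The single-split behaviour described in the entry above can be sketched in Python as follows. This is a minimal illustration, not the actual test.py code: `split_test_spec` and the example patterns are hypothetical names chosen for the sketch.

```python
from typing import Optional, Tuple


def split_test_spec(pattern: str) -> Tuple[str, Optional[str]]:
    """Split a test pattern into (file, case) on the first '::' only."""
    # partition() stops at the first separator, so any further '::' stays
    # inside the case specifier: '<file>::<class>::<test>' keeps the class
    # and test parts together.
    file_part, sep, case_part = pattern.partition("::")
    return file_part, (case_part if sep else None)


with_class = split_test_spec("test_audit::CQLAuditTester::test_x")
file_only = split_test_spec("test_audit")
```

Splitting on every '::' would instead yield three parts for the first pattern, which is why a class-grouped test could not be addressed exactly.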
Tomasz Grabiec	0e51a1f812	replica: Remove unnecessary noexcept Can potentially lead to unnecessary abort. compaction_groups() and for_each_compaction_group() can throw. Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>	2025-12-10 14:51:35 +01:00
Tomasz Grabiec	8b807b299e	replica: Remove noexcept from compaction_groups() functions They can throw during merge, when the number of compaction groups is higher than 3. Callers can deal with that, so we shouldn't abort.	2025-12-10 14:48:23 +01:00
Tomasz Grabiec	07ff659849	replica: Remove noexcept from storage_group::for_each_compaction_group They don't really have to be noexcept. And "action" may actually throw, leading to abort. It was observed to throw when creating memtable readers: terminate called after throwing an instance of 'utils::memory_limit_reached' what(): kill limit triggered on semaphore sl:users by permit xxx Aborting on shard 4, in scheduling group sl:users. std::terminate() at ??:0 __clang_call_terminate at main.cc:0 replica::storage_group::for_each_compaction_group(std::function<void (seastar::lw_shared_ptr<replica::compaction_group> const&)>) const at ./replica/table.cc:920 (inlined by) replica::table::add_memtables_to_reader_list(std::vector<mutation_reader, std::allocator<mutation_reader>>&, seastar::lw_shared_ptr<schema const> const&, reader_permit const&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr const&, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>, std::function<void (unsigned long)>) const at ./replica/table.cc:196 (inlined by) replica::table::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:243 (inlined by) replica::table::as_mutation_source() const::$_0::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:3673 (inlined by) mutation_reader std::__invoke_impl<mutation_reader, replica::table::as_mutation_source() const::$_0&, 
seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(std::__invoke_other, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61 (inlined by) std::enable_if<is_invocable_r_v<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>, mutation_reader>::type std::__invoke_r<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:114 (inlined by) std::_Function_handler<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, 
interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>), replica::table::as_mutation_source() const::$_0>::_M_invoke(std::_Any_data const&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290 (inlined by) std::function<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>)>::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591 (inlined by) mutation_source::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ././readers/mutation_source.hh:143 query::querier_base::querier_base(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, mutation_source const&, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:91 (inlined by) 
query::querier::querier(mutation_source const&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:164 (inlined by) replica::table::query(seastar::lw_shared_ptr<schema const>, reader_permit, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, query::result_memory_limiter&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::optional<query::querier>) at ./replica/table.cc:3583 replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0::operator()(reader_permit) const at ./replica/database.cc:1533 (inlined by) seastar::noncopyable_function<seastar::future<void> (reader_permit)>::indirect_vtable_for<replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0>::call(seastar::noncopyable_function<seastar::future<void> (reader_permit)> const, reader_permit) (.llvm.13537529942037499926) at ././seastar/include/seastar/util/noncopyable_function.hh:158 
seastar::noncopyable_function<seastar::future<void> (reader_permit)>::operator()(reader_permit) const at ././seastar/include/seastar/util/noncopyable_function.hh:215 (inlined by) reader_concurrency_semaphore::execution_loop() (.resume) at ./reader_concurrency_semaphore.cc:980 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ./build/release/seastar/./seastar/include/seastar/core/coroutine.hh:122 (inlined by) seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2627 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3099 seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3267 seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0::operator()() const at ./build/release/seastar/./seastar/src/core/reactor.cc:4591 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>, void>::type std::__invoke_r<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111 (inlined by) std::_Function_handler<void (), 
seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290 std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591 Fixes #27475 Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>	2025-12-10 14:48:11 +01:00
Yaron Kaikov	d3e199984e	auto-backport.py: modify instruction for making PR ready for review Update the comment sent when a PR has conflicts with clear instructions on how to make the PR ready for review. Fixes: https://scylladb.atlassian.net/browse/RELENG-152 Closes scylladb/scylladb#27547	2025-12-10 14:53:38 +02:00
Pavel Emelyanov	a63508a6f7	object_storage: Temporarily handle pure endpoint addresses as endpoints To keep backward compatibility, support - old configs -- where endpoint is just an address and port is separate. When it happens, format the "new" endpoint name - lookup by address-only. If it happens, scan all endpoints and see if any one matches the provided address Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-10 15:33:47 +03:00
Pavel Emelyanov	ca4564e41c	code: Remove dangling mentions of s3::endpoint_config Collect some code that's unused after previous changes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-10 15:33:47 +03:00
Pavel Emelyanov	f93cafac51	docs: Update docs according to new endpoints config option format Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-10 15:33:47 +03:00
Pavel Emelyanov	a3ca4fccef	object_storage: Create s3 client with "extended" endpoint name For this, add the s3::client::make(endpoint, ...) overload that accepts endpoint in proto://host:port format. Then it parses the provided url and calls the legacy one, that accepts raw host string and config with port, https bit, etc. The generic object_storage_endpoint_param no longer needs to carry the internal s3::endpoint_config, the config option parsing changes respectively. Tests, that generate the config files, and docs are updated. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-10 15:33:47 +03:00
Pavel Emelyanov	5953a89822	test: Add named constants for test_get_object_store_endpoints endpoint names Instead of hard-coded 'a' and 'b' here and there Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-10 15:33:46 +03:00
Pavel Emelyanov	932b008107	s3/storage: Tune config updating Don't prepare s3::endpoint_config from generic code, just pass the region and iam_role_arn (those that can potentially change) to the callback. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-10 15:33:46 +03:00
Pavel Emelyanov	e47f0c6284	sstable: Shuffle args for s3_client_wrapper Make it construct like gs_client_wrapper -- with generic endpoint param reference and make the storage-specific casts/gets/whatever internally. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-10 15:33:46 +03:00
Nadav Har'El	8822c23ad4	Merge 'test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions thr…' from Dario Mirovic …eshold The initial problem: Some of the tests in test_protocol_exceptions.py started failing. The failure is on the condition that no more than `cpp_exception_threshold` happened. Test logic: These tests assert that specific code paths do not throw an exception anymore. Initial implementation ran a code path once, and asserted there were 0 exceptions. Sometimes an exception or several can occur, not directly related to the code paths the tests check, but those would fail the tests. The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values. Note that the exceptions are counted per shard, not per code path. The new problem: The occasional exceptions thrown by some parts of the server now throw a bit more than before. Based on the logs linked on the issues, it is usually 12. There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter exception logs only for wanted exceptions. However, if a new, different exception is introduced, it might not be counted. Another approach is to just increase the threshold a bit. The issue of throwing more exceptions than before in some other server modules should be addressed by a set of tests for that module, just like these tests check protocol exceptions, not caring who used protocol check code paths. For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. 
It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run). Still, to make "background exceptions" occurrence a bit more normalized, `run_count` too is doubled, from `100` to `200`. At first glance this looks like nothing is changed, but actually doubling both run count and exception threshold here implies that the burst does not scale as much as run count, it is just that the "jitter" is bigger than the old threshold. Also, this patch series enables debug logging for `exception` logger. This will allow us to inspect which exceptions happened if a protocol exceptions test fails again. Fixes #27247 Fixes #27325 Issue observed on master and branch-2025.4. The tests, in the same form, exist on master, branch-2025.4, branch-2025.3, branch-2025.2, and branch-2025.1. Code change is simple, and no issue is expected with backport automation. Thus, backports for all the aforementioned versions are requested. Closes scylladb/scylladb#27412 * github.com:scylladb/scylladb: test: cqlpy: test_protocol_exceptions.py: enable debug exception logging test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold	2025-12-10 10:53:30 +02:00
Marcin Maliszkiewicz	be9992cfb3	fix rjson::value to bytes conversion with missing GetStringLength call	2025-12-09 19:27:22 +01:00
Marcin Maliszkiewicz	daf00a7f24	alternator: change type from string to string_view in should_add_capacity It avoids allocation.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	62962f33bb	fix rjson::value to string_view conversion with missing GetStringLength call In some cases we unnecessarily convert to string which causes a copy. In others we convert without calling GetStringLength, which causes an iteration to determine a length that is already known. In some cases we even do both. This commit fixes that.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	060c2f7c0d	use rjson::to_string_view when rjson::value gets converted using GetStringLength This commit is only cosmetics, changes calls to GetStringLength into rjson::to_string_view with the same underlying implementation.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	64149b57c3	use rjson::to_sstring and rjson::to_string for various string conversions In some cases we omit size checking, which is wrong because, according to the rapid json documentation, strings may contain a \0 byte in the middle.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	4b004fcdfc	utils: use rjson document wrapper in instance_profile_credentials_provider::parse_creds So that we can use our common utility functions.	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	5e38b3071b	utils: move rjson::to_string_view func to string related place	2025-12-09 19:27:21 +01:00
Marcin Maliszkiewicz	225b3351fc	utils: add to_sstring and to_string rjson helper So that conversion code is common and it's easier to avoid accidental type conversions. Additionally according to rapid json library size must be checked explicitly, this also avoids extra iteration in char* to (s)string conversion.	2025-12-09 19:27:21 +01:00
Avi Kivity	80c6718ea8	build: update toolchain to Fedora 43 with clang 21.1.6 Rebase to Fedora 43 with clang 21.1 and libstdc++ 15. Fedora container image registry moved to registry.fedoraproject.org as it seems to be updated more regularly. Added python3-devel to the dependencies as some packages scylla-cqlsh depends on aren't yet available in the form of wheels for Python 3.14, and so have to be built locally. In any case it's better to reduce dependency on those wheels even if the ones currently missing appear eventually. Added libev-devel to the dependencies so that the python driver builds correctly even if "wheels" are not published. This reduces our dependency on the python driver's binary release schedule. Without libev-devel, TLS does not work correctly. We no longer remove the clang and clang-libs packages. Doxygen started depending on clang-libs, and removing them removes doxygen, breaking the build when it looks for that. The build will still pick up the optimized clang, since /usr/local/bin is earlier in the path. We keep the clang package, since it allows us to mess a little less with the directory structure. 
Optimized clang binaries generates and stored in https://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-21.1.6-Fedora-43-x86_64.tar.gz With ./scripts/refresh-pgo-profiles.sh, the new compiler shows a small performance improvement (instructions_per_op) in perf-simple-query: clang 21: 259353.60 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35720 insns/op, 17427 cycles/op, 0 errors) 265940.08 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35725 insns/op, 17042 cycles/op, 0 errors) 262650.01 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35720 insns/op, 17240 cycles/op, 0 errors) 262881.22 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35675 insns/op, 17222 cycles/op, 0 errors) 264898.68 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35732 insns/op, 17070 cycles/op, 0 errors) throughput: mean= 263144.72 standard-deviation=2528.69 median= 262881.22 median-absolute-deviation=1753.96 maximum=265940.08 minimum=259353.60 instructions_per_op: mean= 35714.47 standard-deviation=22.34 median= 35720.38 median-absolute-deviation=10.20 maximum=35732.14 minimum=35675.50 cpu_cycles_per_op: mean= 17200.12 standard-deviation=154.62 median= 17221.70 median-absolute-deviation=129.77 maximum=17427.33 minimum=17041.57 clang 20: 254431.39 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35883 insns/op, 17708 cycles/op, 0 errors) 259701.02 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35883 insns/op, 17351 cycles/op, 0 errors) 261166.92 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35912 insns/op, 17270 cycles/op, 0 errors) 260656.31 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35869 insns/op, 17289 cycles/op, 0 errors) 259628.13 tps ( 64.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 35946 insns/op, 17370 cycles/op, 0 errors) throughput: mean= 259116.75 standard-deviation=2698.56 median= 259701.02 median-absolute-deviation=1539.55 maximum=261166.92 
minimum=254431.39 instructions_per_op: mean= 35898.42 standard-deviation=30.69 median= 35882.97 median-absolute-deviation=15.90 maximum=35945.63 minimum=35869.02 cpu_cycles_per_op: mean= 17397.49 standard-deviation=178.35 median= 17351.35 median-absolute-deviation=108.79 maximum=17707.63 minimum=17269.68 Closes scylladb/scylladb#26773	2025-12-09 15:16:31 +02:00
Pavel Emelyanov	855b91ec20	scripts: Make PR merging check more granular Currently we have 3 explicit checks, and some of them are configurable: - Jenkins job being stable. Can be disabled with --force - Whether submodule update is happening. It's not allowed by default, and should be enabled with --allow-submodule option - Target branch checking (recently merged #27249). Happens unconditionally This PR unifies all checks in two ways. First, each restriction can be lifted with --allow-foo options. The existing --allow-submodule stays and two options are added: - --allow-unstable to skip jenkins job check (like --force works now) - --allow-any-branch to skip target branch check Second, the --force option lifts all the known restrictions. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27294	2025-12-09 13:58:21 +02:00
Nadav Har'El	95e303faf3	Merge 'Refactor get_view_natural_endpoint' from Wojciech Mitros With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack. In this series we remove no longer needed code and reorganize the needed code for better clarity. After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines. Fixes https://github.com/scylladb/scylladb/issues/26313 Closes scylladb/scylladb#27383 * github.com:scylladb/scylladb: mv: replace the simple/complex rack-aware pairing with exact rack matching mv: split out vnode pairing code from get_view_natural_endpoint mv: unify self-pairing and rack-aware pairing into one bool mv: remove the workaround for left nodes when sending view updates	2025-12-09 13:19:13 +02:00
Nadav Har'El	8ba595e472	Merge 'alternator: fix batch writes during intranode tablet migrations' from Petr Gusev Scylla implements `LWT` in the `storage_proxy::cas` method. This method expects to be called on a specific shard, represented by the `cas_shard` parameter. Clients must create this object before calling `storage_proxy::cas`, check its `this_shard()` method, and jump to `cas_shard.shard()` if it returns false. The nuance is that by the time the request reaches the destination shard, the tablet may have already advanced in its migration state machine. For example, a client may acquire a `cas_shard` at the `streaming` tablet state, then submit a request to another shard via `smp::submit_to(cas_shard.shard())`. However, the new `cas_shard` created on that other shard might already be in the `write_both_read_new` state, and its `cas_shard.shard()` would not be equal to `this_shard_id()`. Such a broken invariant results in an `on_internal_error` in `storage_proxy::cas`. Clients of `storage_proxy::cas` are expected to check `cas_shard.this_shard()` and recursively jump to another shard if it returns false. Most calls to `storage_proxy::cas` already implement this logic. The only exception is `executor::do_batch_write`, which currently checks `cas_shard.this_shard()` only once. This can break the invariant if the tablet state changes more than once during the operation. This PR fixes the issue by implementing recursive `cas_shard.this_shard()` checks in `executor::do_batch_write`. It also adds a test that reproduces the problem. 
Fixes: scylladb/scylladb#27353 backport: needs to be backported to 2025.4 Closes scylladb/scylladb#27396 * github.com:scylladb/scylladb: alternator/executor.cc: eliminate redundant dk copy alternator/executor.cc: release cas_shard on the original shard alternator/executor.cc: move shard check into cas_write alternator/executor.cc: make cas_write a private method alternator/executor.cc: make do_batch_write a private method alternator/executor.cc: fix indent test_alternator: add test_alternator_invalid_shard_for_lwt	2025-12-09 11:25:15 +02:00
Petr Gusev	608eee0357	alternator/executor.cc: eliminate redundant dk copy A small refactoring/optimization.	2025-12-09 10:21:06 +01:00
Petr Gusev	0bcc2977bb	alternator/executor.cc: release cas_shard on the original shard Before this series, we kept the cas_shard on the original shard to guard against tablet movements running in parallel with storage_proxy::cas. The bug addressed by this PR shows that this approach is flawed: keeping the cas_shard on the original shard does not guarantee that a new cas_shard acquired on the target shard won’t require another jump. We fixed this in the previous commit by checking cas_shard.this_shard() on the target shard and continuing to jump to another shard if necessary. Once cas_shard.this_shard() on the target shard returns true, the storage_proxy::cas invariants are satisfied, and no other cas_shard instances need to remain alive except the one passed into storage_proxy::cas.	2025-12-09 10:21:06 +01:00
Petr Gusev	3a865fe991	alternator/executor.cc: move shard check into cas_write This change ensures that if cas_shard points to a different shard, the executor will continue issuing shard jumps until cas_shard.this_shard() returns true. The commit simply moves the this_shard() check from the parallel_for_each lambda into cas_write, with minimal functional changes. We enable test_alternator_invalid_shard_for_lwt since now it should pass. Fixes scylladb/scylladb#27353	2025-12-09 10:21:01 +01:00
Pavel Emelyanov	fb32e1c7ee	Merge 'streaming: tablet_sstable_streamer::stream refactoring' from Ernest Zaslavsky Refactor the way we decide whether an sstable belongs to a tablet, fully or partially, to simplify the flow and make it more readable. Also extract the logic, make it testable, and add tests to cover the changes. The change is purely aesthetic, no need to backport. Closes scylladb/scylladb#27101 * github.com:scylladb/scylladb: streaming: remove unnecessary lambda creating sstable token range streaming: simplify get_sstables_for_tablets logic streaming: switch to range-based for loop streaming: drop sstable skip microoptimization in tablet loop streaming: replace reverse iterators with reverse view in sstables scan streaming: return from get_sstables_for_tablets earlier streaming: add get_sstables_by_tablet_range tests test,sstables: add helper to set sstable first and last keys streaming: refactor get_sstables_for_tablets to make it accessible streaming: refactor get_sstables_for_tablets to make it testable streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic	2025-12-09 10:53:57 +03:00
Patryk Jędrzejczak	b6895f0fa7	test: make test_broken_bootstrap faster This change makes the test ~20 s faster. It's a forgotten follow-up: https://github.com/scylladb/scylladb/pull/18927#discussion_r1627331946 Closes scylladb/scylladb#27445	2025-12-09 09:25:42 +02:00
Dario Mirovic	c30b326033	test: cqlpy: test_protocol_exceptions.py: enable debug exception logging Enable debug logging for "exception" logger inside protocol exception tests. The exceptions will be logged, and it will be possible to see which ones occurred if a protocol exceptions test fails. Refs #27272 Refs #27325	2025-12-09 01:35:42 +01:00
Dario Mirovic	807fc68dc5	test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold The initial problem: Some of the tests in test_protocol_exceptions.py started failing. The failure is on the condition that no more than `cpp_exception_threshold` happened. Test logic: These tests assert that specific code paths do not throw an exception anymore. Initial implementation ran a code path once, and asserted there were 0 exceptions. Sometimes an exception or several can occur, not directly related to the code paths the tests check, but those would fail the tests. The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values. Note that the exceptions are counted per shard, not per code path. The new problem: The occassional exceptions thrown by some parts of the server now throw a bit more than before. Based on the logs linked on the issues, it is usually 12. There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter exception logs only for wanted exceptions. However, if a new, different exception is introduced, it might not be counted. Another approach is to just increase the threshold a bit. The issue of throwing more exceptions than before in some other server modules should be addressed by a set of tests for that module, just like these tests check protocol exceptions, not caring who used protocol check code paths. For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run). 
Still, to make "background exceptions" occurrence a bit more normalized, `run_count` too is doubled, from `100` to `200`. At first glance this looks like nothing is changed, but actually doubling both run count and exception threshold here implies that the exception burst does not scale as much as run count, it is just that the "jitter" is bigger than the old threshold. Fixes #27247 Fixes #27325	2025-12-09 01:34:48 +01:00
Michał Jadwiszczak	51843195f7	test/boost/view_build_test: increase number of retries Default number of retries in `eventually()` in `test_builder_with_concurrent_drop` sometimes is not enough to observe changes in system tables on aarch64 builds. This patch increases the number of retries to 30. Fixes scylladb/scylladb#27370 Closes scylladb/scylladb#27493	2025-12-08 23:14:01 +02:00
Gleb Natapov	7038b8b544	test/scylla_cluster: fix the check that a process failed to start If the process is running, returncode will be None; otherwise it will have some value (which can be 0 as well), and the current code treats 0 as if the process is still running. Closes scylladb/scylladb#27490	2025-12-08 18:23:29 +01:00
Tomasz Grabiec	7df610b73d	sstables: Remove host id mismatch warning for sstable streaming Tablet migration transfers sstable files without changing origin host-id. As it should, because those sstables were not written on the destination host, and should be ignored by commit log replay. So it's a normal situation, and it's confusing to see this warning in logs. Fixes #26957 Closes scylladb/scylladb#27433	2025-12-08 18:39:22 +02:00
Piotr Dulikowski	386309d6a0	Merge 'Improve the way distributed-loader constructs storage_options for backup sstables' from Pavel Emelyanov The distributed_loader::get_sstables_from_object_store() method accepts an endpoint parameter and internally wants to get the storage type for that endpoint (s3 or gcs). This is needed to construct a storage_options object to create an sstable object. To get the type, the method scans the db::config option, but there's a much simpler way to get one. Code cleanup, no need to backport Closes scylladb/scylladb#27381 * github.com:scylladb/scylladb: sstables_loader: Provide endpoint type for get_sstables_from_object_store() storage_manager: Introduce get_endpoint_type() method storage_manager: Split get_endpoint_client()	2025-12-08 16:55:20 +01:00
Yauheni Khatsianevich	f12adfc292	test/lwt: add counter-table support to BaseLWTTester Extend BaseLWTTester with optional counter-table configuration and verification, enabling randomized LWT tests over tablets with counters.	2025-12-08 15:57:33 +01:00
Amnon Heiman	a213e41250	scylla-node-exporter: Add ethtool to node exporter AWS suggests following multiple network performance metrics: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html#network-performance-metrics This patch enables the ethtool collector with the specific list of metrics. After this patch the relevant metrics look like: $ curl http://localhost:9100/metrics \|& grep ethtool node_ethtool_bw_in_allowance_exceeded{device="ens5"} 0 node_ethtool_bw_out_allowance_exceeded{device="ens5"} 0 node_ethtool_conntrack_allowance_available{device="ens5"} 51303 node_ethtool_conntrack_allowance_exceeded{device="ens5"} 0 node_ethtool_info{bus_info="0000:00:05.0",device="ens5",driver="ena",expansion_rom_version="",firmware_version="",version="6.14.0-1015-aws"} 1 node_ethtool_linklocal_allowance_exceeded{device="ens5"} 0 node_scrape_collector_duration_seconds{collector="ethtool"} 0.001091436 node_scrape_collector_success{collector="ethtool"} 1 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes scylladb/scylladb#27358	2025-12-08 14:27:10 +02:00
Dawid Mędrek	58dc414912	test/cluster/mv: Rewrite test_view_building_scheduling_group We rewrite the test to avoid flakiness. Instead of looking at the metrics, we make a trade-off and start depending on a less reliable mechanism -- logs. We grep all relevant messages printed by Scylla in TRACE mode and make sure that they were all printed from a context using the streaming scheduling group. Although it's a "less proper" way of testing, it should be much more dependable and avoid flakiness. Fixes scylladb/scylladb#25957 Closes scylladb/scylladb#26656	2025-12-08 14:24:25 +02:00
Ferenc Szili	d883ff2317	test: fix flakiness caused by TRUNCATE retries The test test_truncate_during_topology_change tests TRUNCATE TABLE while bootstrapping a new node. With tablets enabled TRUNCATE is a global topology operation which needs to serialize with bootstrap. When TRUNCATE TABLE is issued, it first checks if there is an already queued truncate for the same table. This can happen if a previous TRUNCATE operation has timed out, and the client retried. The newly issued truncate will only join the queued one if it is waiting to be processed, and will fail immediately if the TRUNCATE is already being processed. In this test, TRUNCATE will be retried after a timeout (1 minute) due to the default retry policy, and will be retried up to 3 times, while the bootstrap is delayed by 2 minutes. This means that the test can validate the result of a truncate which was started after bootstrap was completed. Because of the way truncate joins existing truncate operations, we can also have the following scenario: - TRUNCATE times out after one minute because the new node is being bootstrapped - the client retries the TRUNCATE command which also times out after 1m - the third attempt is received during TRUNCATE being processed which fails the test This patch changes the retry policy of the TRUNCATE operation to FallthroughRetryPolicy which guarantees that TRUNCATE will not be retried on timeout. It also increases the timeout of the TRUNCATE from 1 to 4 minutes. This way the test will actually validate the performance of the TRUNCATE operation which was issued during bootstrap, instead of the subsequent, retried TRUNCATEs which could have been issued after the bootstrap was complete. Fixes: #26347 Closes scylladb/scylladb#27245	2025-12-08 14:13:26 +02:00
dependabot[bot]	1f777da863	build(deps): bump sphinx-scylladb-theme from 1.8.9 to 1.8.10 in /docs Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.9 to 1.8.10. - [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases) - [Commits](https://github.com/scylladb/sphinx-scylladb-theme/commits) --- updated-dependencies: - dependency-name: sphinx-scylladb-theme dependency-version: 1.8.10 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#27468	2025-12-08 13:40:51 +02:00
Asias He	faad0167d7	repair: Add tablet repair progress report support This patch adds tablet repair progress report support so that the user could use the /task_manager/task_status API to query the progress. In order to support this, a new system table is introduced to record the user request related info, i.e, start of the request and end of the request. The progress is accurate when tablet split or merge happens in the middle of the request, since the tokens of the tablet are recorded when the request is started and when repair of each tablet is finished. The original tablet repair is considered as finished when the finished ranges cover the original tablet token ranges. After this patch, the /task_manager/task_status API will report correct progress_total and progress_completed. Fixes #22564 Fixes #26896 Closes scylladb/scylladb#26924	2025-12-08 13:35:19 +02:00
Andrei Chekun	0115a21b9a	test.py: fail test when timeout reached for boost test There is a bug in the current pytest boost implementation. When the timeout was reached, the process was killed, but this was not correctly propagated, which led to a false positive result. This patch fails the test case when the timeout for the process is reached. This is to prevent issues like this https://github.com/scylladb/scylladb/issues/27237 Closes scylladb/scylladb#27463	2025-12-08 11:49:46 +01:00
Ernest Zaslavsky	71834ce7dd	streaming: remove unnecessary lambda creating sstable token range The `sstable_token_range` lambda was only used once to create a token range for an SSTable. Inline the construction directly where needed, removing the extra lambda. This simplifies the code without changing behavior.	2025-12-08 12:30:24 +02:00
Ernest Zaslavsky	df21112c39	streaming: simplify get_sstables_for_tablets logic Remove the use of the `overlaps` helper and unnest nested conditionals in get_sstables_for_tablets. Straightforward `before` and `after` checks are sufficient to decide how each SSTable should be handled.	2025-12-08 12:30:24 +02:00
Ernest Zaslavsky	bd339cc4d8	streaming: switch to range-based for loop Replace the explicit iterator loop with a range-based for loop. This simplifies the code, enforces constness, and avoids the unnecessary use of postfix increment. The behavior remains unchanged, but readability and maintainability are improved.	2025-12-08 12:30:23 +02:00
Ernest Zaslavsky	91bf23eea1	streaming: drop sstable skip microoptimization in tablet loop Remove the microoptimization that advanced over SSTables ending before a tablet range. This approach is misleading since SSTables are not sorted by their end token, and the extra logic adds complexity with little to no benefit. The streaming path here is not performance‑critical, so the simpler loop is preferable.	2025-12-08 12:30:23 +02:00
Ernest Zaslavsky	f925ed176b	streaming: replace reverse iterators with reverse view in sstables scan Use a reverse view over the SSTables vector instead of reverse iterators. This avoids awkward rbegin/rend usage and the mental overhead of tracking inverted sort order. With a view, we can use standard begin/end iteration while preserving the intended scan direction.	2025-12-08 12:30:23 +02:00
Ernest Zaslavsky	68dcd1b1b2	streaming: return from get_sstables_for_tablets earlier Check if tablets or sstables list is empty and if so, return immediately	2025-12-08 12:30:23 +02:00
Ernest Zaslavsky	6fd5160947	streaming: add get_sstables_by_tablet_range tests Add a comprehensive test suite that exercises various combinations of SSTable containment within tablet ranges. These cases cover boundary conditions, partial overlaps, and full containment to validate all recent changes made to `get_sstables_by_tablet_range`.	2025-12-08 12:30:23 +02:00
Ernest Zaslavsky	3fc914ca59	test,sstables: add helper to set sstable first and last keys Introduce a utility helper to set the first and last decorated keys on an SSTable. This is intended for testing purposes, making it easier to construct SSTables with defined boundaries in unit tests.	2025-12-08 12:30:23 +02:00
Ernest Zaslavsky	6ef7ad9b5a	streaming: refactor get_sstables_for_tablets to make it accessible Create `get_sstables_for_tablets_for_tests` friend free function for testing purposes. Adding this free function allows direct testing without requiring the full streamer context.	2025-12-08 12:30:23 +02:00
Ernest Zaslavsky	581b8ace83	streaming: refactor get_sstables_for_tablets to make it testable Make the `get_sstables_for_tablets` member function `static`. This is a step toward improved testability, allowing the function to be invoked directly without requiring a full instance of the streamer.	2025-12-08 12:30:23 +02:00
Pavel Emelyanov	8192f45e84	Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier. When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory and the manifest.json file, rather than using the sstable generation. This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable may be migrated across shards or across nodes, and in this case, its generation may change, but its sstable identifier remains stable. Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to object storage and exist in previous backed up snapshots. Historically, the sstable generation was guaranteed to be unique only per table per node, so the dedup code currently checks for deduplication in the node scope. However, with tablet migration, sstables are renamed when migrated to a different shard, i.e. their generation changes, and they may be renamed when migrated to another node, but even if they are not, the dedup logic still assumes uniqueness only within a node. To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since `3a12ad96c7`). Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables in a wider scope. This can be cluster-wide, but we practically need only rack-wide deduplication or dc-wide, as tablets are migrated across racks only on rare occasions (like when converting from a numerical replication factor to a rack list containing a subset of the available racks in a datacenter). 
Fixes #27181 * New feature, no backport required Closes scylladb/scylladb#27184 * github.com:scylladb/scylladb: database: truncate_table_on_all_shards: set use_sstable_identifier to true nodetool: snapshot: add --use-sstable-identifier option api: storage_service: take_snapshot: add use_sstable_identifier option test: database_test: add snapshot_use_sstable_identifier_works test: database_test: snapshot_works: add validate_manifest sstable: write_scylla_metadata: add random_sstable_identifier error injection table: snapshot_on_all_shards: take snapshot_options sstable: add get_format getter sstable: snapshot: add use_sstable_identifier option db: snapshot_ctl: snapshot_options: add use_sstable_identifier options db: snapshot_ctl: move skip_flush to struct snapshot_options	2025-12-08 12:56:12 +03:00
Petr Gusev	c6eec4eeef	alternator/executor.cc: make cas_write a private method We will need to access executor::_stats field from cas_write. We could pass it as a parameter, but it seems simpler to just make cas_write an instance method too.	2025-12-08 10:29:54 +01:00
Petr Gusev	9bef142328	alternator/executor.cc: make do_batch_write a private method We will need to access executor::_stats field on other shards.	2025-12-08 10:29:54 +01:00
Petr Gusev	74bf24a4a7	alternator/executor.cc: fix indent	2025-12-08 10:29:28 +01:00
Petr Gusev	e60bcd0011	test_alternator: add test_alternator_invalid_shard_for_lwt This test reproduces scylladb/scylladb#27353 using two injection points. First, the test triggers an intra-node tablet migration and suspends it at the streaming stage using the intranode_migration_streaming_wait injection. Next, it enables the alternator_executor_batch_write_wait injection, which suspends a batch write after its cas_shard has already been created. The test then issues several batch writes and waits until one of them hits this injection on the destination shard. At this point, the cas_shard.erm for that write is still in the streaming state, meaning the executor would need to jump back to the source shard. The test then resumes the suspended tablet migration, allowing it to update the ERM on the source shard to write_both_read_new. After that, the test releases the suspended batch write and expects it to perform two shard jumps: first from the destination to the source shard, and then again back to the source shard. This commit adds the alternator_executor_batch_write_wait injection to alternator/executor.cc. Coroutines are intentionally avoided in the parallel_for_each lambda to prevent unnecessary coroutine-frame allocations.	2025-12-08 10:29:28 +01:00
Avi Kivity	45c16553eb	Revert "Update tools/cqlsh submodule" This reverts commit `ff1b212319`. In this commit, the python driver was updated to 3.29.6. That version has a serious flaw - it rejects compression=None settings [1] which cqlsh (legitimately) uses in copyutil.py. The reason this hasn't caused numerous continuous integration failures is that the submodule update commit did not update the frozen toolchain, so the build was effectively running with an older version of the driver. Fix by reverting the change. This allows us to regenerate the frozen toolchain when we need to. Reverted changes: * tools/cqlsh 2240122...6badc99 (2): > Update scylla-driver version to 3.29.6 > Revert "Migrate workflows to Blacksmith" [1] `78f554236f` Closes scylladb/scylladb#27473	2025-12-08 08:50:52 +02:00
Nadav Har'El	c984f557ef	Merge 'alternator: eliminate cross shard ::free for do_batch_write' from Petr Gusev This is an optimization follow-up [for this PR](https://github.com/scylladb/scylladb/pull/27396#issuecomment-3611410774): avoiding destruction of foreign objects on the wrong shard. Releasing objects allocated on a different shard causes their ::free calls to be executed remotely, which adds unnecessary load to the SMP subsystem. Before this PR, a `std::vector<put_or_delete_item>` could be moved to another shard. When the vector was eventually destroyed, its ::free had to be marshalled back to the shard where the memory had originally been allocated. This change avoids that overhead by passing the vector by const reference instead. backport: not needed, this is an optimization Closes scylladb/scylladb#27432 * github.com:scylladb/scylladb: alternator/executor.cc: avoid cross-shard free storage_proxy: cas: take cas_request by raw reference	2025-12-07 22:54:36 +02:00
Andrei Chekun	5e83311305	test.py: switch to ThreadPoolExecutor With python 3.14, the Process fails due to pickling issue with nodes objects. This will eliminate this issue, so we can bump up the python version. Closes scylladb/scylladb#27456	2025-12-07 17:37:25 +02:00
Petr Gusev	f00f7976c1	alternator/executor.cc: avoid cross-shard free This commit is an optimization: avoiding destruction of foreign objects on the wrong shard. Releasing objects allocated on a different shard causes their ::free calls to be executed remotely, which adds unnecessary load to the SMP subsystem. Before this patch, a std::vector could be moved to another shard. When the vector was eventually destroyed, its ::free had to be marshalled back to the shard where the memory had originally been allocated. This change avoids that overhead by passing the vector by const reference instead. The referenced objects lifetime correctness reasoning: * the put_or_delete_item refs usages in put_or_delete_item_cas_request are bound to its lifetime * cas_request lifetime is bound to storage_proxy::cas future * we don't release put_or_delete_item-s until all storage_proxy::cas calls are done.	2025-12-07 16:14:56 +01:00
Petr Gusev	c428645d16	storage_proxy: cas: take cas_request by raw reference In the next commit we want to add an optimization that relies on precise control over the lifetime of cas_request. In particular, we want the implementation of this interface in Alternator to operate on raw references that are guaranteed to remain valid only until the cas() future is resolved. We already depend on the same lifetime assumptions in cas_request when used by modification_statement. However, these assumptions are not clearly expressed in the current interface: cas_request is taken by shared_ptr, and nothing prevents cas() from storing that pointer inside paxos_response_handler, which may outlive the cas() future. This commit fixes that by taking cas_request by raw reference. This makes it explicit that cas() does not assume ownership of the object. Callers must ensure that the referenced object remains valid until the returned future is resolved.	2025-12-07 16:14:56 +01:00
Tomasz Grabiec	082342ecad	Attach names to allocating sections for better debuggability Large reserves in allocating_section can cause stalls. We already log reserve increase, but we don't know which table it belongs to: lsa - LSA allocation failure, increasing reserve in section 0x600009f94590 to 128 segments; Allocating sections used for updating row cache on memtable flush are notoriously problematic. Each table has its own row_cache, so its own allocating_section(s). If we attached table name to those sections, we could identify which table is causing problems. In some issues we suspected system.raft, but we can't be sure. This patch allows naming allocating_sections for the purpose of identifying them in such log messages. I use abstract_formatter for this purpose to avoid the cost of formatting strings on the hot path (e.g. index_reader). And also to avoid duplicating strings which are already stored elsewhere. Fixes #25799 Closes scylladb/scylladb#27470	2025-12-07 14:14:25 +02:00
Avi Kivity	47efbdffbc	Merge 'cache, mvcc: Preempt cache update when applying range tombstone from memtable' from Tomasz Grabiec Range tombstones are represented as entry attributes, which applies to the interval between entries. So if a range tombstone covers many rows, to apply it we have to update all covered entries. In some workloads that could be many entries, even the whole cache. Before the patch, we did this update without preemption, which can cause reactor stalls in such workloads. This scenario is already covered by mvcc_tests, e.g. test_apply_to_incomplete_respects_continuity. And I verified that the new preemption point is hit in the test. perf-row-cache-update results show no significant stalls anymore (max 2ms scheduling delay, instead of previous 1.5 s): Generated 1124195 rows Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]} Draining... took 0.000616 [ms] cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630) update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0 Fixes #23479 Fixes #2578 Closes scylladb/scylladb#27469 * github.com:scylladb/scylladb: cache, mvcc: Preempt cache update when applying range tombstone from memtable partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone() perf-row-cache-update: Add scenario with large tombstone covering many rows	2025-12-07 11:54:15 +02:00
Avi Kivity	d811eeb4ca	Merge 'Make direct failure detector verb handler more efficient' from Gleb Natapov We saw that in large clusters direct failure detector may cause large task queues to be accumulated. The series addresses this issue and also moves the code into the correct scheduling group. Fixes https://github.com/scylladb/scylladb/issues/27142 Backport to all versions where `60f1053087` was backported to since it should improve performance in large clusters. Closes scylladb/scylladb#27387 * github.com:scylladb/scylladb: direct_failure_detector: run direct failure detector in the gossiper scheduling group raft: drop invoke_on from the pinger verb handler direct_failure_detector: pass timeout to direct_fd_ping verb	2025-12-07 11:40:26 +02:00
Marcin Maliszkiewicz	4784e39665	auth: fix ctor signature of certificate_authenticator In `b9199e8b24` we added cache argument to constructor of authenticators but certificate_authenticator was omitted. Class registrator sadly only fails at runtime for such cases. Fixes https://github.com/scylladb/scylladb/issues/27431 Closes scylladb/scylladb#27434	2025-12-07 11:18:42 +02:00
Tomasz Grabiec	d4014b7970	Drop legacy schema support We switched to using v3 schema tables (in system_schema keyspace) in 2017, in `9eb91bc30b`. So no system should have the old schema any more. No need to run legacy_schema_migrator on boot. Closes scylladb/scylladb#27420	2025-12-07 00:09:13 +02:00
Tomasz Grabiec	92b5e4d63d	cache, mvcc: Preempt cache update when applying range tombstone from memtable Range tombstones are represented as entry attributes, which applies to the interval between entries. So if a range tombstone covers many rows, to apply it we have to update all covered entries. In some workloads that could be many entries, even the whole cache. Before the patch, we did this update without preemption, which can cause reactor stalls in such workloads. This scenario is already covered by mvcc_tests, e.g. test_apply_to_incomplete_respects_continuity. And I verified that the new preemption point is hit in the test. perf-row-cache-update results show no significant stalls anymore (max 2ms scheduling delay, instead of previous 1.5 s): Generated 1124195 rows Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]} Draining... took 0.000616 [ms] cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630) update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0 Fixes #23479 Fixes #2578	2025-12-06 13:45:35 +01:00
Tomasz Grabiec	e546143fd9	partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone()	2025-12-06 01:03:10 +01:00
Tomasz Grabiec	721434054b	perf-row-cache-update: Add scenario with large tombstone covering many rows Fills memtable with rows and a tombstone which deletes all rows which are already in cache. Similar to raft log workload, but more extreme. With -c1 -m4G, observed really bad performance: update: 1711.976196 [ms], preemption: {count: 22603, 99%: 0.943127 [ms], max: 1494.571776 [ms]}, cache: 2148/2906 [MB], alloc/comp: 1334/869 [MB] (amp: 0.651), pr/me/dr 1062186/0/1062187 cache: 2148/2906 [MB], memtable: 738/1024 [MB], alloc/comp: 993/0 [MB] (amp: 0.000) Which means that max reactor stall during cache update was 1.5 [s] 0.7 GB memtables. 2.1 GB in cache.	2025-12-06 01:03:09 +01:00
Nadav Har'El	350cbd1d66	alternator: fix typo of BatchWriteItem in comments The DynamoDB API's "BatchWriteItem" operation is spelled like this, in singular. Some comments incorrectly referred to as BatchWriteItems - in plural. This patch fixes those mistakes. There are no functional changes here or changes to user-facing documents - these mistakes were only in code comments. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27446	2025-12-05 15:08:58 +02:00
Botond Dénes	866c96f536	Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk This pull request adds support for calculating and storing CRC32 digests for all SSTable components. This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream. All important SSTable components (Index, Partitions, Rows, Summary, Filter, CompressionInfo, and TOC) are covered. Several test cases were introduced to verify expected behaviour. Backport is not required, it is a new feature Fixes #20100 Closes scylladb/scylladb#27287 * github.com:scylladb/scylladb: sstable_test: add verification testcases of SSTable components digests persistence sstables: store digest of all sstable components in scylla metadata sstables: Add TemporaryScylla metadata component type sstables: Extract file writer closing logic into separate methods sstables: Add components_digests to scylla metadata components sstables: Implement CRC32 digest-only writer	2025-12-05 11:36:50 +02:00
Botond Dénes	367633270a	Merge 'EAR: handle IPV6 hosts in KMIP and use shared (improved) http parser in AWS/Azure' from Calle Wilund Fixes #27367 Fixes #27362 Fixes #27366 Makes http URL parser handle IPv6. Makes KMIP host setup handle IPv6 hosts + use system trust if no truststore set Moves Azure/KMS code to use shared http URL parser to avoid same regex everywhere. Closes scylladb/scylladb#27368 * github.com:scylladb/scylladb: ear::kms/ear::azure: Use utils::http URL parsing ear::kmip_host: Handle ipv6 hosts + use system trust when not specified utils::http: Handle ipv6 numeric host part in URL:s	2025-12-05 10:43:07 +02:00
Asias He	e97a504775	repair: Allow min max range to be updated for repair history It is observed that: repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb: seastar::rpc::remote_verb_error (repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum token,maximum token) is not in the format of (start, end]) This is because repair checks the end of the range to be repaired needs to be inclusive. When small_table_optimization is enabled for regular repair, a (minimum token,maximum token) will be used. To fix, we can relax the check of (start, end] for the min max range. Fixes #27220 Closes scylladb/scylladb#27357	2025-12-05 10:41:25 +02:00
Anna Stuchlik	a5c971d21c	doc: update the upgrade policy to cover non-consecutive minor upgrades Fixes https://github.com/scylladb/scylladb/issues/27308 Closes scylladb/scylladb#27319	2025-12-05 10:31:53 +02:00
Guy Shtub	a0809f0032	Update integration-jaeger.rst Fixing broken link in Jaeger Docs to ScyllaDB Closes scylladb/scylladb#26406	2025-12-05 10:23:07 +02:00
Piotr Dulikowski	bb6e41f97a	index: allow vector indexes without rf_rack_valid_keyspaces The rf_rack_valid_keyspaces option needs to be turned on in order to allow creating materialized views in tablet keyspaces with numeric RF per DC. This is also necessary for secondary indexes because they use materialized views underneath. However, this option is _not_ necessary for vector store indexes because those use the external vector store service for querying the list of keys to fetch from the main table, they do not create a materialized view. The rf_rack_valid_keyspaces was, by accident, required for vector indexes, too. Remove the restriction for vector store indexes as it is completely unnecessary. Fixes: SCYLLADB-81 Closes scylladb/scylladb#27447	2025-12-05 09:26:26 +02:00
Marcin Maliszkiewicz	4df6b51ac2	auth: fix cache::prune_all roles iteration During `b9199e8b24` review it was suggested to use standard for loop but when erasing element it causes increment on invalid iterator, as role could have been erased before. This change brings back original code. Fixes: https://github.com/scylladb/scylladb/issues/27422 Backport: no, offending commit not released yet Closes scylladb/scylladb#27444	2025-12-04 23:35:54 +01:00
Taras Veretilnyk	0c8730ba05	sstable_test: add verification testcases of SSTable components digests persistence Adds a generic test helper that writes a random SSTable, reloads it, and verifies that the persisted CRC32 digest for each component matches the digest computed from disk. These test cases cover all checksummed components.	2025-12-04 21:09:01 +01:00
Taras Veretilnyk	bc2e83bc1f	sstables: store digest of all sstable components in scylla metadata This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.	2025-12-04 21:00:09 +01:00
Patryk Jędrzejczak	f4c3d5c1b7	Merge 'fix test_coordinator_queue_management flakiness' from Gleb Natapov After `39cec4ae45` node join may fail with either "request canceled" notification or (very rarely) because it was banned. Which one depends on timing. The series fixes the test to check for both possibilities. Fixes #27320 No need to backport since the flakiness is in the master only. Closes scylladb/scylladb#27408 * https://github.com/scylladb/scylladb: test: fix test_coordinator_queue_management flakiness test/pylib: allow expected_error in server_start to contain regular expression	2025-12-04 16:08:02 +01:00
Tomasz Grabiec	e54abde3e8	Merge 'main: delay setup of storage_service REST API' from Andrzej Jackowski The storage_service REST API uses `group0` internally. Before this patch, it was possible to send an HTTP request before `group0` was initialized, which resulted in a segmentation fault. Therefore, this patch delays the setup of the storage_service REST API. Additionally, `test_rest_api_on_startup` is added to reproduce the problem. Fixes: https://github.com/scylladb/scylladb/issues/27130 No backport. It's a crash fix but possible only if a request is sent in a very specific phase of a node start. Closes scylladb/scylladb#27410 * github.com:scylladb/scylladb: test: add test_rest_api_on_startup main: delay setup of storage_service REST API	2025-12-04 14:56:49 +01:00
Avi Kivity	9696ee64d0	database: fix overflow when computing data distribution over shards We store the per-shard chunk count in a uint64_t vector global_offset, and then convert the counts to offsets with a prefix sum: ```c++ // [1, 2, 3, 0] --> [0, 1, 3, 6] std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus()); ``` However, std::exclusive_scan takes the accumulator type from the initial value, 0, which is an int, instead of from the range being iterated, which is of uint64_t. As a result, the prefix sum is computed as a 32-bit integer value. If it exceeds 0x8000'0000, it becomes negative. It is then extended to 64 bits and stored. The result is a huge 64-bit number. Later on we try to find an sstable with this chunk and fail, crashing on an assertion. An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57 The fix is simple: the initial value is passed as uint64_t instead of int. Fixes https://github.com/scylladb/scylladb/issues/27417 Closes scylladb/scylladb#27418	2025-12-04 14:10:53 +01:00
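The accumulator-type pitfall described in commit 9696ee64d0 can be reproduced in isolation. The sketch below is not the ScyllaDB code: the function names are hypothetical, and it only contrasts an `int` initial value with a `uint64_t` one for the same `std::exclusive_scan` call.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: prefix sum with an int initial value. The accumulator
// type is taken from the literal 0, so each running sum is narrowed to int
// before it is stored back as a 64-bit value.
std::vector<std::uint64_t> offsets_with_int_init(std::vector<std::uint64_t> counts) {
    std::exclusive_scan(counts.begin(), counts.end(), counts.begin(), 0, std::plus());
    return counts;
}

// Hypothetical helper: the same scan with a 64-bit initial value, so the
// accumulator has the same width as the elements.
std::vector<std::uint64_t> offsets_with_u64_init(std::vector<std::uint64_t> counts) {
    std::exclusive_scan(counts.begin(), counts.end(), counts.begin(),
                        std::uint64_t(0), std::plus());
    return counts;
}
```

With small inputs both variants agree, for example `assert((offsets_with_u64_init({1, 2, 3, 0}) == std::vector<std::uint64_t>{0, 1, 3, 6}));`. They diverge only once a running sum no longer fits in an `int`.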
Michał Jadwiszczak	aa908ba99c	docs/dev/view-building-coordinator: fix typos	2025-12-04 12:52:42 +01:00
Michał Jadwiszczak	529cd25c51	db/view/view_building_worker: remove unnecessary empty lines	2025-12-04 12:52:42 +01:00
Michał Jadwiszczak	4fc5fcaec4	db/view/view_building_worker: fix typo	2025-12-04 12:52:42 +01:00
Michał Jadwiszczak	3253b05ec9	db/view/view_building_worker: avoid creating a copy of tasks map The loop can be converted to use an iterator and avoid creating a copy of the tasks map.	2025-12-04 12:52:41 +01:00
Michał Jadwiszczak	597a2ce5f9	db/view/view_building_worker: wrap conditionally compiled code in a scope The code creates a local variable, so it's better to wrap it in a local scope, so that the conditionally compiled variable doesn't pollute the external scope.	2025-12-04 12:52:41 +01:00
Michał Jadwiszczak	a5f19af050	db/view/view_building_worker: remove unnecessary CV broadcast After scylladb/scylladb#26897 was merged, the worker doesn't use the view building state machine CV to manage lifetime of batches, so the broadcast is not needed.	2025-12-04 12:52:41 +01:00
Michał Jadwiszczak	b4fe565f07	db/view/view_building_worker: catch general exception in staging task registrator In case of a general exception in `view_building_worker::create_staging_sstable_tasks()`, catch it, print it with error level and sleep 1s before retrying. This will allow the registrator to retry its work in case of failure and it should be easier to detect any bugs in the method.	2025-12-04 12:52:37 +01:00
Calle Wilund	8dd69f02a8	ear::kms/ear::azure: Use utils::http URL parsing Fixes #27367 Move to reuse shared code.	2025-12-04 11:38:41 +00:00
Calle Wilund	d000fa3335	ear::kmip_host: Handle ipv6 hosts + use system trust when not specified Fixes #27362 The KMIP host connector should handle ipv4 connections (named or numeric). It also should fall back to system trust when truststore is not specified.	2025-12-04 11:38:41 +00:00
Calle Wilund	4e289e8e6a	utils::http: Handle ipv6 numeric host part in URL:s Fixes #27366 A URL with numeric host part formats special in case of ipv6, to avoid confusion with port part. The parser should handle this. I.e. http://[2001:db8:4006:812::200e]:8080 v2: * Include scheme agnostic parse + case insensitive scheme matching	2025-12-04 11:38:41 +00:00
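The bracketed IPv6 form mentioned in commit 4e289e8e6a can be illustrated with a small standalone parser. This is a sketch under stated assumptions, not the `utils::http` implementation: `parsed_url` and `parse_authority` are hypothetical names, and only the scheme, host, and port are extracted.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <cctype>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical result type for this sketch.
struct parsed_url {
    std::string scheme;       // lower-cased, so matching is case insensitive
    std::string host;         // IPv6 literals are stored without the brackets
    std::optional<int> port;  // empty when the URL carries no port
};

// Parses "<scheme>://<host>[:<port>][/...]" where <host> may be "[<ipv6>]".
std::optional<parsed_url> parse_authority(std::string_view url) {
    constexpr auto npos = std::string_view::npos;
    const auto sep = url.find("://");
    if (sep == npos) return std::nullopt;

    parsed_url out;
    for (char c : url.substr(0, sep)) {
        out.scheme.push_back(static_cast<char>(std::tolower(static_cast<unsigned char>(c))));
    }
    std::string_view rest = url.substr(sep + 3);
    rest = rest.substr(0, rest.find('/'));  // drop the path, if any

    std::string_view port_part;
    if (!rest.empty() && rest.front() == '[') {
        // Bracketed IPv6 literal: colons inside the brackets belong to the host.
        const auto close = rest.find(']');
        if (close == npos) return std::nullopt;
        out.host = std::string(rest.substr(1, close - 1));
        rest = rest.substr(close + 1);
        if (!rest.empty()) {
            if (rest.front() != ':') return std::nullopt;
            port_part = rest.substr(1);
        }
    } else {
        const auto colon = rest.rfind(':');
        out.host = std::string(rest.substr(0, colon));
        if (colon != npos) port_part = rest.substr(colon + 1);
    }
    if (out.host.empty()) return std::nullopt;

    if (!port_part.empty()) {
        int port = 0;
        for (char c : port_part) {
            if (c < '0' || c > '9') return std::nullopt;
            port = port * 10 + (c - '0');
            if (port > 65535) return std::nullopt;
        }
        out.port = port;
    }
    return out;
}
```

For the URL from the commit message, `parse_authority("http://[2001:db8:4006:812::200e]:8080")` yields the host `2001:db8:4006:812::200e` and the port `8080`. Without the brackets the last colon would be ambiguous between address and port.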
Benny Halevy	19b6207f17	database: truncate_table_on_all_shards: set use_sstable_identifier to true To facilitate global sstable deduplication on backup. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:57:39 +02:00
Benny Halevy	ff52550739	nodetool: snapshot: add --use-sstable-identifier option Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:57:39 +02:00
Benny Halevy	e654045755	api: storage_service: take_snapshot: add use_sstable_identifier option Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:57:39 +02:00
Benny Halevy	07b92a1ee8	test: database_test: add snapshot_use_sstable_identifier_works Test that taking a snapshot with the use_sstable_identifier option (and injecting `random_sstable_identifier`) produces different file names in the snapshot than the original sstable names and validate the manifest.json file accordingly. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:57:38 +02:00
Benny Halevy	7504d10d9e	test: database_test: snapshot_works: add validate_manifest Validate the manifest.json format by loading it using rjson::parse and then validate its contents to ensure it lists exactly the SSTables present in the snapshot directory. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:55:50 +02:00
Benny Halevy	28cb300d0a	sstable: write_scylla_metadata: add random_sstable_identifier error injection To be used by a unit test in the following patch for testing the snapshot use_sstable_identifier option. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:55:50 +02:00
Benny Halevy	9b3fbedc8c	table: snapshot_on_all_shards: take snapshot_options And pass the use_sstable_identifier down the stack to the sstables layer. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:55:50 +02:00
Benny Halevy	420fb1fd53	sstable: add get_format getter To be used by the snapshot code in the following patch for manufacturing a basename using the sstable_id rather than its generation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:55:50 +02:00
Benny Halevy	7c62417b54	sstable: snapshot: add use_sstable_identifier option When set to true, use the sstable_identifier as the sstable name in the snapshot rather than its generation. sstable::snapshot now returns the generation it used for the sstable in the snapshot, based on the `use_sstable_identifier` option, to be used by the upper layer generating the manifest. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:53:32 +02:00
Botond Dénes	9d2f7c3f52	Merge 'mv: allow setting concurrency in PRUNE MATERIALIZED VIEW' from Wojciech Mitros The PRUNE MATERIALIZED VIEW statement is performed as follows: 1. Perform a range scan of the view table from the view replicas based on the ranges specified in the statement. 2. While reading the paged scan above, for each view row perform a read from all base replicas at the corresponding primary key. If a discrepancy is detected, delete the row in the view table. When reading multiple rows, this is very slow because for each view row we need to perform a single row query on multiple replicas. In this patch we add an option to speed this up by performing many of the single base row reads concurrently, at the concurrency specified in the USING CONCURRENCY clause. Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were: * Pruning with concurrency 1 took total 1416 seconds * Pruning with concurrency 2 took total 731 seconds * Pruning with concurrency 10 took total 234 seconds * Pruning with concurrency 100 took total 171 seconds So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE Fixes https://github.com/scylladb/scylladb/issues/27070 Closes scylladb/scylladb#27097 * github.com:scylladb/scylladb: mv: allow setting concurrency in PRUNE MATERIALIZED VIEW cql: add CONCURRENCY to the USING clause	2025-12-04 11:47:41 +02:00
Aleksandra Martyniuk	e3e81a9a7a	repair: throw if flush failed in get_flush_time Currently, _flush_time was stored as a std::optional<gc_clock::time_point> and std::nullopt indicates that the flush was needed but failed. It's confusing for the caller and does not work as expected since the _flush_time is initialized with value (not optional). Change _flush_time type to gc_clock::time_point. If a flush is needed but failed, get_flush_time() throws an exception. This was supposed to be a part of https://github.com/scylladb/scylladb/pull/26319 but it was mistakenly overwritten during rebases. Refs: https://github.com/scylladb/scylladb/issues/24415. Closes scylladb/scylladb#26794	2025-12-04 11:45:53 +02:00
Gleb Natapov	86dde50c0d	direct_failure_detector: run direct failure detector in the gossiper scheduling group When the direct failure detector was introduced, the idea was that it would run on the same connection as the raft group0 verbs, but in `60f1053087` raft verbs were moved to run on the gossiper connection while DIRECT_FD_PING was left where it was. This patch moves it to the gossiper connection as well and fixes the pinger code to run in the gossiper scheduling group.	2025-12-04 11:35:43 +02:00
Gleb Natapov	6a6bbbf1a6	raft: drop invoke_on from the pinger verb handler Currently raft direct pinger verb jumps to shard 0 to check if group0 is alive before replying. The verb runs relatively often, so it is not very efficient. The patch distributes group0 liveness information (as it changes) to all shards instead, so that the handler itself does not need to jump to shard 0.	2025-12-04 11:35:43 +02:00
Avi Kivity	b82f92b439	main: replace p11-kit hack for trust paths override with gnutls hack p11-kit has hardcoded paths for the trust paths. Of course, each Linux distribution hardcodes those paths differently. As a result, our relocatable gnutls, which uses p11-kit-trust.so to process the trust paths, needs some overrides to select the right paths. Currently, we use p11_kit_override_system_files(), a p11-kit API intended for testing, but which worked well enough for our purpose, to override the trust module configuration. Unfortunately, starting (presumably [1]) in gnutls 3.8.11, gnutls changed how it works with p11-kit and our override is now ignored. This was likely unintentional, but there appears to be a better way: instead of letting gnutls auto-load the trust module from a hacked configuration, we load the modules ourselves using gnutls_pkcs11_init(GNUTLS_PKCS11_FLAG_MANUAL) and gnutls_pkcs11_add_provider(). These appear to be intended for the purpose. We communicate the paths to the scylla executable using an environment variable. This isn't optimal, but is much easier than adding a command line variable since there are multiple levels of command line parsing due to the subtool mechanism. With this, we unlock the possibility to upgrade gnutls to newer versions. [1] `aa5f15a872` Closes scylladb/scylladb#27348	2025-12-04 11:33:51 +02:00
Gleb Natapov	f00e00fde0	test: fix test_coordinator_queue_management flakiness After `39cec4ae45` node join may fail with either "request canceled" notification or (very rarely) because it was banned. Which one occurs depends on timing. The patch fixes the test to check for both possibilities.	2025-12-04 11:06:20 +02:00
Gleb Natapov	b0727d3f2a	test/pylib: allow expected_error in server_start to contain regular expression Currently expected_error parameter to server_start can only work with exact matches. Change it to support regular expressions.	2025-12-04 11:06:20 +02:00
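The move from exact matching to regular expressions in commit b0727d3f2a lets one expectation cover both failure modes of the flaky test. The real helper lives in the Python test library; the C++ sketch below only illustrates the idea, and `matches_expected_error` and the sample log lines are hypothetical.

```cpp
#include <cassert>   // used by the checks in the usage note
#include <regex>
#include <string>

// Hypothetical helper: returns true if the log line contains a match for the
// expected-error pattern, which is interpreted as a regular expression
// (ECMAScript syntax, the std::regex default) rather than a literal string.
bool matches_expected_error(const std::string& log_line, const std::string& expected_error) {
    return std::regex_search(log_line, std::regex(expected_error));
}
```

A single alternation pattern then accepts either outcome, for example `assert(matches_expected_error("join failed: request canceled", "request canceled|was banned"));`.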
Calle Wilund	4169bdb7a6	encryption::gcp_host: Add exponential retry for server errors Fixes #27242 Similar to AWS, google services may at times simply return a 503, more or less meaning "busy, please retry". We rely for most cases on higher up layers to handle said retry, but we cannot fully do so, both because we sometimes reach this code through paths that do no such thing, and also because it would be slightly inefficient, since we'd like to for example control the back-off for auth etc. This simply changes the existing retry loop in gcp_host to be a little more forgiving, special case 503 errors and extend the retry to the auth part, as well as re-use the exponential_backoff_retry primitive. v2: * Avoid backoff if refreshing credentials. Should not add latency due to this. * Only allow re-auth once per (non-service-failure-backoff) try. * Add abort source to both request and retry v3: * Include timeout and other server errors in retry-backoff v4: * Reorder error code handling correctly Closes scylladb/scylladb#27267	2025-12-04 10:13:37 +02:00
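The retry behaviour described in commit 4169bdb7a6 can be sketched as a generic loop that doubles its delay after each retryable response. This is not the `exponential_backoff_retry` primitive from the codebase: the function names, the choice of status codes treated as retryable (408 and 5xx), and the injected `sleep_fn` are all assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <functional>

// Assumption for this sketch: request timeouts (408) and all 5xx server
// errors are worth retrying; everything else is returned to the caller.
bool is_retryable(int status) {
    return status == 408 || (status >= 500 && status < 600);
}

// Calls `request` until it returns a non-retryable status or `max_attempts`
// is exhausted. Between attempts it invokes `sleep_fn` with a delay that
// starts at `base` and doubles each time, capped at `cap`.
// Returns the status of the last attempt.
int retry_with_backoff(const std::function<int()>& request,
                       const std::function<void(std::chrono::milliseconds)>& sleep_fn,
                       int max_attempts,
                       std::chrono::milliseconds base,
                       std::chrono::milliseconds cap) {
    assert(max_attempts > 0);
    auto delay = base;
    int status = 0;
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        status = request();
        if (!is_retryable(status) || attempt == max_attempts) {
            break;
        }
        sleep_fn(delay);
        delay = std::min(delay * 2, cap);
    }
    return status;
}
```

Passing the sleep as a callback keeps the sketch independent of any particular scheduler. In a Seastar-based program the callback would be an asynchronous sleep tied to an abort source rather than a blocking call.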
Anna Stuchlik	c5580399a8	replace the Driver pages with a link to the new Drivers pages This commit removes the now redundant driver pages from the ScyllaDB documentation. Instead, the link to the pages where we moved the driver information is added. Also, the links are updated across the ScyllaDB manual. Redirections are added for all the removed pages. Fixes https://github.com/scylladb/scylladb/issues/26871 Closes scylladb/scylladb#27277	2025-12-04 10:07:27 +02:00
Benny Halevy	1c45ad7cee	db: snapshot_ctl: snapshot_options: add use_sstable_identifier options To be used for naming sstables in the snapshot by their sstable identifiers rather than their generation, to facilitate global deduplication of sstables in backup. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 09:46:35 +02:00
Benny Halevy	c18133b6cb	db: snapshot_ctl: move skip_flush to struct snapshot_options Prepare for adding another option: use_sstable_identifier. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 09:46:35 +02:00
Tomasz Grabiec	1d42770936	Merge 'topology_coordinator: Add barrier to cleanup_target' from Łukasz Paszkowski Consider the following scenario: 1. A table has RF=3 and writes use CL=QUORUM 2. One node is down 3. There is a pending tablet migration from the unavailable node that is reverted During the revert, there can be a time window where the pending replica being cleaned up still accepts writes. This leads to write failures, as only two nodes (out of four) are able to acknowledge writes. This patch fixes the issue by adding a barrier to the cleanup_target tablet transition state, ensuring that the coordinator switches back to the previous replica set before cleanup is triggered. Fixes https://github.com/scylladb/scylladb/issues/26512 It's a pre existing issue. Backport is required to all recent 2025.x versions. Closes scylladb/scylladb#27413 * github.com:scylladb/scylladb: topology_coordinator: Fix the indentation for the cleanup_target case topology_coordinator: Add barrier to cleanup_target test_node_failure_during_tablet_migration: Increase RF from 2 to 3	2025-12-03 23:57:45 +01:00
Taras Veretilnyk	d287b054b9	sstables: Add TemporaryScylla metadata component type Add TemporaryScylla component type to make atomic updates of SSTable Scylla metadata using temporary files and atomic rename operations possible. This will be needed in further commit to rewrite metadata together with the statistics component.	2025-12-03 23:40:10 +01:00
Szymon Wasik	4f803aad22	Improve documentation of vector search configuration parameters. This patch adds separate group for vector search parameters in the documentation and fixes small typos and formatting. Fixes: SCYLLADB-77. Closes scylladb/scylladb#27385	2025-12-03 21:02:59 +02:00
Karol Nowacki	a54bf50290	vector_search: Fix requests hanging on unreachable nodes When a vector store node becomes unreachable, a client request sent before the keep-alive timer fires would hang until the CQL query timeout was reached. This occurred because the HTTP request writes to the TCP buffer and then waits for a response. While data is in the buffer, TCP retransmissions prevent the keep-alive timer from detecting the dead connection. This patch resolves the issue by setting the `TCP_USER_TIMEOUT` socket option, which applies an effective timeout to TCP retransmissions, allowing the connection to fail faster. Closes scylladb/scylladb#27388	2025-12-03 21:01:43 +02:00
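The socket option named in commit a54bf50290 is a standard Linux TCP option. The sketch below shows how it can be set and read back on a plain socket; it is not the vector_search client code, the helper names are hypothetical, and it assumes a Linux target where `TCP_USER_TIMEOUT` is defined in `<netinet/tcp.h>`.

```cpp
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <unistd.h>

// Hypothetical helper: limits how long transmitted data may stay
// unacknowledged before the kernel closes the connection.
// The timeout is given in milliseconds. Returns true on success.
bool set_tcp_user_timeout(int fd, unsigned int timeout_ms) {
    assert(fd >= 0);
    return ::setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                        &timeout_ms, sizeof(timeout_ms)) == 0;
}

// Hypothetical helper: reads the option back, or returns -1 on failure.
long get_tcp_user_timeout(int fd) {
    assert(fd >= 0);
    unsigned int value = 0;
    socklen_t len = sizeof(value);
    if (::getsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT, &value, &len) != 0) {
        return -1;
    }
    return static_cast<long>(value);
}
```

Because the option bounds the time spent on retransmissions, a write to a dead peer fails after roughly the configured interval instead of waiting for the much longer default retransmission limit.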
Nadav Har'El	06dd3b2e64	install-dependencies.sh: add zlib Scylla uses zlib, through the header <zlib.h>, in sstable compression. We also want to use it in Alternator for gzip-compressed requests. We never actually required zlib explicitly in install-dependencies.sh, we only get it through transitive dependencies. But it's better to require it explicitly so this is what we do in this patch. In Fedora, we use the newer, more efficient, zlib-ng which is API-compatible with the classic zlib. Unfortunately, the Debian zlib-ng package is not drop-in compatible with zlib (you need to include a different header file <zlib-ng.h>) so we use the classic zlib. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27238	2025-12-03 19:30:36 +02:00
Łukasz Paszkowski	6163fedd2e	topology_coordinator: Fix the indentation for the cleanup_target case	2025-12-03 16:37:33 +01:00
Łukasz Paszkowski	67f1c6d36c	topology_coordinator: Add barrier to cleanup_target Consider the following scenario: 1. A table has RF=3 and writes use CL=QUORUM 2. One node is down 3. There is a pending tablet migration from the unavailable node that is reverted During the revert, there can be a time window where the pending replica being cleaned up still accepts writes. This leads to write failures, as only two nodes (out of four) are able to acknowledge writes. This patch fixes the issue by adding a barrier to the cleanup_target tablet transition state, ensuring that the coordinator switches back to the previous replica set before cleanup is triggered. Fixes https://github.com/scylladb/scylladb/issues/26512	2025-12-03 16:19:17 +01:00
Łukasz Paszkowski	669286b1d6	test_node_failure_during_tablet_migration: Increase RF from 2 to 3 The patch prepares the test for additional write workload to be executed in parallel with node failures. With the original RF=2, QUORUM is also 2, which causes writes to fail during node outage. To address it, the third rack with a single node is added and the replication factor is increased to 3.	2025-12-03 16:00:19 +01:00
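The quorum arithmetic behind commit 669286b1d6 is short enough to write out. The sketch below assumes the usual majority rule, quorum = RF / 2 + 1 with integer division, and that every node that is down holds a replica. The function names are hypothetical.

```cpp
#include <cassert>   // used by the checks in the usage note

// Majority quorum for a given replication factor.
constexpr int quorum(int rf) {
    return rf / 2 + 1;
}

// True if enough replicas remain alive to acknowledge a QUORUM write,
// assuming each down node hosts one replica.
constexpr bool quorum_write_succeeds(int rf, int replicas_down) {
    return rf - replicas_down >= quorum(rf);
}

// RF=2: quorum is 2, so losing one replica makes QUORUM writes fail.
static_assert(quorum(2) == 2 && !quorum_write_succeeds(2, 1));
// RF=3: quorum is still 2, so one replica may be down.
static_assert(quorum(3) == 2 && quorum_write_succeeds(3, 1));
```

This is why the test adds a third rack and raises the replication factor to 3 before running writes in parallel with node failures.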
Botond Dénes	b9199e8b24	Merge 'auth: use auth cache on login path' from Marcin Maliszkiewicz Scylla currently has bad resiliency to connection storms. Nodes are easy to overload or impact their latency by unbound concurrency in making new connections on the client side. This can easily happen in bigger deployments where there are thousands of client instances, e.g. pods. To improve resiliency we are introducing unified auth specialized cache to the system. This patch series is stage 1, where cache is used only on login path. Dependency diagram: ``` \|Authentication Layer\| \| v +--------------------------------+ \| Auth Cache \| +--------------------------------+ ^ \| \| \| \| v \|Raft Write Logic \| \| CQL Read Layer\| ``` Cache invalidation is based on raft and the cache contains full content of related tables. Ldap role manager may benefit partially as can_login function is common and will be cached, but it still needs to query roles from external source. Performance results: For single shard connection/disconnection scenario insns/conn decreased by 5%, allocs/conn decreased by 23%, tasks/conn decreased by 20%. Results for 20 shards are very similar. 
Raw data before: ``` ≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true 2> /dev/null Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0, auth, connection_per_request} Pre-populated 10000 partitions 1128.55 tps (599.2 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2586610 insns/op, 1350912 cycles/op, 0 errors) 1157.41 tps (601.3 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2589046 insns/op, 1356691 cycles/op, 0 errors) 1167.42 tps (603.3 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2603234 insns/op, 1360607 cycles/op, 0 errors) 1159.63 tps (605.9 allocs/op, 0.0 logallocs/op, 145.3 tasks/op, 2609977 insns/op, 1363935 cycles/op, 0 errors) 1165.12 tps (608.8 allocs/op, 0.0 logallocs/op, 145.2 tasks/op, 2625804 insns/op, 1365736 cycles/op, 0 errors) throughput: mean= 1155.63 standard-deviation=15.66 median= 1159.63 median-absolute-deviation=9.49 maximum=1167.42 minimum=1128.55 instructions_per_op: mean= 2602934.31 standard-deviation=16063.01 median= 2603234.19 median-absolute-deviation=13887.96 maximum=2625804.05 minimum=2586609.82 cpu_cycles_per_op: mean= 1359576.30 standard-deviation=5945.69 median= 1360607.05 median-absolute-deviation=4358.94 maximum=1365736.42 minimum=1350912.10 ``` Raw data after: ``` ≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true --duration 10 2> /dev/null Running test with config: {workload=read, partitions=10000, concurrency=100, duration=10, ops_per_shard=0, auth, connection_per_request} Pre-populated 10000 partitions 1132.09 tps (457.5 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2432485 insns/op, 1270655 cycles/op, 0 errors) 1157.70 tps (458.4 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2447779 insns/op, 
1283768 cycles/op, 0 errors) 1162.86 tps (459.0 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2463225 insns/op, 1291782 cycles/op, 0 errors) 1153.15 tps (460.2 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2469230 insns/op, 1296381 cycles/op, 0 errors) 1142.09 tps (460.6 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2478900 insns/op, 1299342 cycles/op, 0 errors) 1124.89 tps (462.5 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2470962 insns/op, 1305026 cycles/op, 0 errors) 1156.75 tps (464.4 allocs/op, 0.0 logallocs/op, 115.1 tasks/op, 2493823 insns/op, 1305136 cycles/op, 0 errors) 1152.16 tps (466.3 allocs/op, 0.0 logallocs/op, 115.2 tasks/op, 2497246 insns/op, 1309816 cycles/op, 0 errors) 1154.77 tps (469.8 allocs/op, 0.0 logallocs/op, 115.5 tasks/op, 2571954 insns/op, 1345341 cycles/op, 0 errors) 1152.22 tps (472.4 allocs/op, 0.0 logallocs/op, 115.3 tasks/op, 2551954 insns/op, 1334202 cycles/op, 0 errors) throughput: mean= 1148.87 standard-deviation=12.08 median= 1153.15 median-absolute-deviation=7.88 maximum=1162.86 minimum=1124.89 instructions_per_op: mean= 2487755.88 standard-deviation=43838.23 median= 2478900.02 median-absolute-deviation=24531.06 maximum=2571954.26 minimum=2432485.38 cpu_cycles_per_op: mean= 1304144.76 standard-deviation=22129.55 median= 1305025.71 median-absolute-deviation=12363.25 maximum=1345341.16 minimum=1270655.17 ``` Fixes https://github.com/scylladb/scylladb/issues/18891 Backport: no, it's a new feature Closes scylladb/scylladb#26841 * github.com:scylladb/scylladb: auth: use auth cache on login path auth: corutinize standard_role_manager::can_login main: auth: add auth cache dependency to auth service raft: update auth cache when data changes auth: storage_service: reload auth cache on v1 to v2 auth migration raft: reload auth cache on snapshot application service: add auth cache getter to storage service main: start auth cache service auth: add unified cache implementation auth: move table names to common.hh	2025-12-03 16:45:01 +02:00
Andrzej Jackowski	1ff7f5941b	test: add test_rest_api_on_startup This test verifies that REST API requests are handled properly when a server is started or restarted. It is used to verify the fix for scylladb/scylladb#27130, where a server failed with a segmentation fault when `storage_service/raft_topology/reload` was called too early. Refs: scylladb/scylladb#27130	2025-12-03 15:35:59 +01:00
Andrzej Jackowski	3b70154f0a	main: delay setup of storage_service REST API The storage_service REST API uses `group0` internally. Before this patch, it was possible to send an HTTP request before `group0` was initialized, which resulted in a segmentation fault. Therefore, this patch delays the setup of the storage_service REST API. Fixes: scylladb/scylladb#27130	2025-12-03 15:35:54 +01:00
Pavel Emelyanov	6ae72ed134	test: Reuse S3 fixtures facilities in cqlpy/test_tools.py Creating endpoint conf can be made with the s3_server method Getting boto3 resource from s3_server itself is also possible Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27380	2025-12-03 16:32:54 +02:00
Michael Litvak	9213a163cb	test: fix test flakiness in test_colocated_tables_gc_mode The test executes a LWT query in order to create a paxos state table and verify the table properties. However, after executing the LWT query, the table may not exist on all nodes but only on a quorum of nodes, thus checking the properties of the table may fail if the table doesn't exist on the queried node. To fix that, execute a group0 read barrier to ensure the table is created on all nodes. Fixes scylladb/scylladb#27398 Closes scylladb/scylladb#27401	2025-12-03 12:12:24 +01:00
David Garcia	d9593732b1	docs: add strict mode to control metrics validation behavior The metrics extension now includes validation to detect missing metrics. This validation caused failures during multiversion publication because older versions did not generate all required properties. Instead of fixing each branch, a strict mode flag was introduced to control when validation should run. Strict mode is enabled in the workflow that validates pull requests, ensuring that new changes meet the expected metrics. During multiversion builds, validation errors are now logged but do not raise exceptions, which prevents build failures while still providing visibility into missing data. docs: verbose mode docs: verbose mode Closes scylladb/scylladb#27402	2025-12-03 14:09:08 +03:00
Anna Stuchlik	48cf84064c	doc: add the upgrade guide from 2025.x to 2025.4 Fixes https://github.com/scylladb/scylladb/issues/26451 Fixes https://github.com/scylladb/scylladb/issues/26452 Closes scylladb/scylladb#27310	2025-12-03 11:18:10 +03:00
Avi Kivity	a12165761e	Update seastar submodule * seastar b5c76d6b...7ec14e83 (5): > Merge 'reactor: coroutinize more file related functions' from Avi Kivity reactor: reindent after coroutinization reactor: fdatasync: coroutinize reactor: touch_directory: coroutinize reactor: make_directory: coroutinize reactor: open_directory: coroutinize reactor: statvfs: coroutinize reactor: fstatfs: coroutinize reactor: file_system_at: coroutinize reactor: file_accessible: coroutinize reactor: file_size: coroutinize reactor: file_stat: coroutinize > reactor: Mark some sched-stats getters const > Merge 'coroutine: allocate coroutine frame in a critical section' from Avi Kivity coroutine: allocate coroutine frame in a critical section memory: add C23 free_sized, free_aligned_sized > coroutines: simplify execute_involving_handle_destruction_in_await_suspend() > coroutine: introduce try_future Closes scylladb/scylladb#27369	2025-12-03 10:55:47 +03:00
Nadav Har'El	7dc04b033c	test/cluster: fix missing racks in xfailing Alternator test Since Alternator is now using tablets by default, it's no longer possible to create an Alternator table on a 3-node cluster with a single rack - you need to have 3 racks to support RF=3. Most of the multi-node Alternator tests in test/cluster/test_alternator.py were already fixed to use a 3-rack cluster, but one test was missed because it was marked "xfail" so its new failure to create the table was missed. This patch adds the missing 3-rack setup, so the xfailing test returns to failing on the real bug - not on the table creation. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27382	2025-12-03 10:54:11 +03:00
Piotr Dulikowski	654ac9099b	db/view/view_building_coordinator: skip work if no view is built Even though `view_building_coordinator::work_on_view_building` has an `if` at the very beginning which checks whether the currently processed base table is set, it only prints a message and continues executing the rest of the function regardless of the result of the check. However, some of the logic in the function assumes that the currently processed base table field is set and tries to access the value of the field. This can lead to the view building coordinator accessing a disengaged optional, which is undefined behavior. Fix the function by adding the clearly missing `co_return` to the check. A regression test is added which checks that the view building state observer - a different fiber which used to print a weird message due to erroneous view building coordinator behavior - does not print a warning. Fixes: scylladb/scylladb#27363 Closes scylladb/scylladb#27373	2025-12-03 09:44:28 +02:00
Andrzej Jackowski	ff1b212319	Update tools/cqlsh submodule The motivation for the update is using newer version of scylla-driver that supports new event type CLIENT_ROUTES_CHANGE. * tools/cqlsh 22401228...6badc992 (2): > Update scylla-driver version to 3.29.6 > Revert "Migrate workflows to Blacksmith" Closes scylladb/scylladb#27359	2025-12-02 15:14:26 +02:00
Calle Wilund	4e7ec9333f	gcp::object_storage: Include auth in exponential back-off-retry Fixes #27268 Refs #27268 Includes the auth call in code covered by backoff-retry on server error, as well as moves the code to use the shared primitive for this and increase the resilience a bit (increase retry count). v2: * Don't do backoff if we need to refresh credentials. * Use abort source for backoff if avail v3: * Include other retryable conditions in auth check Closes scylladb/scylladb#27269	2025-12-02 15:08:49 +02:00
Gleb Natapov	82f80478b8	direct_failure_detector: pass timeout to direct_fd_ping verb Currently direct_fd_ping runs without a timeout. The verb is not waited on forever, though: the wait is canceled after a timeout, but this timeout is simply not passed to the rpc. This may create a situation where the rpc callback runs on the destination but is no longer waited on. Change the code to pass the timeout to the rpc as well and return earlier from the rpc handler if the timeout is reached by the time the callback is called. This is backwards compatible since the timeout is passed as optional.	2025-12-02 14:55:20 +02:00
Botond Dénes	357f91de52	Revert "Merge 'db/config: enable `ms` sstable format by default' from Michał Chojnowski" This reverts commit `b0643f8959`, reversing changes made to `e8b0f8faa9`. The change forgot to update sstables_manager::get_highest_supported_format(), which results in /system/highest_supported_sstable_version still returning me, confusing and breaking tests. Fixes: scylladb/scylla-dtest#6435 Closes scylladb/scylladb#27379	2025-12-02 14:38:56 +02:00
Botond Dénes	e762027943	db/config: change batchlog_replay_cleanup_after_replays default to 1 Now that batchlog cleanup is cheap, on account of memtable flush on the system.batchlog table garbage-collecting tombstones (previous patch), we can afford to do cleanup on each replay, keeping the memtable size small and more importantly -- the amount of tombstones in the memtable small.	2025-12-02 14:21:26 +02:00
Botond Dénes	8edd5b80ab	test/boost/batchlog_manager_test: add test for batchlog cleanup Add more tests covering different aspects of batchlog replay, cleanup, replay timeout and finally v1 -> v2 migration.	2025-12-02 14:21:26 +02:00
Botond Dénes	fb84b30f88	replica/mutation_dump: always set position weight for clustering positions SELECT * FROM MUTATION_FRAGMENTS() queries have a transformed schema (mutation-fragment schema), which is a superset of that of the queried table's. The mutation fragment schema represents position_in_partition of mutation fragments expressed as clustering columns. This presents some challenges, as some position_in_partition fields are null for some positions. This was solved by setting these clustering keys components to bytes{}. In the process, a mistake was made: when the clustering key is missing in the position_in_partition, the position_weight is also set to bytes{}. This is not correct, it is possible for some positions to have no key but to still have a position_weight. An example is position_in_partition::before_all_clustered_rows(). Fix this by always filling in the position_weight for positions which have region() == clustered, instead of the earlier condition on the key presence. This is a minor bug affecting range tombstone changes at the two extremes: position_in_partition::{before,after}_all_clustered_rows(). In both cases, the position_weight can be deduced by a human looking at the results, based on the position of the range tombstone change, relative to other fragments.	2025-12-02 14:21:26 +02:00
Botond Dénes	8545f7eedd	service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/ Rename to make it more explicit where the error injection happens. Also change how the error is injected, use the lambda overload instead of is_enabled(), the former leaves better trace in logs, which helps when debugging tests.	2025-12-02 14:21:26 +02:00
Botond Dénes	e52e1f842e	test/lib: introduce error_injection.hh Test-specific helpers for working with error injection. Provides an RAII object to enable/disable error injection points, on all shards.	2025-12-02 14:21:26 +02:00
Botond Dénes	0a7df4b8ac	utils/error_injection: add debug log to disable() and disable_all() enable() and friends already have debug logs.	2025-12-02 14:21:26 +02:00
Botond Dénes	9bb8156f02	test/lib/cql_test_env: forward config to batchlog Currently all batchlog config items are hardcoded. Make the two important ones configurable: replay_timeout and replay_cleanup_after_replays.	2025-12-02 14:21:26 +02:00
Botond Dénes	d1b796bc43	test/lib/cql_test_env: add batch type to execute_batch() Allow executing logged batches too. Also add a trace log to the method.	2025-12-02 14:21:26 +02:00
Botond Dénes	1ad64731bc	test/lib/cql_assertions: add with_size(predicate) overload Sometimes, expected size is not a single number. Add predicate overload to allow expressing more complicated expectations.	2025-12-02 14:21:26 +02:00
Botond Dénes	abadb8ebfb	test/lib/cql_assertions: add source location to fail messages For tests that contain multiple assert_that() invocations, identifying the one that failed is very challenging. Add source location to fail messages to allow convenient identification of the call-site.	2025-12-02 14:21:26 +02:00
Botond Dénes	54f16f9019	test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row() Invoke the passed in function with a columns_assertions instance for the current row, allowing for sweeping checks across columns of all rows.	2025-12-02 14:21:26 +02:00
Botond Dénes	b584e1e18e	test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check	2025-12-02 14:21:26 +02:00
Botond Dénes	aa1d3f1170	test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload To enable assertions on columns which are sometimes null. One existing user of with_typed_column() needs adjustment, because the previous version of with_typed_column() silently covered up null values, whereas after this patch a null value causes a failure.	2025-12-02 14:21:26 +02:00
Botond Dénes	e309b5dbe1	db/batchlog_manager: config: s/write_timeout/reply_timeot/ Although the value of this item is indeed derived from the write timeout config, the name doesn't reflect what it is used for. Change it to reflect it better.	2025-12-02 14:21:26 +02:00
Botond Dénes	846b656610	db,service: switch to system.batchlog_v2 New batchlogs are written to the batchlog_v2 table and replay also uses the v2 table. A migration of the content of system.batchlog to system.batchlog_v2 is attempted after each start of the batchlog_manager service. The migration is retried on each replay if it fails. This is redundant but simple. Batchlog cleanup now doesn't involve flushing memtables, the only remaining user of replica/database.hh is gone, so the include is dropped.	2025-12-02 14:21:26 +02:00
Botond Dénes	ee851266be	db/system_keyspace: introduce system.batchlog_v2 Rearranges the system.batchlog schema as follows: CREATE TABLE system.batchlog_v2 ( version int, stage int, shard int, written_at timestamp, id uuid, data blob, PRIMARY KEY ((version, stage, shard), written_at, id)); With the following goals: 1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC. 2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely. 3) Spread out the data among Scylla shards, via the batchlog shard column. 4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.	2025-12-02 14:21:25 +02:00
Botond Dénes	9434ec2fd1	service,db: extract generation of batchlog delete mutation Don't build batchlog delete mutations in storage-proxy code. Move this code into db/batchlog_manager.cc, exposed via db/batchlog.hh. This serves multiple goals: 1) Concentrates low-level batchlog related logic in db/batchlog_manager.cc 2) Reduce current and future code duplication. 3) Make future changes to this logic easier.	2025-12-02 14:21:25 +02:00
Botond Dénes	f54602daf0	service,db: extract get_batchlog_mutation_for() from storage-proxy Don't build batchlog mutations in storage-proxy code. Move this code into db/batchlog_manager.cc, exposed via db/batchlog.hh. This serves multiple goals: 1) Concentrates low-level batchlog related logic in db/batchlog_manager.cc 2) Reduce current and future code duplication. 3) Make future changes to this logic easier.	2025-12-02 14:21:25 +02:00
Botond Dénes	097c2cd676	db/batchlog_manager: only consider propagation delay with tombstone-gc=repair The propagation delay has no effect for other tombstone gc strategies, so ignore it when tombstone-gc != repair.	2025-12-02 14:21:25 +02:00
Botond Dénes	4f30807f01	db/batchlog_manager: don't drop entire batch if one mutation's table was dropped Just skip the mutation(s) whose tables were dropped instead. Use the newly introduced data_dictionary::table::get_truncation_time() to avoid looking up real table object.	2025-12-02 14:21:25 +02:00
Botond Dénes	55704908a0	data_dictionary: table: add get_truncation_time() So the batchlog manager can avoid looking up the real table and instead just work with data dictionary.	2025-12-02 14:21:25 +02:00
Taras Veretilnyk	a191503ddf	sstables: Extract file writer closing logic into separate methods Refactor the consume_end_of_stream() method by extracting the inline file writer closing logic into dedicated methods: - close_index_writer() - close_partitions_writer() - close_rows_writer()	2025-12-02 13:07:41 +01:00
Taras Veretilnyk	619bf3ac4b	sstables: Add components_digests to scylla metadata components Add components_digests struct with optional digest fields for storing CRC32 digests of individual SSTable components in Scylla metadata. These include: - Data - Compression - Filter - Statistics - Summary - Index - TOC - Partitions - Rows	2025-12-02 12:36:34 +01:00
Pawel Pery	b5c85d08bb	unittest: fix vector_store_client_test_dns_refresh_aborted hangs The root cause of the hanging test is a concurrency deadlock. `vector_store_client` runs dns refresh time and it is waiting for the condition variable. After aborting the dns request, the test signals the condition variable. Stopping the vector_store_client takes long enough to trigger the next dns refresh, and this time the condition variable won't be signalled, so vector_store_client will wait forever for the dns refresh fiber to finish. The commit fixes the problem by waiting for the condition variable only once. Fixes: #27237 Fixes: VECTOR-370 Closes scylladb/scylladb#27239	2025-12-02 12:22:44 +01:00
Piotr Dulikowski	3aaab5d5a3	Merge 'vector_search: Fix high availability during timeouts' from Karol Nowacki This PR introduces two key improvements to the robustness and resource management of vector search: Proper Abort on CQL Timeout: Previously, when a CQL query involving a vector search timed out, the underlying ANN query to the vector store was not aborted and would continue to run. This has been fixed by ensuring the abort source is correctly signaled, terminating the ANN request when its parent CQL query expires and preventing unnecessary resource consumption. Faster Failure Detection: The connection and keep-alive timeouts for vector store nodes were excessively long (2 and 11 minutes, respectively), causing significant delays in detecting and recovering from unreachable nodes. These timeouts are now aligned with the request_timeout_in_ms setting, allowing for much faster failure detection and improving high availability by failing over from unresponsive nodes more quickly. Fixes: SCYLLADB-76 This issue affects the 2025.4 branch, where similar HA recovery delays have been observed. Closes scylladb/scylladb#27377 * github.com:scylladb/scylladb: vector_search: Fix ANN query abort on CQL timeout vector_search: Reduce connection and keep-alive timeouts	2025-12-02 11:14:48 +01:00
Botond Dénes	337f417b13	db/batchlog_manager: batch(): replace map_reduce() with simple loop The map_reduce achieves no concurrency, both map and reduce are synchronous. It only achieves two redundant lookups for the table and hard-to-read code. Convert it into a simple loop. Preserve the stall-protection by adding a maybe_yield() to the loop.	2025-12-02 12:05:10 +02:00
Wojciech Mitros	6221c58325	mv: replace the simple/complex rack-aware pairing with exact rack matching When the initial version of rack-aware pairing was introduced, materialized views with tablets were still experimental. Since then, we decided that we'll only allow materialized views in clusters where the base table and the view are replicated on the same racks, with one replica of each tablet on each rack. This allows us to remove almost all logic from our base-view pairing. The only check for the paired view replica is now whether it's in the same rack as the base replica sending the update. In this patch we replace the simple and complex rack-aware pairing with the simple check above. Because of this, we have to remove a test case from network_topology_strategy_test which was testing complex pairing. The tested topology is not supported for views with tablets (or is unlikely to be supported, as it's a random test), so there's no use keeping the test. The test case for simple rack aware pairing was kept, but now we only test the case where each rack has one replica, not multiple. Additionally, we split finding of an unpaired replica to a separate function and partially rewrite it without reusing the helper structures that were present when calculating the simple and complex rack-aware pairing. We only look for an unpaired replica if we couldn't find a paired replica ourselves or if the number of view replicas didn't match the base replicas. If an unpaired replica appears while these conditions pass, we won't send an extra update, but that would be a new bug altogether, because we only expect the unpaired replica to appear during RF changes, so when these conditions aren't fulfilled. Fixes https://github.com/scylladb/scylladb/issues/26313	2025-12-02 10:52:36 +01:00
Ernest Zaslavsky	605f71d074	s3_client: handle additional transient network errors Add handling for a broader set of transient network-related `std::errc` values in `aws_error::from_system_error`. Treat these conditions as retryable when the client re-creates the socket for each request. Fixes: https://github.com/scylladb/scylladb/issues/27349 Closes scylladb/scylladb#27350	2025-12-02 11:44:40 +02:00
Botond Dénes	705af2bc16	db/batchlog_manager: finish coroutinizing replay_all_failed_batches It was coroutinized already but strangely, some continuations also remained. The `batch` lambda is still left in continuation style.	2025-12-02 10:42:28 +02:00
Botond Dénes	5b5f9120d0	db/batchlog_manager: improve replayAllFailedBatches logs Add cleanup flag value to start message and drop cpu, it is redundant as Scylla already adds the shard number to the logs. Add all_replayed to finish message.	2025-12-02 10:42:28 +02:00
Pavel Emelyanov	6c115c691f	sstables_loader: Provide endpoint type for get_sstables_from_object_store() Currently the method scans db::config to find one. It has some drawbacks. First, it's not very nice. Second, it needs to handle the case when the endpoint is missing, while it really never is. Third, the type in config entry is not necessarily set. It's nicer to get the type from storage manager. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-02 11:18:32 +03:00
Pavel Emelyanov	5924c36b50	storage_manager: Introduce get_endpoint_type() method So that other code (spoiler: see next patch) has a simple API to get one. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-02 11:18:27 +03:00
Pavel Emelyanov	ad6a73c29b	storage_manager: Split get_endpoint_client() To get the get_endpoint() internal helper for future use. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-02 11:18:23 +03:00
Wojciech Mitros	4ec0fa6eb5	mv: split out vnode pairing code from get_view_natural_endpoint To avoid repeatedly checking whether we're using tablets and having to use unnecessarily flexible code fitting both cases, we split out the base-view pairing code for the case of vnodes to another function. The get_view_natural_endpoint will now have only common steps, a call to that function, and steps specific to tablets.	2025-12-02 03:32:36 +01:00
Wojciech Mitros	c313b215e4	mv: unify self-pairing and rack-aware pairing into one bool We always use "legacy self pairing" when not using tablets, and the "rack aware pairing" has been enabled in every version where views with tablets aren't experimental. So in practice, instead of checking these variables we can just look at whether the table uses tablets.	2025-12-02 03:32:32 +01:00
Karol Nowacki	086c6992f5	vector_search: Fix ANN query abort on CQL timeout When a CQL vector search request timed out, the underlying ANN query was not aborted and continued to run. This happened because the abort source was not being signaled upon request expiration. This commit ensures the ANN query is aborted when the CQL request times out, preventing unnecessary resource consumption.	2025-12-02 01:17:01 +01:00
Karol Nowacki	b6afacfc1e	vector_search: Reduce connection and keep-alive timeouts The connection timeout was 2 minutes and the keep-alive timeout was 11 minutes. If a vector store node became unreachable, these long timeouts caused significant delays before the system could recover, negatively impacting high availability. This change aligns both timeouts with the `request_timeout` configuration, which defaults to 10 seconds. This allows for much faster failure detection and recovery, ensuring that unresponsive nodes are failed over from more quickly.	2025-12-02 01:17:01 +01:00
Łukasz Paszkowski	0ed3452721	service/storage_service: Mark nodes excluded on shard0 Excluding nodes is a group0 operation and as such it needs to be executed only on shard0. In case the method `mark_excluded` is invoked on a different shard, redirect the request to shard0. Fixes https://github.com/scylladb/scylladb/issues/27129 Closes scylladb/scylladb#27167	2025-12-01 17:30:40 +01:00
Jenkins Promoter	c3c0991428	Update pgo profiles - aarch64	2025-12-01 13:47:56 +02:00
Wojciech Mitros	7c612e1789	mv: remove the workaround for left nodes when sending view updates At one point, the get_view_natural_endpoint was using IP for the view update (and hint) destinations, but the hint code was using host_id for the destinations. When a node left, we could no longer have a mapping for an IP to host_id and when trying to store a hint for this IP, we'd crash. We worked around this issue by dropping the view update completely if the target is in the "left" state. Since then, we also moved to host_id's in the view update code, so there's no longer any translation needed when storing the hints. Additionally, we now drain hints not when entering the "left" state, but when the node actually stops owning tokens. Because of that, the workaround is not needed anymore, so we remove it in this commit. The existing test_mv_tablets_empty_ip case verifies that indeed, we do not crash in the original problematic scenario.	2025-12-01 12:27:28 +01:00
Calle Wilund	5f53d7852e	docs::encryption: Add warning that replicated provider is deprecated And will be removed.	2025-12-01 09:44:44 +00:00
Calle Wilund	52cc30e00c	ent::encryption: Switch default key provider from replicated to local Since we are deprecating the replicated provider, it makes little sense to have it be default.	2025-12-01 09:44:44 +00:00
Calle Wilund	ba38d58539	replicated_key_provider: Add deprecation warning on usage Warns user utilizing the provider that the provider is deprecated and will be removed.	2025-12-01 09:44:44 +00:00
Jenkins Promoter	563e5ddd62	Update pgo profiles - x86_64	2025-12-01 04:24:36 +02:00
Ernest Zaslavsky	f0e2941e34	streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic Extract the SST filtering logic into a dedicated member function. This prepares the code for independent testing without requiring the entire streamer to be initialized.	2025-11-30 18:27:15 +02:00
Artsiom Mishuta	796205678f	test.py: set worksteal distribution Set worksteal distribution for xdist (new scheduler), because it now shows better test distribution than the standard one (load) in CI Closes scylladb/scylladb#27354	2025-11-30 18:13:03 +02:00
Emil Maskovsky	902d70d6b2	.github: add Copilot instructions for AI-generated code Add comprehensive coding guidelines for GitHub Copilot to improve quality and consistency of AI-generated code. Instructions cover C++ and Python development with language-specific best practices, build system usage, and testing workflows. Following GitHub Copilot's standard layout with general instructions in .github/copilot-instructions.md and language-specific files in .github/instructions/ directory using *.instructions.md naming. No backport: This change is only for developers in master, so it doesn't need to be backported. Closes scylladb/scylladb#25374	2025-11-30 13:30:05 +02:00
Avi Kivity	ce2a403f18	Merge 'alternator: implement gzip-compressed requests' from Nadav Har'El In this series we implement Alternator's support for gzip-compressed requests, i.e., requests with the "Content-Encoding: gzip" header, other uncompressed headers, and a gzip-compressed body. The server needs to verify the signature of the compressed content, and then uncompress the body before running the request. We only support gzip compression because this is what DynamoDB supports. But in the future we can easily add support for other compression algorithms like lz4 or zstd. This series Refs #5041 but doesn't "Fixes" it because it only implements compressed requests (Content-Encoding), not compressed responses (Accept-Encoding). In addition to the code changes, the series also contains tests for this feature that make sure it behaves like DynamoDB. Note that while we will now have support in our server for compressed requests, just like DynamoDB does, the clients (AWS SDKs) will probably NOT make use of it because they do not enable request compression by default. For example, see the tests for some hoops one needs to jump through in boto3 (the Python SDK) to send compressed requests. However, we are hoping that in the future Alternator's modified clients will use compressed requests and enjoy this feature. Closes scylladb/scylladb#27080 * github.com:scylladb/scylladb: test/alternator: enable, and add, tests for gzip'ed requests alternator: implement gzip-compressed requests	2025-11-30 13:27:46 +02:00
Avi Kivity	d4be9a058c	Update seastar submodule seastar::compat::source_location (which should not have been used outside Seastar) is replaced with std::source_location to avoid deprecation warnings. The relevant header, which was removed, is no longer included. * seastar 8c3fba7a...b5c76d6b (3): > testing: There can be only one memory_data_sink > util: Use std::source_location directly > Merge 'net: support proxy protocol v2' from Avi Kivity apps: httpd: add --load-balancing-algorithm apps: httpd: add /shard endpoint test: socket_test: add proxy protocol v2 test suite test: socket_test: test load balancer with proxy protocol net: posix_connected_socket: specialize for proxied connections net: posix_server_socket_impl: implement proxy protocol in server sockets net: posix_server_socket_impl: adjust indentation net: posix_server_socket_impl: avoid immediately-invoked lambda net: conntrack: complete handle nested class special member functions net: posix_server_socket_impl: coroutinize accept() Closes scylladb/scylladb#27316	2025-11-30 12:38:47 +02:00
Piotr Dulikowski	44c605e59c	Merge 'Fix the types of change events in Alternator Streams' from Piotr Wieczorek This patch increases the compatibility with DynamoDB Streams by integrating the DynamoDB's event type rules (described in https://github.com/scylladb/scylladb/issues/6918) into Alternator. The main changes are: - introduce a new flag `alternator_streams_strict_compatibility`, meant as a guard of performance-intensive operations that increase the compatibility with DynamoDB Streams. If enabled, Alternator always performs a RBW before a data-modifying operation, and propagates its result to CDC. Then, the old item is compared to the new one, to determine the mutation type (INSERT vs MODIFY). This option is a no-op for tables with disabled Alternator Streams, - reduce splitting of simple Alternator mutations, - correctly distinguish event types described in #6918, except for item deletes. Deleting a missing item with DeleteItem, BatchWriteItem, or a missing field with UpdateItem still emit REMOVEs. To summarize, the emitted events of the data manipulation operations should be as follows: - DeleteItem/BatchWriteItem.DeleteItem of existing item: REMOVE (OK) - DeleteItem of nonexistent item: nothing (OK) - BatchWriteItem.DeleteItem of nonexistent item: nothing (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of existing and not equal item: MODIFY (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of existing and equal item: nothing (OK) - PutItem/UpdateItem/BatchWriteItem.PutItem of nonexistent item: INSERT (OK) No backport is necessary. 
Refs https://github.com/scylladb/scylladb/pull/26149 Refs https://github.com/scylladb/scylladb/pull/26396 Refs https://github.com/scylladb/scylladb/issues/26382 Fixes https://github.com/scylladb/scylladb/issues/6918 Closes scylladb/scylladb#26121 * github.com:scylladb/scylladb: test/alternator: Enable the tests failing because of #6918 alternator, cdc: Don't emit events for no-op removes alternator, cdc: Don't emit an event for equal items alternator/streams, cdc: Differentiate item replace and item update in CDC alternator: Change the return type of rmw_operation_return config: Add alternator_streams_strict_compatibility flag cdc: Don't split a row marker away from row cells	2025-11-30 07:20:22 +01:00
Asias He	da5cc13e97	repair: Fix deadlock when topology coordinator steps down in the middle Consider this: 1) n1 is the topology coordinator 2) n1 schedules and executes a tablet repair with session id s1 for a tablet on n3 and n4. 3) n3 and n4 take the lock and store it in _rs._repair_compaction_locks[s1] 4) n1 steps down before it executes locator::tablet_transition_stage::end_repair 5) n2 becomes the new topology coordinator 6) n2 runs locator::tablet_transition_stage::repair again 7) n3 and n4 try to take the lock again and hang since the lock is already taken. To avoid the deadlock, we can throw in step 7 so that n2 will proceed to end_repair stage and release the lock. After that, the scheduler could schedule the tablet repair request again. Fixes #26346 Closes scylladb/scylladb#27163	2025-11-28 15:14:39 +01:00
Radosław Cybulski	b54a9f4613	Fix use-after-free in encode_paging_state in Alternator Fix unlikely use-after-free in `encode_paging_state`. The function incorrectly assumes that current position to encode will always have data for all clustering columns the schema defines. It's possible to encounter current position having less than all columns specified, for example in case of range tombstone. Those don't happen in Alternator tables as DynamoDB doesn't allow range deletions and clustering key might be of size at most 1. Alternator api can be used to read scylla system tables and those do have range tombstones with more than single clustering column. The fix is to stop trying to encode columns that don't have the value - they are not needed anyway, as there's no possible position with those values (range tombstone made sure of that). Fixes #27001 Fixes #27125 Closes scylladb/scylladb#26960	2025-11-28 16:51:15 +03:00
Pavel Emelyanov	d35ce81ff1	Merge 'test: wait for read_barrier in wait_until_driver_service_level_created' from Andrzej Jackowski Previously, `wait_until_driver_service_level_created` only waited for the `driver` service level to appear in the output of `LIST ALL SERVICE_LEVELS`. However, the fact that one node lists `sl:driver` does not necessarily mean that all other nodes can see it yet. This caused sporadic test failures, especially in DEBUG builds. To prevent these failures, this change adds an extra wait for a `raft/read_barrier` after the `driver` service level first appears. This ensures the service level is globally visible across the cluster. Fixes: https://github.com/scylladb/scylladb/issues/27019 No backport - test fix for `sl:driver` tests, and this is only available on `master` Closes scylladb/scylladb#27076 * github.com:scylladb/scylladb: test: wait for read_barrier in wait_until_driver_service_level_created test: use ManagerClient in wait_until_driver_service_level_created	2025-11-28 16:47:29 +03:00
Dawid Mędrek	b76af2d07f	cql3: Improve errors when manipulating default service level Before this commit, any attempt to create, alter, attach, or drop the default service level would result in a syntax error whose error message was unclear: ``` cqlsh> attach service level default to cassandra; SyntaxException: line 1:21 no viable alternative at input 'default' ``` The error stems from the grammar not being able to parse `default` as a correct service level name. To fix that, we cover it manually. This way, the grammar accepts it and we can process it in Scylla. The reason why we'd like to cover the default service level is that it's an actual service level that the user should reference. Getting a syntax error is not what should happen. Hence this fix. We validate the input and if the given role is really the default service level, we reject the query and provide an informative error message. Two validation tests are provided. Fixes scylladb/scylladb#26699 Closes scylladb/scylladb#27162	2025-11-28 15:32:37 +03:00
Dawid Mędrek	48a28c24c5	db/commitlog: Include position and alignment information in errors When we come across a segment truncation, this information may be helpful to determine when the error occurred exactly and hint at what code path might've led to it. Closes scylladb/scylladb#27207	2025-11-28 15:28:08 +03:00
Calle Wilund	59c87025d1	commitlog::read_log_file: Check for eof position on all data reads Fixes #24346 When reading, we check for each entry and each chunk, if advancing there will hit EOF of the segment. However, IFF the last chunk being read has the last entry _exactly_ matching the chunk size, and the chunk ending at _exactly_ segment size (preset size, typically 32Mb), we did not check the position, and instead complained about not being able to read. This has literally _never_ happened in actual commitlog (that was replayed at least), but has apparently happened more and more in hints replay. Fix is simple, just check the file position against size when advancing said position, i.e. when reading (skipping already does). v2: * Added unit test Closes scylladb/scylladb#27236	2025-11-28 15:26:46 +03:00
Ernest Zaslavsky	1d5f60baac	streaming:: add more logging Start logging all missed streaming options like `scope`, `primary_replica` and `skip_reshape` flags Fixes: https://github.com/scylladb/scylladb/issues/27299 Closes scylladb/scylladb#27311	2025-11-28 12:50:33 +01:00
Emil Maskovsky	37e3dacf33	topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted In several exception handlers, only raft::request_aborted was being caught and rethrown, while seastar::abort_requested_exception was falling through to the generic catch(...) block. This caused the exception to be incorrectly treated as a failure that triggers rollback, instead of being recognized as an abort signal. For example, during tablet draining, the error log showed: "tablets draining failed with seastar::abort_requested_exception (abort requested). Aborting the topology operation" This change adds seastar::abort_requested_exception handling alongside raft::request_aborted in all places where it was missing. When rethrown, these exceptions propagate up to the main run() loop where handle_topology_coordinator_error() recognizes them as normal abort signals and allows the coordinator to exit gracefully without triggering unnecessary rollback operations. Fixes: scylladb/scylladb#27255 No backport: The problem was only seen in tests and not reported in customer tickets, so it's enough to fix it in the main branch. Closes scylladb/scylladb#27314	2025-11-28 12:19:21 +01:00
Michael Litvak	97b7c03709	tablet: scheduler: Do not emit conflicting migration in merge colocation The tablet scheduler should not emit conflicting migrations for the same tablet. This was addressed initially in scylladb/scylladb#26038 but the check is missing in the merge colocation plan, so add it there as well. Without this check, the merge colocation plan could generate a conflicting migration for a tablet that is already scheduled for migration, as the test demonstrates. This can cause correctness problems, because if the load balancer generates two migrations for a single tablet, both will be written as mutations, and the resulting mutation could contain mixed cells from both migrations. Fixes scylladb/scylladb#27304 Closes scylladb/scylladb#27312	2025-11-28 11:17:12 +01:00
Taras Veretilnyk	62802b119b	sstables: Implement CRC32 digest-only writer Introduce template parameter to checksummed file writer to support digest-only calculation without storing chunk checksums. This will be needed in the future to calculate digests of other components.	2025-11-27 22:40:07 +01:00
Pavel Emelyanov	54edb44b20	code: Stop using seastar::compat::source_location And switch to std::source_location. Upcoming seastar update will deprecate its compatibility layer. The patch is for f in $(git grep -l 'seastar::compat::source_location'); do sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f; done and removal of a few header includes. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27309	2025-11-27 19:10:11 +02:00
Avi Kivity	c85671ce51	scripts: refresh-submodules: don't omit last (first) commit `git log --format` doesn't add a newline after the last line. This causes `read` to ignore that line, losing the last line (corresponding to the first commit). Use `git log --tformat` instead, which terminates the last line. Closes scylladb/scylladb#27317	2025-11-27 18:46:27 +02:00
Botond Dénes	9b968dc72c	docs: update dependencies Via make update. Fixes: scylladb/scylladb#27231 Closes scylladb/scylladb#27263	2025-11-27 15:56:34 +03:00
Andrzej Jackowski	e366030a92	treewide: seastar module update The reason for this seastar update is to have the fixed handling of the `integer` type in `seastar-json2code` because it's needed for further development of ScyllaDB REST API. The following changes were introduced to ScyllaDB code to ensure it compiles with the updated seastar: - Remove `seastar/util/modules.hh` includes as the file was removed from seastar - Modified `metrics::impl::labels_type` construction in `test/boost/group0_test.cc` because now it requires `escaped_string` * seastar 340e14a7...8c3fba7a (32): > Merge 'Remove net::packet usage from dns.cc' from Pavel Emelyanov dns: Optimize packet sending for newer c-ares versions dns: Replace net::packet with vector<temporary_buffer> dns: Remove unused local variable dns: Remove pointless for () loop wrapping dns: Introduce do_sendv_tcp() method dns: Introduce do_send_udp() method > test: Add http rules test of matching order > Merge 'Generalize packet_data_source into memory_data_source' from Pavel Emelyanov memcached: Patch test to use memory_data_source memcached: Use memory_data_source in server rpc: Use memory_data_sink without constructing net::packet util: Generalize packet_data_source into memory_data_source > tests: coroutines: restore "explicit this" tests > reactor: remove blocking of SIGILL > Merge 'Update compilers in GH actions scripts' from Pavel Emelyanov github: Use gcc-14 github: Use clang-20 > Merge 'Reinforce DNS reverse resolution test ' from Pavel Emelyanov test: Make test_resolve() try several addresses test: Coroutinize test_resolve() helper > modules: make module support standards-compliant > Merge 'Fix incorrect union access in dns resolver' from Pavel Emelyanov dns: Squash two if blocks together dns: Do not check tcp entry for udp type > coroutine: Fix compilation of execute_involving_handle_destruction_in_await_suspend > promise: Document that promise is resolved at most once > coroutine: exception: workaround broken 
destroy coroutine handle in await_suspend > socket: Return unspecified socket_address for unconnected socket > smp: Fix exception safety of invoke_on_... internal copying > Merge 'Improve loads evaluation by reactor' from Pavel Emelyanov reactor: Keep loads timer on reactor reactor: Update loads evaluation loop > Merge 'scripts: add 'integer' type to seastar-json2code' from Andrzej Jackowski test: extend tests/unit/api.json to use 'integer' type scripts: add 'integer' type to seastar-json2code > Merge 'Sanitize tls::session::do_put(_one)? overloads' from Pavel Emelyanov tls: Rename do_put_one(temporary_buffer) into do_put() tls: Fix indentation after previous patch tls: Move semaphore grab into iterating do_put() > net: tcp: change unsent queue from packets to temporary_buffer:s > timer: Enable highres timer based on next timeout value > rpc: Add a new constructor in closed_error to accept string argument > memcache: Implement own data sink for responses > Merge 'file: recursive_remove_directory: general cleanup' from Avi Kivity file: do_recursive_remove_directory(): move object when popping from queue file: do_recursive_remove_directory(): adjust indentation file: do_recursive_remove_directory(): coroutinize file: do_recursive_remove_directory(): simplify conditional file: do_recursive_remove_directory(): remove wrong const file: do_recursive_remove_directory(): clean up work_entry > tests: Move thread_context_switch_test into perf/ > test: Add unit test for append_challenged_posix_file > Merge 'Prometheus metrics handler optimization' from Travis Downs prometheus: optimize metrics aggregation prometheus: move and test aggregate_by helper prometheus: various optimizations metrics: introduce escaped_string for label values metric:value: implement + in terms of += tests: add prometheus text format acceptance tests extract memory_data_sink.hh metrics_perf: enhance metrics bench > demos: Simplify udp_zero_copy_demo's way of preparing the packet > metrics: Remove 
deprecated make_...-ers > Merge 'Make slab_test be BOOST kind' from Pavel Emelyanov test: Use BOOST_REQUIRE checkers test: Replace some SEASTAR_ASSERT-s with static_assert-s test: Convert slab test into boost kind > Merge 'Coroutinize lister_test' from Pavel Emelyanov test: Fix indentation after previuous patch test: Coroutinize lister_test lister::report() method test: Coroutinize lister_test main code > file: recursive_remove_directory(): use a list instead of a deque > Merge 'Stop using packets in tls data_sink and session' from Pavel Emelyanov tls: Stop using net::packet in session::put() tls: Fix indentation after previous patch tls: Split session::do_put() tls: Mark some session methods private Closes scylladb/scylladb#27240	2025-11-27 12:34:22 +02:00
Nadav Har'El	32afcdbaf0	test/alternator: enable, and add, tests for gzip'ed requests After the previous patch implemented support in Alternator for gzip-compressed requests ("Content-Encoding: gzip"), here we enable an existing xfail-ing test for this feature, and also add more tests for more cases: * A test for longer compressed requests, or a short compressed request which expands to a longer request. Since the decompression uses small buffers, this test reaches additional code paths. * Check for various cases of a malformed gzip'ed request, and also an attempt to use an unsupported Content-Encoding. DynamoDB returns error 500 for both cases, so we want to test that we do too - and not silently ignore such errors. * Check that two concatenated gzip'ed streams are a valid request, and check that garbage at the end of the gzip - or a missing character at the end of the gzip - is recognized as an error. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-27 09:42:47 +02:00
Wojciech Mitros	323e5cd171	mv: allow setting concurrency in PRUNE MATERIALIZED VIEW The PRUNE MATERIALIZED VIEW statement is performed as follows: 1. Perform a range scan of the view table from the view replicas based on the ranges specified in the statement. 2. While reading the paged scan above, for each view row perform a read from all base replicas at the corresponding primary key. If a discrepancy is detected, delete the row in the view table. When reading multiple rows, this is very slow because for each view row we need to perform a single-row query on multiple replicas. In this patch we add an option to speed this up by performing many of the single base row reads concurrently, at the concurrency specified in the USING CONCURRENCY clause. Fixes https://github.com/scylladb/scylladb/issues/27070	2025-11-27 00:02:28 +01:00
Tomasz Grabiec	d6c14de380	Merge 'locator/node: include _excluded in missing places' from Patryk Jędrzejczak We currently ignore the `_excluded` field in `node::clone()` and the verbose formatter of `locator::node`. The first one is a bug that can have unpredictable consequences on the system. The second one can be a minor inconvenience during debugging. We fix both places in this PR. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72 This PR is a bugfix that should be backported to all supported branches. Closes scylladb/scylladb#27265 * github.com:scylladb/scylladb: locator/node: include _excluded in verbose formatter locator/node: preserve _excluded in clone()	2025-11-26 18:29:59 +01:00
Asias He	ab4896dc70	topology_coordinator: Send incremental repair rpc only when the feature is enabled Otherwise, in a mixed cluster, the handle_tablet_resize_finalization would fail because of the unknown rpc verb. Fixes #26309 Closes scylladb/scylladb#27218	2025-11-26 15:25:36 +01:00
Patryk Jędrzejczak	287c9eea65	locator/node: include _excluded in verbose formatter It can be helpful during debugging.	2025-11-26 13:26:17 +01:00
Patryk Jędrzejczak	4160ae94c1	locator/node: preserve _excluded in clone() We currently ignore the `_excluded` field in `clone()`. Losing information about exclusion can have unpredictable consequences. One observed effect (that led to finding this issue) is that the `/storage_service/nodes/excluded` API endpoint sometimes misses excluded nodes.	2025-11-26 13:26:11 +01:00
Patryk Jędrzejczak	cc273e867d	Merge 'fix notification about expiring erm held for to long' from Gleb Natapov Commit `6e4803a750` broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should. Fixes https://github.com/scylladb/scylladb/issues/27141 Closes scylladb/scylladb#27140 * https://github.com/scylladb/scylladb: test: test that expired erm that held for too long triggers notification token_metadata: fix notification about expiring erm held for to long	2025-11-26 12:59:00 +01:00
Amnon Heiman	68c7236acb	vector_index: require tablets for vector indexes This patch enforces that vector indexes can only be created on keyspaces that use tablets. During index validation, `check_uses_tablets()` verifies the base keyspace configuration and rejects creation otherwise. To support this, the `custom_index::validate()` API now receives a `const data_dictionary::database&` parameter, allowing index implementations to access keyspace-level settings during DDL validation. Fixes https://scylladb.atlassian.net/browse/VECTOR-322 Closes scylladb/scylladb#26786	2025-11-26 13:30:43 +02:00
Marcin Maliszkiewicz	dd461e0472	auth: use auth cache on login path This path may become hot during connection storms, which is why we want it to stress the node as little as possible.	2025-11-26 12:01:33 +01:00
Marcin Maliszkiewicz	0c9b2e5332	auth: corutinize standard_role_manager::can_login Coroutinize so that it's easier to add new logic in the following commit.	2025-11-26 12:01:32 +01:00
Marcin Maliszkiewicz	b29c42adce	main: auth: add auth cache dependency to auth service In the following commit we'll switch some authorizer and role manager code to use the cache so we're preparing the dependency.	2025-11-26 12:01:31 +01:00
Marcin Maliszkiewicz	ea3dc0b0de	raft: update auth cache when data changes When applying group0_command we now inspect whether any auth internal tables were modified, and reload affected role entries in the cache. Since one auth DML may change multiple tables, when iterating over mutations we deduplicate affected roles across those tables.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	2a6bef96d6	auth: storage_service: reload auth cache on v1 to v2 auth migration	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	19da1cb656	raft: reload auth cache on snapshot application Receiving a snapshot is a rare event, so as a simplification we'll be reloading the whole cache instead of trying to merge states, especially since the expected size is small, below 100 records. Reloading is a non-disruptive operation: old entries are removed only after all entries are loaded. If an entry is updated, the shared pointer will be atomically replaced in the cache map.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	2cf1ca43b5	service: add auth cache getter to storage service Prepare for use in a subsequent commit in group0_state_machine, where the auth cache will be integrated. This follows the same pattern as updates to the service-level cache, view-building state, and CDC streams.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	642f468c59	main: start auth cache service The service is not yet used anywhere, we first build scaffolding.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	bd7c87731b	auth: add unified cache implementation It combines data from all underlying auth tables. Supports gentle full load and per-role reloads. Loading is done on shard 0, and the data is then deep-copied to all shards.	2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz	4c667e87ec	auth: move table names to common.hh They will be used additionally in cache code, added in following commits.	2025-11-26 12:00:50 +01:00
Nadav Har'El	f4555be8a5	docs/alternator: list another unimplemented Alternator feature A new feature was announced this week for Amazon DynamoDB, "multi-attribute composite keys in global secondary indexes", which allows creating GSIs with composite keys (multiple columns). This feature already existed in CQL's materialized views, but didn't exist in DynamoDB until now. So this patch adds a paragraph to our docs/alternator/compatibility.md mentioning that we don't support this DynamoDB feature yet. See also issue #27182 which we opened to track this unimplemented feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27183	2025-11-26 12:10:37 +02:00
Pavel Emelyanov	943350fd35	scripts: Add target branch checking in PR merging script Sometimes (though rarely) I call this script on mis-matching PR and current branch. E.g. trying to merge master PR into stable next, or 2025.X PR into next-2025.Y (X != Y). Typically merge fails, but it's good to catch it early. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#27249	2025-11-26 12:10:16 +02:00
Nadav Har'El	9cde93e3da	Merge 'db/view/view_building_coordinator: get rid of task's state in group0' from Michał Jadwiszczak Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability. With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead. In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments. Fixes https://github.com/scylladb/scylladb/issues/26311 This patch needs to be backported to 2025.4. Closes scylladb/scylladb#26897 * github.com:scylladb/scylladb: docs/dev/view-building-coordinator: update the docs after recent changes db/view/view_building: send coordinator's term in the RPC db/view/view_building_state: replace task's state with `aborted` flag db/view/view_building_coordinator: batch finished tasks reporting db/view/view_building_worker: change internal implementation db/view/view_building_coordinator: change `work_on_tasks` RPC return type	2025-11-26 11:35:44 +02:00
dependabot[bot]	86cd0a4dce	build(deps): bump sphinx-multiversion-scylla in /docs Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.3 to 0.3.4. --- updated-dependencies: - dependency-name: sphinx-multiversion-scylla dependency-version: 0.3.4 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#27214	2025-11-26 06:57:02 +02:00
tomek7667	9bbdd487b4	docs: insert.rst: Update insert example by removing 'year' column Closes scylladb/scylladb#26862	2025-11-26 06:55:28 +02:00
tomek7667	2138ab6b0e	docs: insert.rst: fix INSERT statement for NerdMovies example Closes scylladb/scylladb#26863	2025-11-26 06:53:45 +02:00
tomek7667	90a6aa8057	docs: ddl.rst: Fix formatting of null value note Closes scylladb/scylladb#26853	2025-11-26 06:52:18 +02:00
Botond Dénes	384bffb8da	Merge 'compaction: limit the maximum shares allocated to a compaction scheduling class' from Raphael Raph Carvalho This PR adds support for limiting the maximum shares allocated to a compaction scheduling class by the compaction controller. It introduces a new configuration parameter, compaction_max_shares, which, when set to a non zero value, will cap the shares allocated to compaction jobs. This PR also exposes the shares computed by the compaction controller via metrics, for observability purposes. Fixes https://github.com/scylladb/scylladb/issues/9431 Enhancement. No need to backport. NOTE: Replaces PR https://github.com/scylladb/scylladb/pull/26696 Ran a test in which the backlog raised the need for max shares (normalized backlog above normalization_factor), and played with different values for new option compaction_max_shares to see it works (500, 1000, 2000, 250, 50) Closes scylladb/scylladb#27024 * github.com:scylladb/scylladb: db/config: introduce new config parameter `compaction_max_shares` compaction_manager:config: introduce max_shares compaction_controller: add configurable maximum shares compaction_controller: introduce `set_max_shares()`	2025-11-26 06:51:30 +02:00
Avi Kivity	1f6e3301e7	dist: systemd: drop deprecated CPU and I/O shares/weight from scylla-server.slice The BlockIOWeight and CPUShares are deprecated. They are only used on RHEL 7, which has reached end-of-life. Their replacements, IOWeight and CPUWeight, are already set in the file. Remove the deprecated settings to reduce noise in the logs. Closes scylladb/scylladb#27222	2025-11-26 06:42:11 +02:00
Yaniv Michael Kaul	765a7e9868	gms/gossiper.cc: fix gossip log to show host-id/ip instead of host-id/host-id Probably a copy-paste error, fixes the log to print host-id/ip. Backport: no need, benign log issue. Fixes: https://github.com/scylladb/scylladb/issues/27113 Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#27225	2025-11-25 20:56:20 +01:00
Wojciech Mitros	3c376d1b64	alternator: use storage_proxy from the correct shard in executor::delete_table When we delete a table in alternator, the schema change is performed on shard 0. However, we actually use the storage_proxy from the shard that is handling the delete_table command. This can lead to problems because some information is stored only on shard 0 and using storage_proxy from another shard may make us miss it. In this patch we fix this by using the storage_proxy from shard 0 instead. Fixes https://github.com/scylladb/scylladb/issues/27223 Closes scylladb/scylladb#27224	2025-11-25 18:56:31 +01:00
Botond Dénes	584f4e467e	tools/scylla-sstable: introduce the dump-schema command There is a limited number of ways to obtain the schema of a table: 1) Use DESCRIBE TABLE in cqlsh 2) Find the schema definition in the code (for system tables) 3) Ask support/user to provide schema 4) Piece together the schema definition from the system tables Option (1) is the most convenient but requires access to a live cluster. (2) is limited to system tables only. When investigating issues for customers, we have to rely on (3) and this often adds communication round-trips and delays. (4) requires knowledge of ScyllaDB internals and access to system tables. The new dump-schema command provides a convenient way to obtain the schema of tables, given that there is access to either an sstable or the system tables. It can dump the schema of system tables without either. Closes scylladb/scylladb#26433	2025-11-25 20:32:36 +03:00
Nadav Har'El	4c7c5f4af7	alternator: implement gzip-compressed requests In this patch we implement Alternator's support for gzip-compressed requests, i.e., requests with the "Content-Encoding: gzip" header, other uncompressed headers, and a gzip-compressed body. The server needs to verify the signature of the compressed content, and then uncompress the body before running the request. We only support gzip compression because this is what DynamoDB supports. But in the future we can easily add support for other compression algorithms like lz4 or zstd. This patch Refs #5041 but doesn't "Fixes" it because it only implements compressed requests (Content-Encoding), not compressed responses (Accept-Encoding). The next patch will enable several tests for this feature and make sure it behaves like DynamoDB. Note that while we will now have support in our server for compressed requests, just like DynamoDB does, the clients (AWS SDKs) will probably NOT make use of it because they do not enable request compression by default. For example, see the tests for some hoops one needs to jump through in boto3 (the Python SDK) to send compressed requests. However, we are hoping that in the future Alternator's modified clients will use compressed requests and enjoy this feature. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-25 17:46:44 +02:00
Gleb Natapov	5dcdaa6f66	test: test that expired erm that held for too long triggers notification	2025-11-25 17:33:54 +02:00
Piotr Dulikowski	ff5c7bd960	Merge 'topology_coordinator: don't repair colocated tablets' from Michael Litvak With the introduction of colocated tables, all the tablet transitions now operate on groups of colocated tablets instead of individual tablets. This applies to tablet migration and also to tablet repair. The tablet repair currently doesn't work on individual tablets due to the limitations in the tablet map being shared. The way it was implemented to work on a group of colocated tablets is by repairing all the colocated tablets together, using a dedicated rpc, and setting a shared repair_time in the shared tablet map. It was implemented this way because we wanted to have some way to repair the tablets of a colocated table. However, we want to change this in the next release so that it will be possible to repair the tablets of a colocated table individually. In order to simplify and prepare for the future change, we prefer until then to not repair colocated tables at all. Otherwise, we will need to support both the shared repair and individual repair together for a long time, and the upgrade will be more complicated. We change the handling of the tablet 'repair' transition to repair only the base table's tablets. It means it will not be possible to request tablet repair for a non-base colocated table such as local MV, CDC and paxos table. This restriction will be temporary until a later release where we will support repairing colocated tablets. This is a reasonable restriction because repair for these kinds of tables is not required or as important as for normal tables. 
Fixes https://github.com/scylladb/scylladb/issues/27119 backport to 2025.4 since we must change it in the same version it's introduced before it's released Closes scylladb/scylladb#27120 * github.com:scylladb/scylladb: tombstone_gc: don't use 'repair' mode for colocated tables Revert "storage service: add repair colocated tablets rpc" topology_coordinator: don't repair colocated tablets	2025-11-25 14:58:06 +01:00
David Garcia	64a65cac55	docs: add metrics generation validation fix: windows gitbash support fix: new name found with no group vector_search/vector_store_client.cc 343 fix: rm allowmismatch fix: git bash (windows) compatibility fix: git bash (windows) compatibility Closes scylladb/scylladb#26173	2025-11-25 15:39:52 +03:00
Gleb Natapov	9f97c376f1	token_metadata: fix notification about expiring erm held for to long Commit `6e4803a750` broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix assign operator to call destructor.	2025-11-25 13:35:24 +02:00
Michał Jadwiszczak	fe9581f54c	docs/dev/view-building-coordinator: update the docs after recent changes Remove information about view building task state and explain the current lifetime of the task.	2025-11-25 12:14:05 +01:00
Michał Jadwiszczak	fb8cbf1615	db/view/view_building: send coordinator's term in the RPC To avoid the case where an old coordinator (which hasn't been stopped yet) dictates what should be done, add the raft term to the `work_on_view_building_tasks` RPC. The worker needs to check whether the term matches the current term from the raft server, and deny the request when it does not.	2025-11-25 12:14:05 +01:00
Michał Jadwiszczak	24d69b4005	db/view/view_building_state: replace task's state with `aborted` flag After previous commits, we can drop entire task's state and replace it with single boolean flag, which determines if a task was aborted. Once a task was aborted, it cannot get resurrected to a normal state.	2025-11-25 12:14:04 +01:00
Michał Jadwiszczak	eb04af5020	db/view/view_building_coordinator: batch finished tasks reporting In the previous implementation, to execute view building tasks the coordinator first needed to set their states to `STARTED`, and then needed to remove them before it could start the next ones. This logic required a lot of group0 commits, especially in large clusters with a higher number of nodes and a big tablet count. After the previous commit to the view building worker, the coordinator can start view building tasks without setting the `STARTED` state and deleting finished tasks. This patch adjusts the coordinator to save finished tasks locally, so it can continue to execute the next ones, while the finished tasks are periodically removed from group0 by `finished_task_gc_fiber()`.	2025-11-25 12:14:04 +01:00
dependabot[bot]	b911a643fd	build(deps): bump sphinx-scylladb-theme from 1.8.8 to 1.8.9 in /docs Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.8 to 1.8.9. - [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases) - [Commits](https://github.com/scylladb/sphinx-scylladb-theme/commits) --- updated-dependencies: - dependency-name: sphinx-scylladb-theme dependency-version: 1.8.9 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#27169	2025-11-25 11:01:37 +02:00
Botond Dénes	1263e1de54	Merge 'docs: modify debian/ubutnu installation instructions' from Yaron Kaikov To support debian13, we need to modify the installation instructions since the `apt-key` command is no longer available. Also updated the installation instructions to match the latest release. Fixes: https://github.com/scylladb/scylladb/issues/26673 No need for backport since we added debian13 only in master for now Closes scylladb/scylladb#27205 * github.com:scylladb/scylladb: install-on-linux.rst: update installation example to supported release docs: modify debian/ubutnu installation instructions	2025-11-25 10:53:11 +02:00
Nadav Har'El	bcd1758911	Merge 'vector_search: add validator tests' from Pawel Pery The vector-search-validator is a binary tool which runs functional and integration tests between scylla and vector-store. It is built in Rust, mainly in the vector-store repository. This patch adds the possibility to write tests on the scylladb repository side, compile them together with the vector-store tests and run them in the `test.py` environment. There are three parts of the change: - add sources of validator to the `test/vector_search_validator` directory - add support for building validator and vector-store in `build/vector-search-validator/bin` directory with or without cmake - add support for `pytest` and `test.py` to run validator test locally and in the CI environment; this part also adds a README to the `test/vector_search_validator` directory Design for validator integration tests: https://scylladb.atlassian.net/wiki/spaces/RND/pages/39518215/Vector+Search+Core+Test+Plan+Document References: VECTOR-50 No backport needed as this is a new functionality. Closes scylladb/scylladb#26653 * github.com:scylladb/scylladb: vector_search: add vector-search-validator tests vector_search: implement building vector-search-validator vector_search: add vector-search-validator sources	2025-11-25 10:34:33 +02:00
Michael Litvak	868ac42a8b	tombstone_gc: don't use 'repair' mode for colocated tables For tables of special types that can be colocated (MV, CDC, and paxos tables), we should not use tombstone_gc=repair mode because colocated tablets are never repaired, hence they will not have repair_time set and will never be GC'd using 'repair' mode.	2025-11-25 09:15:46 +01:00
Michael Litvak	005807ebb8	Revert "storage service: add repair colocated tablets rpc" This reverts commit `11f045bb7c`. The rpc was added together with colocated tablets in 2025.4 to support a "shared repair" operation of a group of colocated tablets that repairs all of them and allows also for special behavior as opposed to repairing a single specific tablet. It is not used anymore because we decided to not repair all colocated tablets in a single shared operation, but to repair only the base table, and in a later release support repairing colocated tables individually. We can remove the rpc in 2025.4 because it is introduced in the same version.	2025-11-25 09:06:48 +01:00
Michael Litvak	273f664496	topology_coordinator: don't repair colocated tablets With the introduction of colocated tables, all the tablet transitions now operate on groups of colocated tablets instead of individual tablets. This applies to tablet migration and also to tablet repair. The tablet repair currently doesn't work on individual tablets due to the limitations in the tablet map being shared. The way it was implemented to work on a group of colocated tablets is by repairing all the colocated tablets together, using a dedicated rpc, and setting a shared repair_time in the shared tablet map. It was implemented this way because we wanted to have some way to repair the tablets of a colocated table. However, we want to change this in the next release so that it will be possible to repair the tablets of a colocated table individually. In order to simplify and prepare for the future change, we prefer until then to not repair colocated tables at all. Otherwise, we will need to support both the shared repair and individual repair together for a long time, and the upgrade will be more complicated. We change the handling of the tablet 'repair' transition to repair only the base table's tablets. It means it will not be possible to request tablet repair for a non-base colocated table such as local MV, CDC and paxos table. This restriction will be temporary until a later release where we will support repairing colocated tablets. This is a reasonable restriction because repair for these kinds of tables is not required or as important as for normal tables. Fixes scylladb/scylladb#27119	2025-11-25 09:05:59 +01:00
Amnon Heiman	b2c2a99741	index/vector_index.cc: Don't allow zero as an index option This patch forces vector_index option values to be strictly positive numbers, as zero would make no sense. Fixes https://scylladb.atlassian.net/browse/VECTOR-249 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes scylladb/scylladb#27191	2025-11-25 10:05:44 +02:00
Karol Nowacki	ca62effdd2	vector_search: Restrict vector index tests to tablets only Vector indexes are going to be supported only for tablets (see VECTOR-322). As a result, tests using vector indexes will be failing when run with vnodes. This change ensures tests using vector indexes run exclusively with tablets. Fixes: VECTOR-49 Closes scylladb/scylladb#26843	2025-11-25 09:26:16 +02:00
Pawel Pery	9f10aebc66	vector_search: add vector-search-validator tests The commit adds functionality for `pytest` and `test.py` to run `vector-search-validator` in a `sudo unshare` environment. There are already two tests: the first, parametrized `test_validator.py::test_validator[test-case-name]`, runs the validator, and the second, `test_cargo_toml.py::test_cargo_toml`, checks that the current `Cargo.toml` for the validator is correct. Documentation for these tests is provided in `README.md`.	2025-11-24 17:26:04 +01:00
Pawel Pery	3702e982b9	vector_search: implement building vector-search-validator The commit adds targets building `build/vector-search-validator/bin/{vector-store,vector-search-validator}`. The targets must be built for tests. They don't depend on the build mode. The commit adds the target in `configure.py` and also in `cmake`.	2025-11-24 17:26:04 +01:00
Pawel Pery	e569a04785	vector_search: add vector-search-validator sources The commit adds the validator sources, which use a combination of local files and vector-store's files. `build-env` defines the vector-store git repository and the revision on which the validator will be built. `cargo-toml-template` is a script that prints the current `Cargo.toml` to stdout. After updating `build-env`, the developer needs to regenerate the configuration with `./cargo-toml-template > Cargo.toml`. The git revision is used in several places in `Cargo.toml` and will be used for building `vector-store`, so to make it easier to manage it should be set up in only one place. The validator is divided into several crates so that it can be built within both the scylladb and vector-store repositories. Here we need to create a new validator crate with a simple `main` function and call `validator_engine::main` there. We provide tests written in the scylladb repo in the `validator-scylla` crate. The commit provides an empty `cql` test case, which should be filled in later.	2025-11-24 17:26:04 +01:00
Gleb Natapov	39cec4ae45	topology: let banned node know that it is banned Currently, if a banned node tries to connect to a cluster, it fails to create connections but has no idea why, so from inside the node it looks like it has communication problems. This patch adds a new rpc, NOTIFY_BANNED, which is sent back to the node when its connection is dropped. On receiving the rpc, the node isolates itself and prints an informative message about why it did so. Closes scylladb/scylladb#26943	2025-11-24 17:12:13 +01:00
Lakshmi Narayanan Sreethar	9cb766f929	db/config: introduce new config parameter `compaction_max_shares` Add support for the new configuration parameter `compaction_max_shares`, and update the compaction manager to pass it down to the compaction controller when it changes. The shares allocated to compaction jobs will be limited by this new parameter. Fixes #9431 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 12:52:29 -03:00
Lakshmi Narayanan Sreethar	468b800e89	compaction_manager:config: introduce max_shares Introduce an updateable value `max_shares` to compaction manager's config. Also add a method `update_max_shares()` that applies the latest `max_shares` value to the compaction controller’s `max_shares`. This new variable will be connected to a config parameter in the next patch. Refs #9431 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 11:43:38 -03:00
Lakshmi Narayanan Sreethar	f2b0489d8c	compaction_controller: add configurable maximum shares Add a `max_shares` constructor parameter to compaction_controller to allow configuring the maximum output of the control points at construction time. The constructor now calls `set_max_shares()` with the provided max_shares value. The subsequent commits will wire this value to a new configuration option. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-11-24 11:43:24 -03:00
Lakshmi Narayanan Sreethar	853811be90	compaction_controller: introduce `set_max_shares()` Add a method to dynamically adjust the maximum output of control points in the compaction controller. This is required for supporting runtime configuration of the maximum shares allocated to the compaction process by the controller. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-11-24 11:43:20 -03:00
Tomasz Grabiec	d4b77c422f	Merge 'load_stats: leaving replica could be std::nullopt' from Ferenc Szili When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are not checked if they contain a value before dereferencing which could cause an exception in the topology coordinator. This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither are empty. This bug was introduced in `10f07fb95a` This change also adds the ability to create a tablet size in load_stats during end_migration stage of a tablet rebuild. We compute the new tablet size by averaging the tablet sizes of the existing replicas. This change also adds the virtual table tablet_sizes which contains tablet sizes of all the replicas of all the tablets in the cluster. A version containing this bug has not yet been released, so a backport is not needed. Closes scylladb/scylladb#27118 * github.com:scylladb/scylladb: test: add tests for tablet size migration during end_migration virtual_table: add tablet_sizes virtual table load_stats: update tablet sizes after migration or rebuild	2025-11-24 15:31:30 +01:00
Yaron Kaikov	13eca61d41	install-on-linux.rst: update installation example to supported release The installation example is out of date, since scylla-5.2 has been EOL for a long time. Update the example to a more recent release (together with the packages update).	2025-11-24 16:22:17 +02:00
Anna Stuchlik	724dc1e582	doc: fix the info about object storage This commit fixes the information about object storage: - Object storage configuration is no longer marked as experimental. - Redundant information has been removed from the description. - Information related to object storage for SSTables has been removed as the feature is not working. Fixes https://github.com/scylladb/scylladb/issues/26985 Closes scylladb/scylladb#26987	2025-11-24 17:16:33 +03:00
Yaron Kaikov	5541f75405	docs: modify debian/ubuntu installation instructions To support debian13, we need to modify the installation instructions since the `apt-key` command is no longer available Fixes: https://github.com/scylladb/scylladb/issues/26673	2025-11-24 13:33:17 +02:00
Michał Jadwiszczak	08974e1d50	db/view/view_building_worker: change internal implementation This commit doesn't change the logic behind the view building worker but it changes how the worker is executing view building tasks. Previously, the worker had a state only on shard0 and it was reacting to changes in group0 state. When it noticed some tasks were moved to `STARTED` state, the worker was creating a batch for it on the shard0 state. The RPC call was used only to start the batch and to get its result. Now, the main logic of batch management was moved to the RPC call handler. The worker has a local state on each shard and the state contains: - unique ptr to the batch - set of completed tasks - information for which views the base table was flushed So currently, each batch lives on a shard where it has its work to do exclusively. This eliminates a need to do a synchronization between shard0 and work shard, which was a painful point in previous implementation. The worker still reacts to changes in group0 view building state, but currently it's only used to observe whether any view building tasks was aborted by setting `ABORTED` state. To prepare for further changes to drop the view building task state, the worker ignores `IDLE` and `STARTED` states completely.	2025-11-24 11:12:31 +01:00
Michał Jadwiszczak	6d853c8f11	db/view/view_building_coordinator: change `work_on_tasks` RPC return type During the initial implementation of the view building coordinator, we decided that if a view building task fails locally on the worker (example reason: view update's target replica is not available), the worker will retry this work instead of reporting a failure to the coordinator. However, we left the return type of the RPC, which was telling if a task was finished successfully or aborted. But the worker doesn't need to report that a task was aborted, because it's the coordinator who decides to abort a task. So, this commit changes the return type to a list of UUIDs of completed tasks. Previously the length of the returned vector needed to be the same as the length of the vector sent in the request. Now we can drop this restriction and the RPC handler returns a list of UUIDs of completed tasks (a subset of the vector sent in the request). This change is required to drop the `STARTED` state in next commits. Since Scylla 2025.4 wasn't released yet and we're going to merge this patch before releasing, no RPC versioning or cluster feature is needed.	2025-11-24 11:12:29 +01:00
Avi Kivity	eb5e9f728c	build: lock cxxbridge-cmd version to the rest of the cxx packages rust/Cargo.toml locks the cxx packages to version 1.0.83, but install-dependencies.sh does not lock cxxbridge-cmd, part of that ecosystem. Since cxx 1.0.189 broke compatibility with 1.0.83 (understandable, as these are all sub-packages of a single repository), builds with newer cxxbridge-cmd are broken. Fix by locking cxxbridge-cmd to the same version as the other cxx subpackages. Regenerated frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz Probably better done by building cxxbridge-cmd during the build itself, but that is a deeper change. Fixes #27176 Closes scylladb/scylladb#27177	2025-11-24 07:04:53 +02:00
Avi Kivity	d6ef5967ef	tools: toolchain: prepare: replace 'reg' with 'skopeo' The prepare scripts uses 'reg' to verify we're not going to overwrite an existing image. The 'reg' command is not available in Fedora 43. Use 'skopeo' instead. Skopeo is part of the podman ecosystem so hopefully will live longer. Fixes #27178. Closes scylladb/scylladb#27179	2025-11-24 06:59:34 +02:00
Aleksandra Martyniuk	19a7d8e248	replica: database: change type of tables_metadata::_ks_cf_to_uuid If there is a lot of tables, a node reports oversized allocation in _ks_cf_to_uuid of type flat_hash_map. Change the type to std::unordered_map to prevent oversized allocations. Fixes: https://github.com/scylladb/scylladb/issues/26787. Closes scylladb/scylladb#27165	2025-11-24 06:42:40 +02:00
Botond Dénes	296d7b8595	Merge 'Enable digest+checksum verification for file based streaming' from Taras Veretilnyk This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream. These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data. For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the digest. New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches. Backport is not required, since it is a new feature Fixes #21776 Closes scylladb/scylladb#26702 * github.com:scylladb/scylladb: file_stream_test: add sstable file streaming integrity verification test cases streaming: prioritize sender-side errors in tablet_stream_files sstables: enable integrity check for data file streaming sstables: Add compressed raw streaming support sstables: Allow to read digest and checksum from user provided file instance sstables: add overload of data_stream() to accept custom file_input_stream_options	2025-11-24 06:37:27 +02:00
Aleksandra Martyniuk	76174d1f7a	cql3: reject ALTER KEYSPACE if rf of datacenter with tablets is omitted In ALTER KEYSPACE, when a datacenter name is omitted, its replication factor is implicitly set to zero with vnodes, while with tablets, it remains unchanged. ALTER KEYSPACE should behave the same way for tablets as it does for vnodes. However, this can be dangerous as we may mistakenly drop the whole datacenter. Reject ALTER KEYSPACE if it changes replication factor, but omits a datacenter that currently contains tablet replicas. Fixes: https://github.com/scylladb/scylladb/issues/25549. Closes scylladb/scylladb#25731	2025-11-24 06:36:51 +02:00
Avi Kivity	85db7b1caf	Merge 'address_map: Use more efficient and reliable replication method' from Tomasz Grabiec Primary issue with the old method is that each update is a separate cross-shard call, and all later updates queue behind it. If one of the shards has high latency for such calls, the queue may accumulate and system will appear unresponsive for mapping changes on non-zero shards. This happened in the field when one of the shards was overloaded with sstables and compaction work, which caused frequent stalls which delayed polling for ~100ms. A queue of 3k address updates accumulated, because we update mapping on each change of gossip states. This made bootstrap impossible because nodes couldn't learn about the IP mapping for the bootstrapping node and streaming failed. To protect against that, use a more efficient method of replication which requires a single cross-shard call to replicate all prior updates. It is also more reliable, if replication fails transiently for some reason, we don't give up and fail all later updates. Fixes #26865 Closes scylladb/scylladb#26941 * github.com:scylladb/scylladb: address_map: Use barrier() to wait for replication address_map: Use more efficient and reliable replication method utils: Introduce helper for replicated data structures	2025-11-23 19:15:12 +02:00
Avi Kivity	b0643f8959	Merge 'db/config: enable `ms` sstable format by default' from Michał Chojnowski Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes. Make them the new default. If we change our mind, this change can be reverted later. New functionality, and this is a drastic change. No backport needed. Closes scylladb/scylladb#26377 * github.com:scylladb/scylladb: db/config: enable `ms` sstable format by default cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format api/system: add /system/chosen_sstable_version test/cluster/dtest: reduce num_tokens to 16	2025-11-23 13:52:57 +02:00
Piotr Dulikowski	e8b0f8faa9	Merge 'vector search: Add HTTPS requests support' from Karol Nowacki vector_search: Add HTTPS support for vector store connections This commit introduces TLS encryption support for vector store connections. A new configuration option is added: - vector_store_encryption_options.truststore: path to the trust store file To enable secure connections, use the https:// scheme in the vector_store_primary_uri/vector_store_secondary_uri configuration options. Fixes: VECTOR-327 Backport to 2025.4 as this feature is expected to be available in 2025.4. Closes scylladb/scylladb#26935 * github.com:scylladb/scylladb: test: vector_search: Ensure all clients are stopped on shutdown vector_search: Add HTTPS support for vector store connections	2025-11-22 14:58:06 +01:00
Karol Nowacki	58456455e3	test: vector_search: Ensure all clients are stopped on shutdown A flaky test revealed that after `clients::stop()` was called, the `old_clients` collection was sometimes not empty, indicating that some clients were not being stopped correctly. This resulted in sanitizer errors when objects went out of scope at the end of the test. This patch modifies `stop()` to ensure all clients, including those in `old_clients`, are stopped, guaranteeing a clean shutdown.	2025-11-22 08:18:45 +01:00
Karol Nowacki	c40b3ba4b3	vector_search: Add HTTPS support for vector store connections This commit introduces TLS encryption support for vector store connections. A new configuration option is added: - vector_store_encryption_options.truststore: path to the trust store file To enable secure connections, use the https:// scheme in the vector_store_primary_uri/vector_store_secondary_uri configuration options. Fixes: VECTOR-327	2025-11-22 08:18:45 +01:00
Ferenc Szili	39711920eb	test: add tests for tablet size migration during end_migration This change adds tests for the correctness of tablet size migration during the end_migration stage. This size migration can happen for tablet migrations and for tablet rebuild.	2025-11-21 16:58:11 +01:00
Ferenc Szili	e96863be0c	virtual_table: add tablet_sizes virtual table This change adds the tablet_sizes virtual table. The contents of this table are gathered from the current load_stats data structure.	2025-11-21 16:53:28 +01:00
Ferenc Szili	cede4f66af	load_stats: update tablet sizes after migration or rebuild When migrating tablet size during the end_migration tablet transition stage, we need the pending and leaving replica hosts. The leaving and pending replicas are gathered in objects of type std::optional<tablet_replica> and are not checked if they contain a value before dereferencing which could cause an exception in the topology coordinator. This patch adds a check for leaving and pending replicas, and only performs the tablet size migration if neither are empty. This bug was introduced in `10f07fb95a` This change also adds the functionality to add the tablet size to load_stats after a tablet rebuild. We compute the average tablet size from the existing replicas, and add the new size to the pending replica.	2025-11-21 16:22:20 +01:00
Botond Dénes	38a1b1032a	Merge 'doc: update Cloud Instance Recommendations for GCP' from Anna Stuchlik This PR: - Removes n1-highmem instances from Recommended Instances. - Adds missing support for n2-highmem-96. - Updates the reference to n2 instances in the Google Cloud docs (fixes a broken link to GCP). - Adds the missing information about processors for n2-highmem-instance - Ice Lake and Cascade Lake (requested by CX). Fixes https://github.com/scylladb/scylladb/issues/25946 Fixes https://github.com/scylladb/scylladb/issues/24223 Fixes https://github.com/scylladb/scylladb/issues/23976 No backport needed if this PR is merged before 2025.4 branching. Closes scylladb/scylladb#26182 * github.com:scylladb/scylladb: doc: update information for n2-highmem instances doc: remove n1-highmem instances from Recommended Instances	2025-11-21 16:28:54 +02:00
Anna Stuchlik	dab74471cc	doc: update information for n2-highmem instances This commit updates the section for n2-highmem instances on the Cloud Instance Recommendations page - Added missing support for n2-highmem-96 - Update the reference to n2 instances in the Google Cloud docs. - Added the missing information about processors for this instance type (Ice Lake and Cascade Lake).	2025-11-21 15:13:36 +01:00
Taras Veretilnyk	3003669c96	file_stream_test: add sstable file streaming integrity verification test cases Add 'test_sstable_stream' to verify SSTable file streaming integrity check. The new tests cover both compressed and uncompressed SSTables and include: - Checksum mismatch detection verification - Digest mismatch detection verification	2025-11-21 12:52:35 +01:00
Taras Veretilnyk	77dcad9484	streaming: prioritize sender-side errors in tablet_stream_files When 'send_data_to_peer' throws and closes the sink, the peer later reports its own error, masking the original sender failure. This commit preserves the original sender exception. If the status-retrieval task throws its own error before sender task rethrows its exception, we can still propagate the original exception later.	2025-11-21 12:52:31 +01:00
Taras Veretilnyk	c8d2f89de7	sstables: enable integrity check for data file streaming This patch enables integrity check in 'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream. These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data. For compressed SSTables - where checksums are already embedded - the cost comes from reading, calculating and verifying the digest.	2025-11-21 12:52:26 +01:00
Taras Veretilnyk	18e1dbd42e	sstables: Add compressed raw streaming support Implement compressed_raw_file_data_source that streams compressed chunks without decompression while verifying checksums and calculating digests. Extends raw_stream enum to support compressed_chunks mode. This data_source implementation will be used in the next commits for file based streaming.	2025-11-21 12:52:04 +01:00
Taras Veretilnyk	c32e9e1b54	sstables: Allow to read digest and checksum from user provided file instance Add overloaded methods to read digest and checksum from user-provided file handles: - 'read_digest(file f)' - 'read_checksum(file f)' This will be useful for tablet file-based streaming to enable integrity verification, as the streaming code uses SSTable snapshots with open files to prevent missing components when SSTables are unlinked.	2025-11-21 12:51:40 +01:00
Michał Chojnowski	da51a30780	db/config: enable `ms` sstable format by default Trie-based sstable indexes are supposed to be (hopefully) a better default than the old BIG indexes. Make them the new default. If we change our mind, this change can be reverted later.	2025-11-21 12:39:46 +01:00
Michał Chojnowski	73090c0d27	cluster/dtest/bypass_cache_test: switch from highest_supported_sstable_format to chosen_sstable_format Trie-based indexes and older indexes have a difference in metrics, and the test uses the metrics to check for bypass cache. To choose the right metrics, it uses highest_supported_sstable_format, which is inappropriate, because the sstable format chosen for writes by Scylla might be different than highest_supported_sstable_format. Use chosen_sstable_format instead.	2025-11-21 12:39:46 +01:00
Michał Chojnowski	38e14d9cd5	api/system: add /system/chosen_sstable_version Returns the sstable version currently chosen for use in new sstables. We are adding it because some tests want to know what format they are writing (tests using upgradesstable, tests which check stats that only apply to one of the index types, etc). (Currently they are using `highest_supported_sstable_format` for this purpose, which is inappropriate, and will become invalid if a non-latest format is the default).	2025-11-21 12:39:46 +01:00
Wojciech Mitros	aacf883a8b	cql: add CONCURRENCY to the USING clause Currently, the PRUNE MATERIALIZED VIEW statement performs all its reads and writes in a single, continuous sequence. This takes too much time even for a moderate amount of 'PRUNED' data. Instead, we want to make it possible to set a concurrency of the reads and writes performed while processing the PRUNE statement, so that if the user so desires, it may finish the PRUNING quicker at the cost of adding more load on the cluster. In this patch we add the CONCURRENCY setting to the USING clause in cql. In the next patch, we'll be using it to actually set the concurrency of PRUNE MATERIALIZED VIEW.	2025-11-21 12:32:52 +01:00
Botond Dénes	5c6813ccd0	test/cluster/test_repair.py: add test_repair_timestamp_difference Add a test which verifies that if two nodes have the same data, with different timestamps, repair will detect and fix the diverging timestamps. All our repair tests focus on difference in data and I remember writing this test multiple times in the past to quickly verify whether this works. Time to upstream this test. Closes scylladb/scylladb#26900	2025-11-21 14:19:51 +03:00
Botond Dénes	6f79fcf4d5	tools/scylla-nodetool: dump request history on json assert A JSON assert happens when a JSON member is either missing or has an unexpected type. rapidjson has a very unhelpful "json assert failed" message for this, with a backtrace (somewhat helpful), with no other context. To help debug such errors, collect all requests sent to the API and dump them when such errors happen. The backtrace with the full request history should be enough to debug any such issues. Refs CUSTOMER-17 Closes scylladb/scylladb#26899	2025-11-21 14:17:53 +03:00
Gautam Menghani	939fcc0603	db/system_keyspace: Remove the FIXME related to caching of large tables Remove the FIXME comment for re-enabling caching of the large tables since the tables are used infrequently [1]. [1] : github.com/scylladb/scylladb/pull/26789#issuecomment-3477540364 Fixes #26032 Signed-off-by: Gautam Menghani <gautam.opensource@gmail.com> Closes scylladb/scylladb#26789	2025-11-21 12:34:34 +02:00
Radosław Cybulski	d589e68642	Add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime Closes scylladb/scylladb#26617	2025-11-21 12:27:41 +02:00
Nadav Har'El	64a075533b	alternator: fix update of stats from wrong shard In commit `51186b2` (PR #25457) we introduced new statistics for authentication errors, and among other places we modified executor::create_table() to update them when necessary. This function runs its real work (create_table_on_shard0()) on shard 0, but incorrectly updates "_stats" from the original shard. It doesn't really matter which shard's stats we update - but it does matter that code running on shard 0 shouldn't touch some other shard's objects. Since all we do on these stats is to increment an integer, the risk of updating it on the wrong shard is minimal to non-existent, but it's still wrong and can cause bigger trouble in the future as the code continues to evolve. The fix is simple - we should pass to create_table_on_shard0() the _stats object from the actual shard running it (shard 0). Fixes #26942 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26944	2025-11-21 11:53:06 +02:00
Calle Wilund	3c4546d839	messaging_service: Add internode_compression=rack as option Fixes #27085 Adds a "rack" option to enum/config and handles in connection setup in messaging_service. Closes scylladb/scylladb#27099	2025-11-21 11:50:55 +02:00
Nadav Har'El	66bd3dc22c	test/alternator: tests for request compression DynamoDB's documentation https://docs.aws.amazon.com/sdkref/latest/guide/feature-compression.html suggests that DynamoDB allows request bodies to be compressed (currently only by gzip). The purpose of this patch is to have a test reproducing this feature. The test shows us that indeed DynamoDB understands compressed requests using the "gzip" encoding, but Alternator does not, so the new test is xfail. As you can see in the test code, although the low-level SDK (botocore) can send compressed requests, this is not actually enabled for DynamoDB and we need to resort to some trickery to send compressed requests. But the point is that once we do manage to send compressed requests, the test shows us that they work properly on AWS, but fail on Alternator. The failure of the compressed requests on Alternator is reported like: An error occurred (ValidationException) when calling the PutItem operation: Parsing JSON failed: Invalid value. at 70459088 This error message should probably be improved (what is that high number?!) but of course even better would be to make it really work. By enabling tracing on alternator-server (e.g., edit test/cqlpy/run.py and add `'--logger-log-level', 'alternator-server=trace',`) we can see exactly what request the SDK sends Alternator. What we can see in the request is: 1. The request headers are uncompressed (this is expected in HTTP) 2. There is a header "Content-Encoding: gzip" 3. The request's body is binary, a full-fledged gzip output complete with a gzip magic in the beginning. Refs #5041 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#27049	2025-11-21 10:48:33 +02:00
Shreyas Ganesh	4488a4fb06	docs: document sstables quarantine subdirectory Add documentation for the quarantine/ subdirectory that holds SSTables isolated due to validation failures or corruption. Document the scrub operation's quarantine_mode parameter options and the drop_quarantined_sstables API operation. Also update the directory hierarchy example to include the quarantine directory. Fixes #10742 Signed-off-by: Shreyas Ganesh <vansi.ganeshs@gmail.com> Closes scylladb/scylladb#27023	2025-11-21 10:45:33 +02:00
Ernest Zaslavsky	825d81dde2	cmake: dont complain about deprecated builtins On clang 21.1.4 (Fedora 43) the abseil compilation started to fail with `builtin XXX is deprecated use YYY instead`. Suppress this for abseil compilation only Closes scylladb/scylladb#27098	2025-11-21 10:31:54 +02:00
Botond Dénes	0cc5208f8e	Merge 'Add sstables_manager::config' from Pavel Emelyanov Currently sstables_manager keeps a reference on global db::config to configure itself. Most of other services use their own specific configs with much less data on-board for the same purposes (e.g. #24841, #19051 and #23705 did same for other services) This PR applies this approach to sstables_manager as well. Mostly it moves various values from db::config onto newly introduced struct sstables_manager::config, but it also adds specific tracking of sstable_file_io_extensions and patches tools/scylla-sstable not to use sstables_manager as "proxy" object to get db::config from along its calls. Shuffling components dependencies, no need to backport Closes scylladb/scylladb#27021 * github.com:scylladb/scylladb: sstables_manager: Drop db::config from sstables_manager tools/sstable: Make shard_of_with_tablets use db::config argument tools/sstable: Add db::config& to all operations tools/sstable: Get endpoints from storage manager sstables_manager: Hold sstable IO extensions on it sstables: Manager helper to grab file io extensions sstables_manager: Move default format on config sstables_manager: Move enable_sstable_data_integrity_check on config sstables_manager: Move data_file_directories on config sstables_manager: Move components_memory_reclaim_threshold on config sstables_manager: Move column_index_auto_scale_threshold on config sstables_manager: Move column_index_size on config sstables_manager: Move sstable_summary_ratio on config sstables_manager: Move enable_sstable_key_validation on config sstables_manager: Move available_memory on config code: Introduce sstables_manager::config sstables: Patch get_local_directories() to work on vector of paths code: Rename sstables_manager::config() into db_config()	2025-11-21 10:21:41 +02:00
Botond Dénes	f89bb68fe2	Merge 'cdc: Preserve properties when reattaching log table' from Dawid Mędrek When we enable CDC on a table, Scylla creates a log table for it. It has default properties, but the user may change them later on. Furthermore, it's possible to detach that log table by simply disabling CDC on the base table: ```cql /* Create a table with CDC enabled. The log table is created. */ CREATE TABLE ks.t (pk int PRIMARY KEY) WITH cdc = {'enabled': true}; /* Detach the log table. */ ALTER TABLE ks.t WITH cdc = {'enabled': false}; /* Modify a property of the log table. */ ALTER TABLE ks.t_scylla_cdc_log WITH bloom_filter_fp_chance = 0.13; ``` The log table can also be reattached by enabling CDC on the base table again: ```cql /* Reattach the log table */ ALTER TABLE ks.t WITH cdc = {'enabled': true}; ``` However, because the process of reattachment goes through the same code that created it in the first place, the properties of the log table are rolled back to their default values. This may be confusing to the user and, if unnoticed, also have other consequences, e.g. affecting performance. To prevent that, we ensure that the properties are preserved. A reproducer test, `test_log_table_preserves_properties_after_reattachment`, has been provided to verify that the changes are correct. Another test, `test_log_table_preserves_id_after_reattachment`, has also been added because the current implementation sets properties and the ID separately. Fixes scylladb/scylladb#25523 Backport: not necessary. Although the behavior may be unexpected, it's not a bug per se. Closes scylladb/scylladb#26443 * github.com:scylladb/scylladb: cdc: Preserve properties when reattaching log table cdc: Extract creating columns in CDC log table to dedicated function cdc: Extract default properties of CDC log tables to dedicated function schema/schema_builder.hh: Add set_properties schema: Add getter for schema::user_properties schema: Remove underscores in fields of schema::user_properties schema: Extract user properties out of raw_schema	2025-11-21 10:06:05 +02:00
Calle Wilund	03408b185e	utils::gcp::object_storage: Fix buffer alignment reordering trailing data Fixes #26874 Due to certain people (me) not being able to tell forward from backward, the data alignment to ensure partial uploads adhere to the 256k-align rule would potentially _reorder_ trailing buffers generated iff the source buffers input into the sink are small enough. Which, as a fun fact, they are in backup upload. Change the unit test to use raw sink IO and add two unit tests (of which the smaller size provokes the bug) that checks the same 64k buf segmented upload backup uses. Closes scylladb/scylladb#26938	2025-11-21 09:36:13 +02:00
Radosław Cybulski	ce8db6e19e	Add table name to tracing in alternator Add a table name to Alternator's tracing output, as some clients would like to consistently receive this information. - add missing `tracing::add_table_name` in `executor::scan` - add emitting tables' names in `trace_state::build_parameters_map` - update tests, so when tracing is looked for it is filtered by table's name, which confirms the table name is being output. - change `struct one_session_records` declaration to `class one_session_records`, as `one_session_records` is later defined as class. Refs #26618 Fixes #24031 Closes scylladb/scylladb#26634	2025-11-21 09:33:40 +02:00
Michał Chojnowski	3f11a5ed8c	test/cluster/dtest: reduce num_tokens to 16 cluster.dtest_alternator_tests.test_slow_query_logging performs a bootstrap with 768 token ranges. It works with `me` sstables, which have 2 open file descriptors per open sstable, but with `ms` sstables, which have 3 open file descriptors per open sstable, it fails with EMFILE. To avoid this problem, let's just decrease the number of vnodes in the test suite. It's appropriate anyway, because it avoids some unneeded work without weakening the tests. (Note: pylib-based tests have been setting `num_tokens` to 16 for a long time too). This breaks `bypass_cache_test`, which is written in a way that expects a certain number of token ranges. We adjust the relevant parameter accordingly.	2025-11-21 00:38:50 +01:00
Piotr Dulikowski	22f22d183f	Merge 'Refine sstables_loader vs database dependency' from Pavel Emelyanov There are two issues with it. First, a small RPC helper struct carries database reference on board just to get feature state from it. Second, sstable_streamer uses database as proxy to feature service. This PR improves both. Services dependencies improvement, not need to backport Closes scylladb/scylladb#26989 * github.com:scylladb/scylladb: sstables_loader: Get LOAD_AND_STREAM_ABORT_RPC_MESSAGE from messaging sstables_loader: Keep bool on send_meta, not database reference	2025-11-20 16:13:16 +01:00
Asias He	d51b1fea94	tablets: Allow tablet merge when repair tasks exist Currently we do not allow tablet merge if either of the tablets contains a tablet repair request. This could block the tablet merge for a very long time if the repair requests could not be scheduled and executed. We can actually merge the repair tasks in most cases. This is because most of the time all tablets are requested to be repaired by a single API request, so they share the same task_id, request_type and other parameters. We can merge the repair task info and execute the repair after the merge. If they do not share the task info, we cannot merge and have to wait for the repair before merge, which is both rare and ok. Another case is that one of the tablets (t1) has a repair task info while the other tablet (t2) does not; it is possible that t2 has finished repair by the same repair request or that t2 is not requested to be repaired at all. We allow merge in this case too to avoid blocking the tablet merge, at the price of repairing a bit more. Fixes #26844 Closes scylladb/scylladb#26922	2025-11-20 16:01:23 +01:00
Asias He	3cf1225ae6	docs: Add feature page for incremental repair Adds a new documentation page for the incremental repair feature. The page covers: - What incremental repair is and its benefits over the standard repair process. - How it works at a high level by tracking the repair status of SSTables. - The prerequisite of using the tablets architecture. - The different user-configurable modes: 'regular', 'full', and 'disabled'. Fixes #25600 Closes scylladb/scylladb#26221	2025-11-20 11:58:53 +02:00
Raphael S. Carvalho	74ecedfb5c	replica: Fail timed-out single-key read on cleaned up tablet replica Consider the following: 1) single-key read starts, blocks on replica e.g. waiting for memory. 2) the same replica is migrated away 3) single-key read expires, coordinator abandons it, releases erm. 4) migration advances to cleanup stage, barrier doesn't wait on timed-out read 5) compaction group of the replica is deallocated on cleanup 6) that single-key read resumes, but doesn't find sstable set (post cleanup) 7) with abort-on-internal-error turned on, node crashes It's fine for abandoned (= timed out) reads to fail, since the coordinator is gone. For active reads (non timed out), the barrier will wait for them since their coordinator holds erm. This solution consists of failing reads whose underlying tablet replica has been cleaned up, by just converting internal error to plain exception. Fixes #26229. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#27078	2025-11-20 11:44:03 +02:00
Geoff Montee	a0734b8605	Update update-topology-strategy-from-simple-to-network.rst: Multiple clarifications to page and sub-procedures Fixes #27077 Multiple points can be clarified relating to: * Names of each sub-procedure could be clearer * Requirements of each sub-procedure could be clearer * Clarify which keyspaces are relevant and how to check them * Fix typos in keyspace name Closes scylladb/scylladb#26855	2025-11-20 11:33:15 +02:00
Patryk Jędrzejczak	45ad93a52c	topology_coordinator: include all transitioning nodes in all global commands This change makes the code simpler and less vulnerable to regressions. There is no functional impact because: - we already include a decommissioning/bootstrapping/replacing node for `barrier` and `barrier_and_drain`, - we never execute global commands in the presence of a rebuilding node, - removing node always belongs to `exclude_nodes`, so it's filtered out anyway, - we execute global `stream_ranges` only for removenode, - we execute global `wait_for_ip` only for new nodes when there are no transitioning nodes. Fixes #20272 Fixes #27066 Closes scylladb/scylladb#27102	2025-11-20 11:11:32 +02:00
dependabot[bot]	2ca926f669	build(deps): bump sphinx-multiversion-scylla in /docs Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.2 to 0.3.3. --- updated-dependencies: - dependency-name: sphinx-multiversion-scylla dependency-version: 0.3.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#27081	2025-11-20 10:28:34 +02:00
Gleb Natapov	ad3cf2c174	utils: fix get_random_time_UUID_from_micros to generate correct time uuid According to the IETF spec, the uuid variant bits should be set to '10'. All other values are either invalid or reserved. The patch changes the code to follow the spec. Closes scylladb/scylladb#27073	2025-11-20 10:27:29 +02:00
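For illustration, a small Python sketch of what setting the variant bits to '10' means at the byte level. `set_rfc4122_variant` is a hypothetical helper, not the ScyllaDB function named in the commit above; it only shows the bit manipulation on byte 8 (the `clock_seq_hi_and_reserved` field) and checks the result with the standard `uuid` module.

```python
import uuid


def set_rfc4122_variant(raw: bytes) -> bytes:
    """Return a copy of 16 raw UUID bytes with the variant bits forced to '10'."""
    if len(raw) != 16:
        raise ValueError("a UUID is exactly 16 bytes")
    b = bytearray(raw)
    # Clear the top two bits of byte 8, then set them to binary 10.
    b[8] = (b[8] & 0x3F) | 0x80
    return bytes(b)


# Whatever the input bits were, the standard library now reports the
# RFC 4122 variant.
fixed = uuid.UUID(bytes=set_rfc4122_variant(b"\xff" * 16))
```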
Avi Kivity	5d761373c2	Update tools/cqlsh submodule * tools/cqlsh 19445a5...2240122 (3): > copyutil: adjust multiprocessing method to 'fork' > Drop serverless/cloudconf feature > Migrate workflows to Blacksmith Closes scylladb/scylladb#27065	2025-11-20 10:26:43 +02:00
Taras Veretilnyk	e5fbe3d217	docs: improve documentation of the scrub Update nodetool scrub documentation to include --quarantine-mode and --drop-unfixable-sstables options, add a section explaining quarantine modes and provide examples and procedures for handling and removing corrupted SSTables. Closes scylladb/scylladb#27018	2025-11-20 10:26:07 +02:00
Nadav Har'El	a9cf7d08da	test/cqlpy: remove USE from test, and test a USE-related bug One of the tests in test_describe.py used "USE {test_keyspace}" which affects the CQL session shared by all tests in an unrecoverable way (there is no "UNUSE" statement). As an example of what might happen if the shared CQL session is "polluted" by a USE, issue #26334 is about a bug we have in DESC KEYSPACES when USE is active. So in this patch, we: 1. Fix the test to not use USE on the shared CQL session - it's easy to create a separate session to use the "USE" on. With this fix, the test no longer leaves the shared CQL session in a "USE" state. 2. Add a new xfailing test to reproduce the DESC KEYSPACES bug. Refs #26334 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26345	2025-11-20 10:25:03 +02:00
Botond Dénes	a084094c18	Merge 'alternator and cql: tests for default sstable compression' from Nadav Har'El The purpose of this two-patch series is to reproduce a previously unknown bug, Refs #26914. Recently we saw a lot of patches that change how we create new schemas (keyspaces and tables), sometimes changing various long-time defaults. We started to worry that perhaps some of these defaults were applied only to CQL base tables and perhaps not to Alternator or to CQL's auxiliary tables (materialized views, secondary indexes, or CDC logs). For example, in Refs #26307 we wondered if perhaps the default "speculative_retry" option is different in Alternator than in CQL. The first patch includes Alternator tests, and the second CQL tests. In both tests we discover that although recently (commit `adf9c42`, Refs #26610) we changed the default sstable compressor from LZ4Compressor to LZ4WithDictsCompressor, actually this change was only applied to CQL base tables. All Alternator tables and all CQL auxiliary tables (views, indexes, CDC) still use the old LZ4Compressor. This is issue #26914. Closes scylladb/scylladb#26915 * github.com:scylladb/scylladb: test/cqlpy: test compression setting for auxiliary table test/alternator: tests for schema of Alternator table	2025-11-20 10:24:31 +02:00
Karol Nowacki	104de44a8d	vector_search: Add support for secondary vector store clients This change adds support for secondary vector store clients, typically located in different availability zones. Secondary clients serve as fallback targets when all primary clients are unavailable. New configuration option allows specifying secondary client addresses and ports. Fixes: VECTOR-187 Closes scylladb/scylladb#26484	2025-11-20 08:37:18 +01:00
Pavel Emelyanov	1cabc8d9b0	Merge 'streaming: fix loop break condition in tablet_sstable_streamer::stream' from Ernest Zaslavsky When streaming SSTables by tablet range, the original implementation of tablet_sstable_streamer::stream may break out of the loop too early when encountering a non-overlapping SSTable. As a result, subsequent SSTables that should be classified as partially contained are skipped entirely. Tablet range: [4, 5] SSTable ranges: [0,5] [0, 3] <--- is considered exhausted, and causes skip to next tablet [2, 5] <--- is missed for range [4, 5] The loop uses if (!overlaps) break; semantics, which conflated “no overlap” with “done scanning.” This caused premature termination when an SSTable did not overlap but the following one did. Correct logic should be: before(sst_last) → skip and continue. after(sst_first) → break (no further SSTables can overlap). Otherwise → `contains` to classify as full or partial. Missing SSTables in streaming and potential data loss or incomplete streaming in repair/streaming operations. 1. Correct the loop termination logic that previously caused certain SSTables to be prematurely excluded, resulting in lost mutations. This change ensures all relevant SSTables are properly streamed and their mutations preserved. 2. Refactor the loop to use before() and after() checks explicitly, and only break when the SSTable is entirely after the tablet range 3. Add pytest to cover this case, full streaming flow by means of `restore` 4. Add boost tests to test the new refactored function This data corruption fix should be ported back to 2024.2, 2025.1, 2025.2, 2025.3 and 2025.4 Fixes: https://github.com/scylladb/scylladb/issues/26979 Closes scylladb/scylladb#26980 * github.com:scylladb/scylladb: streaming: fix loop break condition in tablet_sstable_streamer::stream streaming: add pytest case to reproduce mutation loss issue	2025-11-20 10:16:17 +03:00
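The skip / break / classify rule in the merge description above can be sketched in Python. This is an illustration only: `classify_sstables` and `Overlap` are hypothetical names, ranges are plain inclusive integer pairs, and the sketch assumes the SSTables are sorted by their first token, which is what makes the early `break` safe.

```python
from enum import Enum


class Overlap(Enum):
    FULL = "full"        # SSTable lies entirely inside the tablet range
    PARTIAL = "partial"  # SSTable overlaps the tablet range only in part


def classify_sstables(tablet, sstables):
    """Classify SSTables against one tablet range.

    tablet:   (first, last), inclusive
    sstables: list of (first, last), inclusive, sorted by first token
    """
    t_first, t_last = tablet
    result = []
    for sst_first, sst_last in sstables:
        if sst_last < t_first:
            # SSTable ends before the tablet starts: skip it, keep scanning.
            continue
        if sst_first > t_last:
            # SSTable starts after the tablet ends: no later one can overlap.
            break
        fully = t_first <= sst_first and sst_last <= t_last
        result.append(((sst_first, sst_last),
                       Overlap.FULL if fully else Overlap.PARTIAL))
    return result
```

For the example in the description (tablet range [4, 5], SSTables [0, 5], [0, 3], [2, 5]), the middle SSTable is skipped with `continue`, so [2, 5] is still reached and classified as partial.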
Piotr Dulikowski	dc7944ce5c	Merge 'vector_search: Fix error handling and status parsing' from Karol Nowacki vector_search: Fix error handling and status parsing This change addresses two issues in the vector search client that caused validator test failures: incorrect handling of 5xx server errors and faulty status response parsing. 1. 5xx Error Handling: Previously, a 5xx response (e.g., 503 Service Unavailable) from the underlying vector store for an `/ann` search request was incorrectly interpreted as a node failure. This would cause the node to be marked as down, even for transient issues like an index scan being in progress. This change ensures that 5xx errors are treated as transient search failures, not node failures, preventing nodes from being incorrectly marked as down. 2. Status Response Parsing: The logic for parsing status responses from the vector store was flawed. This has been corrected to ensure proper parsing. Fixes: SCYLLADB-50 Backport to 2025.4 as this problem is present on this branch. Closes scylladb/scylladb#27111 * github.com:scylladb/scylladb: vector_search: Don't mark nodes as down on 5xx server errors test: vector_search: Move unavailable_server to dedicated file vector_search: Fix status response parsing	2025-11-20 08:14:28 +01:00
Botond Dénes	6ee0f1f3a7	Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski This patch adds a metric for pre-compression size of sstable files. This patch adds a per-table metric `scylla_column_family_total_disk_space_before_compression`, which measures the hypothetical total size of sstables on disk, if Data.db was replaced with an uncompressed equivalent. As for the implementation: Before the patch, tables and sstable sets are already tracking their total physical file size. Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats. To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size. New functionality, no backport needed. Closes scylladb/scylladb#26996 * github.com:scylladb/scylladb: replica/table: add a metric for hypothetical total file size without compression replica/table: keep track of total pre-compression file size	2025-11-20 09:10:38 +02:00
Karol Nowacki	9563d87f74	vector_search: Don't mark nodes as down on 5xx server errors For an `/ann` search request, a 5xx server response does not indicate that the node is down. It can signify a transient state, such as the index full scan being in progress. Previously, treating a 503 error as a node fault would cause the node to be incorrectly marked as down, for example, when a new index was being created. This commit ensures that such errors are treated as transient search failures, not node failures.	2025-11-20 08:10:20 +01:00
Karol Nowacki	366ecef1b9	test: vector_search: Move unavailable_server to dedicated file The unavailable_server code will be reused in upcoming client unit tests.	2025-11-20 08:09:21 +01:00
Benny Halevy	8ed36702ae	Update seastar submodule * seastar 63900e03...340e14a7 (19): > Merge 'rpc: harden sink_impl::close()' from Benny Halevy rpc: sink_impl::close: fixup indentation rpc: harden sink_impl::close() > http: Document the way "unread body bytes" accounting works > net: tighten port load balancing port access > coroutine: reimplement generator with buffered variant > Merge 'Stop using net::packet in posix data sink' from Pavel Emelyanov net/posix-stack: Don't use packet in posix_data_sink_impl reactor: Move fragment-vs-iovec static assertion reactor: Make backend::sendmsg() calls use std::span<iovec> utils: Introduce iovec_trim_front helper utils: Spannize iovec_len() > Merge 'Generalize memory data sink in tests' from Pavel Emelyanov test: Make output_stream_test splitting test case use own sink test: Make some output_stream_test cases use memory data sink test: Threadify one of output_stream_test test cases test: Make json_formatter_test use memory_data_sink test: Move memory_data_sink to its own header > dns: avoid using deprecated c-ares API > reactor: Move read_directory() to posix_file_impl > Merge 'rpc: sink_impl: batch sending and deletion of snd_buf:s' from Benny Halevy test: rpc_test: add test_rpc_stream_backpressure_across_shards reactor: add abort_on_too_long_task_queue option rpc: make sink flush and close noexcept rpc: sink_impl: batch sending and deletion of snd_buf:s rpc: move sink_impl and source_impl into internal namespace rpc: sink_impl: extend backpressure until snd_buf destroy > configure.py: fix --api-level help > Merge 'Close http client connection if handler doesn't consume all chunked-encoded body' from Pavel Emelyanov test: Fix indentation after previous patch test/http: Extend test for improper client handling of aborted requests test/http: Ignore EPIPE exception from server closing connection test/http: Split the partial response body read test http: Track "remaining bytes" for chunked_source_impl http: Switch 
content_length_source_impl to update remaining bytes > metrics: Add default ~config() > headers: Remove smp.hh from app-template.hh > prometheus: remove hostname and metric_help config > rpc: Tune up connection methods visibility > perf_tests: Fix build with fmt 12.0.0 by avoiding internal functions > doc: Fix some typos in codying style > reactor: Remove unused try_sleep() method directory_lister::get is adjusted in this patch to use the new experimental::coroutine::generator interface that was changed in scylladb/seastar@81f2dc9dd9 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26913	2025-11-20 07:29:47 +03:00
Pavel Emelyanov	53b71018e8	Merge 'Alternator: additional tests for ExclusiveStartKey' from Nadav Har'El In pull request #26960 there was some discussion about what is the valid form of ExclusiveStartKey, and whether we need to allow some "non-standard" uses of it for scan over system tables (which aren't real Alternator tables and may have multiple key columns, something not possible in normal Alternator tables). This made me realize our tests for what is allowed - and what is not allowed - in ExclusiveStartKey - are very sparse and don't cover all the cases that are possible in Scan and Query, in base tables and in GSIs. So this small series attempts to increase the coverage of the tests for ExclusiveStartKey to make sure we are compatible with DynamoDB and also that we don't regress in #26960. The new tests reproduce a previously unknown error-path issue, #26988, where in some cases DynamoDB considers ExclusiveStartKey to be invalid but Alternator erroneously accepts it. Fortunately, we didn't find any success-path (correctness) bugs. Closes scylladb/scylladb#26994 * github.com:scylladb/scylladb: test/alternator: tests for ExclusiveStartKey in GSI test/alternator: more tests for ExclusiveStartKey in Scan test/alternator: more tests for ExclusiveStartKey in Query	2025-11-20 07:21:39 +03:00
Avi Kivity	0d68512b1f	stall_free: make variadic dispose_gently sequential Having variadic dispose_gently() clear inputs concurrently serves no purpose, since this is a CPU bound operation. It will just add more tasks for the reactor to process. Reduce disruption to other work by processing inputs sequentially. Closes scylladb/scylladb#26993	2025-11-20 07:16:16 +03:00
Benny Halevy	fd81333181	test/pylib/cpp: increase max-networking-io-control-blocks value Increase the value of the max-networking-io-control-blocks option for the cpp tests as it is too low and causes flakiness as seen in vector_search.vector_store_client_test.vector_store_client_single_status_check_after_concurrent_failures: ``` seastar/src/core/reactor_backend.cc:342: void seastar::aio_general_context::queue(linux_abi::iocb *): Assertion `last < end` failed. ``` See also https://github.com/scylladb/seastar/issues/976 Fixes #27056 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27117	2025-11-20 04:31:36 +01:00
Ernest Zaslavsky	dedc8bdf71	streaming: fix loop break condition in tablet_sstable_streamer::stream Correct the loop termination logic that previously caused certain SSTables to be prematurely excluded, resulting in lost mutations. This change ensures all relevant SSTables are properly streamed and their mutations preserved.	2025-11-19 17:32:49 +02:00
Tomasz Grabiec	f83c4ffc68	address_map: Use barrier() to wait for replication More efficient than 100 pings. There was one ping in the test which was done "so this shard notices the clock advance". It's not necessary, since observing a completed SMP call implies that the local shard sees the clock advancement done within it.	2025-11-19 15:21:02 +01:00
Tomasz Grabiec	4a85ea8eb2	address_map: Use more efficient and reliable replication method The primary issue with the old method is that each update is a separate cross-shard call, and all later updates queue behind it. If one of the shards has high latency for such calls, the queue may accumulate and the system will appear unresponsive to mapping changes on non-zero shards. This happened in the field when one of the shards was overloaded with sstables and compaction work, which caused frequent stalls which delayed polling for ~100ms. A queue of 3k address updates accumulated. This made bootstrap impossible, since nodes couldn't learn about the IP mapping for the bootstrapping node and streaming failed. To protect against that, use a more efficient method of replication which requires a single cross-shard call to replicate all prior updates. It is also more reliable: if replication fails transiently for some reason, we don't give up and fail all later updates. Fixes #26865 Fixes #26835	2025-11-19 15:21:02 +01:00
Tomasz Grabiec	ed8d127457	utils: Introduce helper for replicated data structures Key goals: - efficient (batching updates) - reliable (no lost updates) Will be used in data structures maintained on one designated owning shard and replicated to other shards.	2025-11-19 15:21:02 +01:00
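A rough Python sketch of the batching idea behind the address_map commits above, under loud assumptions: `ReplicatedMap` is a hypothetical single-threaded model, not the Seastar-based helper, and "shards" are just dictionaries. It shows the two stated goals: one replication call carries every update the replica has not yet seen, and a failed call loses nothing because the replica's applied position only advances on success.

```python
class ReplicatedMap:
    """Owner keeps an ordered update log; each replica tracks how far it applied."""

    def __init__(self, n_replicas):
        self.log = []                                  # updates on the owning shard
        self.replicas = [dict() for _ in range(n_replicas)]
        self.applied = [0] * n_replicas                # per-replica log position

    def update(self, key, value):
        # Cheap on the owner: no cross-shard call per update.
        self.log.append((key, value))

    def replicate(self, i, fail=False):
        """Apply all pending updates to replica i in a single call.

        `fail=True` simulates a transient failure: nothing is applied and
        the position is unchanged, so the next call retries the whole batch.
        """
        if fail:
            return 0
        pending = self.log[self.applied[i]:]
        for key, value in pending:
            self.replicas[i][key] = value
        self.applied[i] += len(pending)
        return len(pending)
```

In this model a slow replica catches up on thousands of queued updates with one call instead of one call per update.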
Michał Chojnowski	d8e299dbb2	sstables/trie/trie_writer: free nodes after they are flushed Somehow, the line of code responsible for freeing flushed nodes in `trie_writer` is missing from the implementation. This effectively means that `trie_writer` keeps the whole index in memory until the index writer is closed, which for many datasets is a guaranteed OOM. Fix that, and add a test that catches this. Fixes scylladb/scylladb#27082 Closes scylladb/scylladb#27083	2025-11-19 14:54:16 +02:00
Karol Nowacki	05b9cafb57	vector_search: Fix status response parsing The response was incorrectly parsed as a plain string and compared directly with a C++ string. However, the body contains a JSON string, which includes quote characters that caused comparison failures.	2025-11-19 10:02:05 +01:00
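To illustrate the parsing problem described above, a minimal Python sketch. The status value `SERVING` and the helper name `parse_status` are assumptions made for the example; the commit does not say which values the vector store returns. The point is only that a JSON string body carries its own quote characters, so it must be decoded before it is compared.

```python
import json


def parse_status(body: str) -> str:
    """Decode a response body that holds a single JSON string."""
    value = json.loads(body)
    if not isinstance(value, str):
        raise ValueError("expected a JSON string")
    return value


body = '"SERVING"'                      # hypothetical body: quotes are part of it
naive_match = (body == "SERVING")       # False: compares against the raw text
parsed_match = (parse_status(body) == "SERVING")  # True after JSON decoding
```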
Nadav Har'El	7b9428d8d7	test/cqlpy: test compression setting for auxiliary table In the previous patch we noticed that although recently (commit `adf9c42`, Refs #26610) we changed the default sstable compressor from LZ4Compressor to LZ4WithDictsCompressor, this change was only applied to CQL, not to Alternator. In this patch we add tests that demonstrate that it's even worse - the new compression only applies to CQL's base table - all the "auxiliary" tables - * Materialized views * Secondary index's materialized views * CDC log tables all still have the old LZ4Compressor, different from the base table's default compressor. The new test fails on Scylla, reproducing #26914, and passes on Cassandra (on Cassandra, we only compare the materialized view table, because SI and CDC is implemented differently). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-19 09:18:37 +02:00
Nadav Har'El	11f6a25d44	test/alternator: tests for schema of Alternator table This patch introduces a new test that exposed a previously unknown bug, Refs #26914: Recently we saw a lot of patches that change how we create new schemas (keyspaces and tables), sometimes changing various long-time defaults. We started to worry that perhaps some of these defaults were applied only to CQL and not to Alternator. For example, in Refs #26307 we wondered if perhaps the default "speculative_retry" option is different in Alternator than in CQL. This patch includes a new test file test/alternator/test_cql_schema.py, with tests for verifying how Alternator configures the underlying tables it creates. This test shows that the "speculative_retry" doesn't have this suspected bug - it defaults to "99.0PERCENTILE" in both CQL and Alternator. But unfortunately, we do have this bug with the "compression" option: It turns out that recently (commit `adf9c42`, Refs #26610) we changed the default sstable compressor from LZ4Compressor to LZ4WithDictsCompressor, but the change was only applied to CQL, not Alternator. So the test that "compression" is the same in both fails - it is marked "xfail" and I created a new issue to track it - #26914. Another test verifies that Alternator's "auxiliary" tables - holding GSIs, LSIs and Streams - have the same default properties as the base table. This currently seems to hold (there is no bug). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-19 09:18:37 +02:00
Pavel Emelyanov	4d5f7a57ea	sstables_loader: Get LOAD_AND_STREAM_ABORT_RPC_MESSAGE from messaging The feature in question is about the way streaming sink-and-source operate. Since sink-and-source itself are obtained from messaging service, the feature is better be fetched from it too. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-19 09:35:54 +03:00
Pavel Emelyanov	64e099f03b	sstables_loader: Keep bool on send_meta, not database reference The send_meta helper needs a database reference to get feature_service from it (to check some feature state). That's too much: features are "immutable" through the loader lifetime, so it's enough to keep a boolean member on send_meta. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-19 09:32:56 +03:00
Patryk Jędrzejczak	e35ba974ce	test: test_raft_recovery_stuck: ensure mutual visibility before using driver Not waiting for nodes to see each other as alive can cause the driver to fail the request sent in `wait_for_upgrade_state()`. scylladb/scylladb#19771 has already replaced concurrent restarts with `ManagerClient.rolling_restart()`, but it has missed this single place, probably because we do concurrent starts here. Fixes #27055 Closes scylladb/scylladb#27075	2025-11-19 05:54:12 +01:00
David Garcia	3f2655a351	docs: add liveness::MustRestart support Closes scylladb/scylladb#27079	2025-11-18 15:28:55 +01:00
Szymon Wasik	f714876eaf	Add documentation about lack of returning similarity distances This patch adds the missing warning about the lack of possibility to return the similarity distance. This will be added in the next iteration. Fixes #27086 It has to be backported to 2025.4 as this is the limitation in 2025.4. Closes scylladb/scylladb#27096	2025-11-18 13:50:36 +01:00
Ernest Zaslavsky	656ce27e7f	streaming: add pytest case to reproduce mutation loss issue Introduce a test that demonstrates mutation loss caused by premature loop termination in tablet_sstable_streamer::stream. The code broke out of the SSTable iteration when encountering a non-overlapping range, which skipped subsequent SSTables that should have been partially contained. This test showcases the problem only. Example: Tablet range: [4, 5] SSTable ranges: [0,5] [0, 3] <--- is considered exhausted, and causes skip to next tablet [2, 5] <--- is missed for range [4, 5]	2025-11-18 09:34:41 +02:00
Avi Kivity	f7413a47e4	sstables: writer: avoid recursion in variadic write() Following `9b6ce030d0` ("sstables: remove quadratic (and possibly exponential) compile time in parse()"), where we removed recursion in reading, we do the same here for variadic write. This results in a small reduction in compile time. Note the problem isn't very bad here. This is tail-recursion, so likely removed by the compiler during optimization, and we don't have additional amplification due to future::then() double-compiling the ready-future and unready-future paths. Still, better to avoid quadratic compile times. Closes scylladb/scylladb#27050	2025-11-18 08:17:17 +02:00
Botond Dénes	2ca66133a4	Revert "db/config: don't use RBNO for scaling" This reverts commit `43738298be`. This commit causes instability in dtests. Several non-gating dtests started failing, as well as some gating ones, see #27047. Closes scylladb/scylladb#27067 Fixes #27047	2025-11-18 08:17:17 +02:00
Botond Dénes	0dbad38eed	Merge 'docs/dev/topology-over-raft: make various updates' from Patryk Jędrzejczak The updates include: - adding missing parts like topology states and table rows, - documenting zero-token nodes, - replacing the old recovery procedure with the new one. Fixes #26412 Updates of internal docs (usually read on master) don't require backporting. Closes scylladb/scylladb#27022 * github.com:scylladb/scylladb: docs/dev/topology-over-raft: update the recovery section docs/dev/topology-over-raft: document zero-token nodes docs/dev/topology-over-raft: clarify the lack of tablet-specific states docs/dev/topology-over-raft: add the missing join_group0 state docs/dev/topology-over-raft: update the topology columns	2025-11-18 08:17:17 +02:00
Patryk Jędrzejczak	adaa0560d9	Merge 'Automatic cleanup improvements' from Gleb Natapov This series allows an operator to reset the 'cleanup needed' flag if they have already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset the 'cleanup needed' flag in the end), but the new '--global' option allows running cleanup simultaneously on all nodes that need it. Fixes https://github.com/scylladb/scylladb/issues/26866 Backport to all supported versions, since the current automatic cleanup behaviour may create load that the operator does not expect during cluster resizing. Closes scylladb/scylladb#26868 * https://github.com/scylladb/scylladb: cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster cleanup: Add RESTful API to allow reset cleanup needed flag	2025-11-18 08:17:17 +02:00
Pavel Emelyanov	02513ac2b8	alternator: Get feature service from proxy directly The executor::add_stream_options() obtains local database reference from proxy just to get feature service from it. Similar chain is used in executor::update_time_to_live(). It's shorter to get features from proxy itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26973	2025-11-18 08:17:16 +02:00
Botond Dénes	514c1fc719	Merge 'db: batchlog_manager: update _last_replay only if all batches were re…' from Aleksandra Martyniuk …played Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection. Update _last_replay only if all batches were replayed. Fixes: https://github.com/scylladb/scylladb/issues/24415. Needs backport to all live versions. Closes scylladb/scylladb#26793 * github.com:scylladb/scylladb: test: extend test_batchlog_replay_failure_during_repair db: batchlog_manager: update _last_replay only if all batches were replayed	2025-11-18 08:17:16 +02:00
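The rule in the batchlog_manager merge above (advance `_last_replay` only when every batch was replayed) can be sketched as follows. `BatchlogReplayer` and `replay_all` are hypothetical names for illustration; the real code is C++ and asynchronous.

```python
class BatchlogReplayer:
    """Hypothetical stand-in for the replay bookkeeping, not ScyllaDB's class."""

    def __init__(self):
        self.last_replay = None  # time of the last fully successful replay

    def replay_all(self, batches, replay_one, now):
        all_replayed = True
        for batch in batches:
            try:
                replay_one(batch)
            except Exception:
                # Keep replaying the rest, but remember the pass was incomplete.
                all_replayed = False
        if all_replayed:
            # Only a complete pass may move the timestamp forward; otherwise
            # a later flush_time derived from it could cover unreplayed data.
            self.last_replay = now
        return all_replayed
```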
Nadav Har'El	5b78e1cebe	test/alternator: tests for ExclusiveStartKey in GSI After in the previous patches we added more exhaustive testing for the ExclusiveStartKey feature of Query and Scan, in this patch we add tests for this feature in the context of GSIs. Most interestingly, the ExclusiveStartKey when querying a GSI isn't just the key of the GSI, but also includes the key columns of the base - in other words, it is the key that Scylla uses for its materialized view. The tests here confirm that paging on GSI works - this paging uses ExclusiveStartKey of course - but also what is the specific structure and meaning of the content of ExclusiveStartKey. We also include two xfailing tests which again, like in the previous patches, show we don't do enough validation (issue #26988) and don't recognize wrong values or spurious columns in ExclusiveStartKey. As usual, all new tests pass on DynamoDB, and all except the xfailing ones pass on Alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-17 22:07:28 +02:00
Nadav Har'El	65b364d94a	test/alternator: more tests for ExclusiveStartKey in Scan In the previous patch we added more tests for ExclusiveStartKey in the context of the "Query" request. Here we do a similar thing for "Scan". There are fewer error cases for Scan. In particular, while it didn't make sense to use ExclusiveStartKey on a Query on a table without a sort key (since a Query there always returns a single item), for Scan it's needed - for paging. So we add in this patch a test (that we didn't have before!) that Scan paging works correctly also in the case of a table without a sort key. This patch has one xfailing test reproducing #26988, that we don't recognize and refuse spurious columns (columns not in the key) in ExclusiveStartKey. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-17 22:07:28 +02:00
Nadav Har'El	c049992a93	test/alternator: more tests for ExclusiveStartKey in Query We already have in test/alternator/test_query.py a test - test_query_exclusivestartkey - for one successful use of ExclusiveStartKey. But we missed testing quite a few edge cases of this parameter, so this patch adds more tests for it - see the comments on each individual test explaining its purpose. With the new tests, we actually identified three cases where we got the error handling wrong - cases of ExclusiveStartKey which DynamoDB refuses, but Alternator allows. So three of the tests included here pass on DynamoDB but fail on Alternator, and are marked with "xfail". Refs #26988 - which is a new issue about these three cases of missing validation. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-17 22:07:27 +02:00
Andrzej Jackowski	35fd603acd	test: wait for read_barrier in wait_until_driver_service_level_created Previously, `wait_until_driver_service_level_created` only waited for the `driver` service level to appear in the output of `LIST ALL SERVICE_LEVELS`. However, the fact that one node lists `sl:driver` does not necessarily mean that all other nodes can see it yet. This caused sporadic test failures, especially in DEBUG builds. To prevent these failures, this change adds an extra wait for a `raft/read_barrier` after the `driver` service level first appears. This ensures the service level is globally visible across the cluster. Fixes: scylladb/scylladb#27019	2025-11-17 15:21:28 +01:00
Andrzej Jackowski	39bfad48cc	test: use ManagerClient in wait_until_driver_service_level_created Pass a ManagerClient instead of a `cql` session to `wait_until_driver_service_level_created`. This makes it easier to add additional functionality to the helper later (e.g. waiting for a Raft read barrier in a subsequent commit). Refs: scylladb/scylladb#27019	2025-11-17 14:55:14 +01:00
Botond Dénes	d54d409a52	Merge 'audit: write out to both table and syslog' from Dario Mirovic This patch adds support for multiple audit log outputs. If only one audit log output is enabled, the behavior does not change. If multiple audit log outputs are enabled, then the `audit_composite_storage_helper` class is used. It has a collection of `storage_helper` objects. Performance testing shows that read query throughput and auth request throughput are consistent even at high reactor utilization. It can also be observed that read query latency increases a bit. Read query ops = 60k/s AUTH ops = 200/s \| Audit Mode \| QUERY latency (p99) \| Δ% vs none \| \|------------\|---------------------\|------------\| \| none \| 777 \| 0 \| \|table\| 801 \| +3.09% \| \|syslog \| 803 \| +3.35% \| \|table,syslog \| 818 \| +5.28% \| Read query ops = 50k/s AUTH ops = 200/s \| Audit Mode \| QUERY latency (p99) \| Δ% vs none \| \|------------\|---------------------\|------------\| \| none \| 643 \| 0 \| \|table\| 647 \| +0.62% \| \|syslog \| 648 \| +0.78% \| \|table,syslog \| 656 \| +2.02% \| Detailed performance results are in the following Confluence document: [Audit performance impact test](https://scylladb.atlassian.net/wiki/spaces/RND/pages/148308005/Audit+performance+impact+test) Fixes #26022 Backport: The decision is to not backport for now. After making sure it works on the latest release, and if there is a need, we can do it. Closes scylladb/scylladb#26613 * github.com:scylladb/scylladb: test: dtest: audit_test.py: add AuditBackendComposite test: dtest: audit_test.py: group logs in dict per audit mode audit: write out to both table and syslog audit: move storage helper creation from `audit::start` to `audit::audit` audit: fix formatting in `audit::start_audit` audit: unify `create_audit` and `start_audit`	2025-11-17 15:04:15 +02:00
Gleb Natapov	0f0ab11311	cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster `97ab3f6622` changed "nodetool cleanup" (without arguments) to run cleanup on all dirty nodes in the cluster. This was somewhat unexpected, so this patch changes it back to run cleanup on the target node only (and reset "cleanup needed" flag afterwards) and it adds "nodetool cluster cleanup" command that runs the cleanup on all dirty nodes in the cluster.	2025-11-17 15:00:51 +02:00
Piotr Dulikowski	c29efa2cdb	Merge 'vector_search: Improve vector-store health checking' from Karol Nowacki A Vector Store node is now considered down if it returns an HTTP 500 server error. This can happen, for example, if the node fails to connect to the database or has not completed its initial full scan. The logic for marking a node as 'up' is also enhanced. A node is now only considered up when its status is explicitly 'SERVING'. Fixes: VECTOR-187 Backport to 2025.4 as this feature is expected to be available in 2025.4. Closes scylladb/scylladb#26413 * github.com:scylladb/scylladb: vector_search: Improve vector-store health checking vector_search: Move response_content_to_sstring to utils.hh vector_search: Add unit tests for client error handling vector_search: Enable mocking of status requests vector_search: Extract abort_source_timeout and repeat_until vector_search: Move vs_mock_server to dedicated files	2025-11-17 12:16:07 +01:00
Dawid Mędrek	0602afc085	cdc: Preserve properties when reattaching log table When we enable CDC on a table, Scylla creates a log table for it. It has default properties, but the user may change them later on. Furthermore, it's possible to detach that log table by simply disabling CDC on the base table: ```cql /* Create a table with CDC enabled. The log table is created. */ CREATE TABLE ks.t (pk int PRIMARY KEY) WITH cdc = {'enabled': true}; /* Detach the log table. */ ALTER TABLE ks.t WITH cdc = {'enabled': false}; /* Modify a property of the log table. */ ALTER TABLE ks.t_scylla_cdc_log WITH bloom_filter_fp_chance = 0.13; ``` The log table can also be reattached by enabling CDC on the base table again: ```cql /* Reattach the log table */ ALTER TABLE ks.t WITH cdc = {'enabled': true}; ``` However, because the process of reattachment goes through the same code that created it in the first place, the properties of the log table are rolled back to their default values. This may be confusing to the user and, if unnoticed, also have other consequences, e.g. affecting performance. To prevent that, we ensure that the properties are preserved. A reproducer test, `test_log_table_preserves_properties_after_reattachment`, has been provided to verify that the changes are correct. It fails before this commit. Another test, `test_log_table_preserves_id_after_reattachment`, has also been added because the current implementation sets properties and the ID separately. Fixes scylladb/scylladb#25523	2025-11-17 11:56:30 +01:00
Dawid Mędrek	10975bf65c	cdc: Extract creating columns in CDC log table to dedicated function We extract the portion of the code responsible for creating columns in a CDC log table to a separate, dedicated function. This should improve the overall readability of the function (and also makes it very short now).	2025-11-17 11:54:48 +01:00
Dawid Mędrek	8bf09ac6f7	cdc: Extract default properties of CDC log tables to dedicated function We extract the portion of the code responsible for setting the default properties of a CDC log table to a separate function. This should improve the overall readability of the function. Also, it should be helpful when modifying the code later on in this commit series.	2025-11-17 11:50:35 +01:00
Dawid Mędrek	991c0f6e6d	schema/schema_builder.hh: Add set_properties We add a method used for overwriting the properties of a schema. It will be used to create a new schema based on another.	2025-11-17 11:46:32 +01:00
Dawid Mędrek	76b21d7a5a	schema: Add getter for schema::user_properties The getter will be used later to access the user properties and copy them to a fresh `schema_builder`.	2025-11-17 11:46:24 +01:00
Dawid Mędrek	3856c9d376	schema: Remove underscores in fields of schema::user_properties The fields are public, so according to the style guide, they should not start with an underscore.	2025-11-17 11:46:15 +01:00
Dawid Mędrek	5a0fddc9ee	schema: Extract user properties out of raw_schema The properties can be directly manipulated by the user via statements like `ALTER TABLE`. To better organize the structure of `raw_schema`, we encapsulate that data in the form of a dedicated struct. This change will be later used for applying multiple properties to `schema_builder` in one go.	2025-11-17 11:46:07 +01:00
Patryk Jędrzejczak	b5f38e4590	docs/dev/topology-over-raft: update the recovery section We have the new recovery procedure now, but this doc hasn't been updated. It still describes the old recovery procedure. For comparison, external docs can be found here: https://docs.scylladb.com/manual/master/troubleshooting/handling-node-failures.html#manual-recovery-procedure Fixes #26412	2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak	785a3302e6	docs/dev/topology-over-raft: document zero-token nodes The topology transitions are a bit different for zero-token nodes, which is worth mentioning.	2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak	d75558e455	docs/dev/topology-over-raft: clarify the lack of tablet-specific states Tablets are never mentioned before this part of the doc, so it may be confusing why some topology states are missing.	2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak	c362ea4dcb	docs/dev/topology-over-raft: add the missing join_group0 state This state was added as a part of the join procedure, and we didn't update this part of the doc.	2025-11-17 10:40:23 +01:00
Patryk Jędrzejczak	182d416949	docs/dev/topology-over-raft: update the topology columns Some of the columns were added, but the doc wasn't updated. `upgrade_state` was updated in only one of the two places. `ignore_nodes` was changed to a static column.	2025-11-17 10:40:20 +01:00
Piotr Dulikowski	f0039381d2	Merge 'db/view/view_building_worker: support staging sstables intra-node migration and tablet merge' from Michał Jadwiszczak This PR fixes staging sstables handling by the view building coordinator in case of intra-node tablet migration or tablet merge. To support tablet merge, the worker stores the sstables grouped only by `table_id`, instead of by the `(table_id, last_token)` pair. There shouldn't be that many staging sstables, so selecting the relevant ones for each `process_staging` task is fine. For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to clean them up on the source shard. The patch should be backported to 2025.4 Fixes https://github.com/scylladb/scylladb/issues/26244 Closes scylladb/scylladb#26454 * github.com:scylladb/scylladb: service/storage_service: migrate staging sstables in view building worker during intra-node migration db/view/view_building_worker: support sstables intra-node migration db/view_building_worker: fix indent db/view/view_building_worker: don't organize staging sstables by last token	2025-11-17 08:53:19 +01:00
Karol Nowacki	7f45f15237	vector_search: Improve vector-store health checking A Vector Store node is now considered down if it returns an HTTP 5xx status. This can happen, for example, if the node fails to connect to the database or has not completed its initial full scan. The logic for marking a node as 'up' is also enhanced. A node is now only considered up when its status is 'SERVING'.	2025-11-17 06:21:31 +01:00
Karol Nowacki	5c30994bc5	vector_search: Move response_content_to_sstring to utils.hh Move the response_content_to_sstring utility function from vector_store_client.cc to utils.hh to enable reuse across multiple files. This refactoring prepares for the upcoming `client.cc` implementation that will also need this functionality.	2025-11-17 06:21:31 +01:00
Karol Nowacki	4bbba099d7	vector_search: Add unit tests for client error handling Introduce dedicated unit tests for the client class to verify existing functionality and serve as regression tests. These tests ensure that invalid client requests do not cause nodes to be marked as down.	2025-11-17 06:21:31 +01:00
Karol Nowacki	cb654d2286	vector_search: Enable mocking of status requests Extend the mock server to allow inspecting incoming status requests and configuring their responses. This enables client unit tests to simulate various server behaviors, such as handling node failures and backoff logic.	2025-11-17 06:21:31 +01:00
Karol Nowacki	f665564537	vector_search: Extract abort_source_timeout and repeat_until The `abort_source_timeout` and `repeat_until` functions are moved to the shared utility header `test/vector_search/utils.hh`. This allows them to be reused by upcoming `client` unit tests, avoiding code duplication.	2025-11-17 06:21:31 +01:00
Karol Nowacki	ee3b83c9b0	vector_search: Move vs_mock_server to dedicated files The mock server utility is extracted into its own files so it can be reused by future `client` unit tests.	2025-11-17 06:21:30 +01:00
Artsiom Mishuta	696596a9ef	test.py: shutdown ManagerClient only in current loop Python 3.14 has a stricter policy regarding asyncio loops. As a result, we cannot close clients from different loops. This change ensures that we close only clients that belong to the current loop. Closes scylladb/scylladb#26911	2025-11-16 19:19:46 +02:00
Jenkins Promoter	3672715211	Update pgo profiles - x86_64	2025-11-16 11:42:41 +02:00
Jenkins Promoter	41933b3f5d	Update pgo profiles - aarch64	2025-11-15 05:27:38 +02:00
Pavel Emelyanov	9cb776dee8	sstables_manager: Drop db::config from sstables_manager Now it has all it needs via its own specific config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	d55044b696	tools/sstable: Make shard_of_with_tablets use db::config argument Its caller, the shard_of_operation, already has it as argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	2ec3303edd	tools/sstable: Add db::config& to all operations It's not extremely elegant, but one tool operation needs db::config -- the "shard of" one. Currently it gets one from sstables_manager, but the manager is going to stop using db::config, and the operation needs to get it some other way. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	0fede18447	tools/sstable: Get endpoints from storage manager The tool may open sstables on S3. For that it gets configured endpoints with the help of db::config obtained from sstables_manager.db_config(). However, storage endpoints are maintained by sstables storage manager, and since tool has this instance, it's better to use storage manager to get list of endpoints. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	675eb3be98	sstables_manager: Hold sstable IO extensions on it Currently manager holds a reference on db::config and when sstables IO extensions are needed it grabs them from this config. Since db::config is going to be removed from sstables manager, it should either keep track of all config extensions, or only those that it needs. This patch makes the latter choice and keeps reference to sstable_file_io_ext. on manager. The reference is passed as constructor argument, not via manager config, but it's a random choice, no specific reason why not putting it on config itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	c853197281	sstables: Manager helper to grab file io extensions Currently all the code that needs to iterate over sstables extensions get config from manager, extensions from it and then iterate. Add a helper that returns extensions directly. No real changes, just a helper. Next patch will change the way the helper works. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	9868341c73	sstables_manager: Move default format on config It's explicitly `me` type by default, but places that can write sstables override it with db::config value: replica::database, tests and scylla sstable tool. Live-updateable, so use updateable_value<> type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	e6dee8aab5	sstables_manager: Move enable_sstable_data_integrity_check on config Set its default value to the one from db/config.cc. Only replica::database may want to re-configure it. Also not live-updateable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	78ab31118e	sstables_manager: Move data_file_directories on config Make it a reference, so all the code that configures it is updated to provide the target. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	cb1679d299	sstables_manager: Move components_memory_reclaim_threshold on config Set its default value to the one from db/config.cc. Only the replica::database and tests may want to re-configure it. This one is live-updateable, so use updateable_value<> type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:42 +03:00
Botond Dénes	8579e20bd1	Merge 'Enable digest+checksum verification for streaming/repair' from Taras Veretilnyk This PR enables integrity check of both checksum and digest for repair/streaming. In the past, streaming readers only verified the checksum of compressed SSTables. This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest. If the reader range doesn't cover the full SSTable, the digest is not loaded and the check is skipped. To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression. Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt the digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`. Backport is not required, it is an improvement Refs #21776 Closes scylladb/scylladb#26444 * github.com:scylladb/scylladb: boost/repair_test: add repair reader integrity verification test cases test/lib: allow to disable compression in random_mutation_generator sstables: Skip checksum and digest reads for unlinked SSTables table: enable integrity checks for streaming reader table: Add integrity option to table::make_sstable_reader() sstables: Add integrity option to create_single_key_sstable_reader	2025-11-14 18:00:33 +02:00
Benny Halevy	f9ce98384a	scylla-sstable: correctly dump sharding_metadata This patch fixes two issues in one go: First, sstables::load currently clears the sharding metadata (via open_data()), and so scylla-sstable always prints an empty array for it. Second, printing token values would generate invalid json as they are currently printed as binary bytes, and they should be printed simply as numbers, as we do elsewhere, for example, for the first and last keys. Fixes #26982 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26991	2025-11-14 17:55:41 +02:00
Aleksandra Martyniuk	e3dcb7e827	test: extend test_batchlog_replay_failure_during_repair Modify test_batchlog_replay_failure_during_repair to also check that there isn't data resurrection if flushing hints falls within the repair cache timeout.	2025-11-14 14:18:07 +01:00
Pavel Emelyanov	1c9c4c8c8c	Merge 'service: attach storage_service to migration_manager using pluggable' from Marcin Maliszkiewicz Migration manager depends on storage service. For instance, it has a reload_schema_in_bg background task which calls _ss.local() so it expects that storage service is not stopped before it stops. To solve this we use permit approach, and during storage_service stop: - we ignore new code execution in migration_manager which'd use storage_service - but wait with storage_service shutdown until all existing executions are done Fixes scylladb/scylladb#26734 Backport: no need, problem existed since very long time, code restructure in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made it hitting more often, as _ss was called earlier, but it's not released yet. Closes scylladb/scylladb#26779 * github.com:scylladb/scylladb: service: attach storage_service to migration_manager using pluggabe service: migration_manager: corutinize merge_schema_from service: migration_manager: corutinize reload_schema	2025-11-14 15:14:28 +03:00
Piotr Dulikowski	2ccc94c496	Merge 'topology_coordinator: include joining node in barrier' from Michael Litvak Previously, only nodes in the 'normal' state and decommissioning nodes were included in the set of nodes participating in barrier and barrier_and_drain commands. Joining nodes are not included because they don't coordinate requests, given their cql port is closed. However, joining nodes may receive mutations from other nodes, for which they may generate and coordinate materialized view updates. If their group0 state is not synchronized it could cause lost view updates. For example: 1. On the topology coordinator, the join completes and the joining node becomes normal, but the joining node's state lags behind. Since it's not synchronized by the barrier, it could be in an old state such as `write_both_read_old`. 2. A normal node coordinates a write and sends it to the new node as the new replica. 3. The new node applies the base mutation but doesn't generate a view update for it, because it calculates the base-view pairing according to its own state and replication map, and determines that it doesn't participate in the base-view pairing. Therefore, since the joining node participates as a coordinator for view updates, it should be included in these barriers as well. This ensures that before the join completes, the joining node's state is `write_both_read_new`, where it does generate view updates. Fixes https://github.com/scylladb/scylladb/issues/26976 backport to previous versions since it fixes a bug in MV with vnodes Closes scylladb/scylladb#27008 * github.com:scylladb/scylladb: test: add mv write during node join test topology_coordinator: include joining node in barrier	2025-11-14 12:41:16 +01:00
Pavel Emelyanov	604e5b6727	sstables_manager: Move column_index_auto_scale_threshold on config Set its default value to the one from db/config.cc. Only the replica::database may want to re-configure it. This one is live-updateable, so use updateable_value<> type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:30:49 +03:00
Pavel Emelyanov	8f9f92728e	sstables_manager: Move column_index_size on config Set its default value to the one from db/config.cc. Only replica::database may want to re-configure it. Also not live-updateable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:30:28 +03:00
Pavel Emelyanov	88bb203c9c	sstables_manager: Move sstable_summary_ratio on config Set its default value to the one from db/config.cc. Only replica::database may want to re-configure it. Also not live-updateable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:29:34 +03:00
Pavel Emelyanov	1f6918be3f	sstables_manager: Move enable_sstable_key_validation on config Make it OFF by default and update only those callers, that may have it ON -- the replica::database, tests and scylla-sstable tool. Also not live-updateable, so plain bool. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:28:14 +03:00
Pavel Emelyanov	79d0f93693	sstables_manager: Move available_memory on config Currently, this parameter is passed to sstables_manager as explicit constructor argument. Also, it's not live-updateable, so a plain size_t type for it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:27:14 +03:00
Pavel Emelyanov	218916e7c2	code: Introduce sstables_manager::config This is specific configuration for sstables_manager. All places that construct sstables manager are updated to provide config to it. For now the config is empty and exists alongside with db::config. Further patches will populate the former config with data and the latter config will be eventually removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:25:18 +03:00
Pavel Emelyanov	004ba32fa5	sstables: Patch get_local_directories() to work on vector of paths Now it uses db::config. Next patches will eliminate db::config from this code and the helper in question will need to get datadir names explicitly. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:24:04 +03:00
Pavel Emelyanov	1895d85ed2	code: Rename sstables_manager::config() into db_config() The config() method name is going to return sstables_manager config, so first need to set this name free. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:23:08 +03:00
Patryk Jędrzejczak	1141342c4f	Merge 'topology: refactor excluded nodes' from Petr Gusev This PR refactors excluded nodes handling for tablets and topology. For tablets a dedicated variable `topology::excluded_tablet_nodes` is introduced, for topology operations a method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`. The PR improves code readability and efficiency, no behavior changes. backport: this is a refactoring/optimization, no need to backport Closes scylladb/scylladb#26907 * https://github.com/scylladb/scylladb: topology_coordinator: drop unused exec_global_command overload topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request topology_state_machine: inline get_excluded_nodes messaging_service: simplify and optimize ban_host storage_service: topology_state_load: extract topology variable topology_coordinator: excluded_tablet_nodes -> ignored_nodes topology_state_machine: add excluded_tablet_nodes field	2025-11-14 11:52:00 +01:00
Piotr Dulikowski	68407a09ed	Merge 'vector_store_client: Add support for failed-node backoff' from Karol Nowacki vector_search: Add backoff for failed nodes Introduces logic to mark nodes that fail to answer an ANN request as "down". Down nodes are omitted from further requests until they successfully respond to a health check. Health checks for down nodes are performed in the background using the `status` endpoint, with an exponential backoff retry policy ranging from 100ms to 20s. Client list management is moved to separate files (clients.cc/clients.hh) to improve code organization and modularity. References: VECTOR-187. Backport to 2025.4 as this feature is expected to be available in 2025.4. Closes scylladb/scylladb#26308 * github.com:scylladb/scylladb: vector_search: Set max backoff delay to 2x read request timeout vector_search: Report status check exception via on_internal_error_noexcept vector_search: Extract client management into dedicated class vector_search: Add backoff for failed clients vector_search: Make endpoint available vector_search: Use std::expected for low-level client errors vector_search: Extract client class	2025-11-14 11:49:18 +01:00
Piotr Dulikowski	833b824905	Merge 'service/qos: Fall back to default scheduling group when using maintenance socket' from Dawid Mędrek The service level controller relies on `auth::service` to collect information about roles and the relation between them and the service levels (those attached to them). Unfortunately, the service level controller is initialized way earlier than `auth::service` and so we had to prevent potential invalid queries of user service levels (cf. `46193f5e79`). Unfortunately, that came at a price: it made the maintenance socket incompatible with the current implementation of the service level controller. The maintenance socket starts early, before the `auth::service` is fully initialized and registered, and is exposed almost immediately. If the user attempts to connect to Scylla within this time window, via the maintenance socket, one of the things that will happen is choosing the right service level for the connection. Since the `auth::service` is not registered, Scylla will fail an assertion and crash. A similar scenario occurs when using maintenance mode. The maintenance socket is how the user communicates with the database, and we're not prepared for that either. To avoid unnecessary crashes, we add new branches if the passed user is absent or if it corresponds to the anonymous role. Since the role corresponding to a connection via the maintenance socket is the anonymous role, that solves the problem. Some accesses to `auth::service` are not affected and we do not modify those. Fixes scylladb/scylladb#26816 Backport: yes. This is a fix of a regression. Closes scylladb/scylladb#26856 * github.com:scylladb/scylladb: test/cluster/test_maintenance_mode.py: Wait for initialization test: Disable maintenance mode correctly in test_maintenance_mode.py test: Fix keyspace in test_maintenance_mode.py service/qos: Do not crash Scylla if auth_integration absent	2025-11-14 11:12:28 +01:00
Botond Dénes	43738298be	db/config: don't use RBNO for scaling Remove bootstrap and decommission from allowed_repair_based_node_ops. Using RBNO over streaming for these operations has no benefits, as they are not exposed to the out-of-date replica problem that replace, removenode and rebuild are. On top of that, RBNO is known to have problems with empty user tables. Using streaming for bootstrap and decommission is safe and faster than RBNO in all conditions, especially when the table is small. One test needs adjustment as it relies on RBNO being used for all node ops. Fixes: #24664 Closes scylladb/scylladb#26330	2025-11-14 13:03:50 +03:00
Piotr Dulikowski	43506e5f28	Merge 'db/view: Add backoff when RPC fails' from Dawid Mędrek The view building coordinator manages the process by sending RPC requests to all nodes in the cluster, instructing them what to do. If processing that message fails, the coordinator decides if it wants to retry it or (temporarily) abandon the work. An example of the latter scenario could be if one of the target nodes dies and any attempts to communicate with it would fail. Unfortunately, the current approach to it is not perfect and may result in a storm of warnings, effectively clogging the logs. As an example, take a look at scylladb/scylladb#26686: the gossiper failed to mark one of the dead nodes as DOWN fast enough, and it resulted in a warning storm. To prevent situations like that, we implement a form of backoff. If processing an RPC message fails, we postpone finishing the task for a second. That should reduce the number of messages in the logs and avoid retries that are likely to fail as well. We provide a reproducer test. Fixes scylladb/scylladb#26686 Backport: impact on the user. We should backport it to 2025.4. Closes scylladb/scylladb#26729 * github.com:scylladb/scylladb: tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc db/view/view_building_coordinator: Rate limit logging failed RPC db/view: Add backoff when RPC fails	2025-11-14 10:17:57 +01:00
Piotr Dulikowski	308c5d0563	Merge 'cdc: set column drop timestamp in the future' from Michael Litvak When dropping a column from a CDC log table, set the column drop timestamp several seconds into the future. If a value is written to a column concurrently with dropping that column, the value's timestamp may be after the column drop timestamp. If this value is also flushed to an SSTable, the SSTable would be corrupted, because it considers the column missing after the drop timestamp and doesn't allow values for it. While this issue affects general tables, it especially impacts CDC tables because this scenario can occur when writing to a table with CDC preimage enabled while dropping a column from the base table. This happens even if the base mutation doesn't write to the dropped column, because CDC log mutations can generate values for a column even if the base mutation doesn't. For general tables, this issue can be avoided by simply not writing to a column while dropping it. We fix this for the more problematic case of CDC log tables by setting the column drop timestamp several seconds into the future, ensuring that writes concurrent with column drops are much less likely to have timestamps greater than the column drop timestamp. Fixes https://github.com/scylladb/scylladb/issues/26340 the issue affects all previous releases, backport to improve stability Closes scylladb/scylladb#26533 * github.com:scylladb/scylladb: test: test concurrent writes with column drop with cdc preimage cdc: check if recreating a column too soon cdc: set column drop timestamp in the future migration_manager: pass timestamp to pre_create	2025-11-14 08:52:34 +01:00
Marcin Maliszkiewicz	958d04c349	service: attach storage_service to migration_manager using pluggabe Migration manager depends on storage service. For instance, it has a reload_schema_in_bg background task which calls _ss.local() so it expects that storage service is not stopped before it stops. To solve this we use permit approach, and during storage_service stop: - we ignore new code execution in migration_manager which'd use storage_service - but wait with storage_service shutdown until all existing executions are done Fixes scylladb/scylladb#26734	2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz	cf9b2de18b	service: migration_manager: corutinize merge_schema_from It's needed to easily keep-alive pluggable storage_service permit in a following commit.	2025-11-14 08:50:19 +01:00
Marcin Maliszkiewicz	5241e9476f	service: migration_manager: corutinize reload_schema It's needed to easily keep-alive pluggable storage_service permit in a following commit.	2025-11-14 08:50:18 +01:00
Tomasz Grabiec	27e74fa567	tools: scylla-sstable: Print filename and tablet ids on error Since the error is not printed to stdout, when working with multiple files, we don't know which sstable the error is associated with. Closes scylladb/scylladb#27009	2025-11-14 09:47:38 +02:00
Karol Nowacki	1972fb315b	vector_search: Set max backoff delay to 2x read request timeout The maximum backoff delay for status checking now depends on the `read_request_timeout_in_ms` configuration option. The delay is set to twice the value of this parameter.	2025-11-14 08:05:21 +01:00
Karol Nowacki	097c0f9592	vector_search: Report status check exception via on_internal_error_noexcept This exception should only occur due to internal errors, not client or external issues. If triggered, it indicates an internal problem. Therefore, we notify about this exception using on_internal_error_noexcept.	2025-11-14 08:05:21 +01:00
Karol Nowacki	940ed239b2	vector_search: Extract client management into dedicated class Refactor client list management by moving it to separate files (clients.cc/clients.hh) to improve code organization and modularity.	2025-11-14 08:05:21 +01:00
Karol Nowacki	009d3ea278	vector_search: Add backoff for failed clients Introduces logic to mark clients that fail to answer an ANN request as "down". Down clients are omitted from further requests until they successfully respond to a health check. Health checks for down clients are performed in the background using the `status` endpoint, with an exponential backoff retry policy ranging from 100ms to 20s.	2025-11-14 07:38:01 +01:00
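A minimal sketch of the health-check retry schedule described in the commit above, written in Python for illustration only. The doubling factor is an assumption (the message says only "exponential"), and `backoff_delays` is a hypothetical name; the actual logic lives in the C++ vector_search code and may differ.

```python
from itertools import islice
from typing import Iterator


def backoff_delays(initial: float = 0.1,
                   maximum: float = 20.0,
                   factor: float = 2.0) -> Iterator[float]:
    """Yield health-check delays in seconds, growing exponentially up to a cap.

    initial=0.1 and maximum=20.0 mirror the 100ms..20s range from the commit
    message; factor=2.0 is an assumption made for this sketch.
    """
    delay = initial
    while True:
        yield min(delay, maximum)
        delay = min(delay * factor, maximum)


# First ten delays a "down" client would wait between status checks.
first_delays = list(islice(backoff_delays(), 10))
```

Once the cap is reached, every later delay stays at `maximum`, so a client that is down for a long time is probed at a steady interval rather than ever more rarely.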
Karol Nowacki	190459aefa	vector_search: Make endpoint available In preparation for a new feature, the tests need the ability to make an endpoint that was previously unavailable, available again. This is achieved by adding an `unavailable_server::take_socket` method. This method allows transferring the listening socket from the `unavailable_server` to the `mock_vs_server`, ensuring they both operate on the same endpoint.	2025-11-14 07:23:40 +01:00
Karol Nowacki	49a177b51e	vector_search: Use std::expected for low-level client errors To unify error handling, the low-level client methods now return `std::expected` instead of throwing exceptions. This allows for consistent and explicit error propagation from the client up to the caller. The relevant error types have been moved to a new `vector_search/error.hh` header to centralize their definitions.	2025-11-14 07:23:40 +01:00
Karol Nowacki	62f8b26bd7	vector_search: Extract client class This refactoring extracts low-level client logic into a new, dedicated `client` class. The new class is responsible for connecting to the server and serializing requests. This change prepares for extending the `vector_store_client` to check node status via the `api/v1/status` endpoint. `/ann` Response deserialization remains in the `vector_store_client` as it is schema-dependent.	2025-11-14 07:23:40 +01:00
Lakshmi Narayanan Sreethar	3eba90041f	sstables: prevent oversized allocation when parsing summary positions During sstable summary parsing, the entire header was read into a single buffer upfront and then parsed to obtain the positions. If the header was too large, it could trigger oversized allocation warnings. This commit updates the parse method to read one position at a time from the input stream instead of reading the entire header at once. Since `random_access_reader` already maintains an internal buffer of 128 KB, there is no need to pre read the entire header upfront. Fixes #24428 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26846	2025-11-14 06:40:53 +02:00
Dawid Mędrek	393f1ca6e6	tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc After the changes in the test, we clean up its syntax. It boils down to very simple modifications.	2025-11-13 17:57:33 +01:00
Dawid Mędrek	acd9120181	db/view/view_building_coordinator: Rate limit logging failed RPC The view building coordinator sends tasks in the form of RPC messages to other nodes in the cluster. If processing that RPC fails, the coordinator logs the error. However, since tasks are per replica (so per shard), it may happen that we end up with a large number of similar messages, e.g. if the target node has died, because every shard will fail to process its RPC message. It might become even worse in the case of a network partition. To mitigate that, we rate limit the logging with a 1-second interval. We extend the test `test_backoff_when_node_fails_task_rpc` so that it allows the view building coordinator to have multiple tablet replica targets. If not for rate limiting the warning messages, we should start getting more of them, potentially leading to a test failure.	2025-11-13 17:57:23 +01:00
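The rate limiting described in the commit above can be sketched roughly as follows. This is an illustration in Python, not the coordinator's C++ code: `RateLimitedLogger`, its `suppressed` counter, and the injectable `clock` are all hypothetical names introduced for this example.

```python
import time
from typing import Callable, Optional


class RateLimitedLogger:
    """Emit at most one warning per `min_interval` seconds; count the rest."""

    def __init__(self, min_interval: float = 1.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._min_interval = min_interval
        self._clock = clock
        self._last_emit: Optional[float] = None
        self.suppressed = 0

    def warn(self, message: str) -> bool:
        """Log the message unless one was logged too recently.

        Returns True if the message was emitted, False if it was suppressed.
        """
        now = self._clock()
        if self._last_emit is not None and now - self._last_emit < self._min_interval:
            self.suppressed += 1
            return False
        self._last_emit = now
        print(f"WARN: {message}")
        return True
```

With many shards failing the same RPC at once, only the first failure in each interval reaches the log, which is the effect the extended test relies on.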
Dawid Mędrek	4a5b1ab40a	db/view: Add backoff when RPC fails The view building coordinator manages the process of view building by sending RPC requests to all nodes in the cluster, instructing them what to do. If processing that message fails, the coordinator decides if it wants to retry it or (temporarily) abandon the work. An example of the latter scenario could be if one of the target nodes dies and any attempts to communicate with it would fail. Unfortunately, the current approach to it is not perfect and may result in a storm of warnings, effectively clogging the logs. As an example, take a look at scylladb/scylladb#26686: the gossiper failed to mark one of the dead nodes as DOWN fast enough, and it resulted in a warning storm. To prevent situations like that, we implement a form of backoff. If processing an RPC message fails, we postpone finishing the task for a second. That should reduce the number of messages in the logs and avoid retries that are likely to fail as well. We provide a reproducer test: it fails before this commit and succeeds with it. Fixes scylladb/scylladb#26686	2025-11-13 17:55:41 +01:00
Michał Hudobski	7646dde25b	select_statement: add a warning about unsupported paging for vs queries Currently we do not support paging for vector search queries. When we get such a query with paging enabled we ignore the paging and return the entire result. This behavior can be confusing for users, as there is no warning about paging not working with vector search. This patch fixes that by adding a warning to the result of ANN queries with paging enabled. Closes scylladb/scylladb#26384	2025-11-13 18:47:05 +02:00
Michael Litvak	e85051068d	test: test concurrent writes with column drop with cdc preimage add a test that writes to a table concurrently with dropping a column, where the table has CDC enabled with preimage. the test reproduces issue #26340 where this results in a malformed sstable.	2025-11-13 17:00:08 +01:00
Michael Litvak	039323d889	cdc: check if recreating a column too soon When we drop a column from a CDC log table, we set the column drop timestamp a few seconds into the future. This can cause unexpected problems if a user tries to recreate a CDC column too soon, before the drop timestamp has passed. To prevent this issue, when creating a CDC column we check its creation timestamp against the existing drop timestamp, if any, and fail with an informative error if the recreation attempt is too soon.	2025-11-13 17:00:07 +01:00
Michael Litvak	48298e38ab	cdc: set column drop timestamp in the future When dropping a column from a CDC log table, set the column drop timestamp several seconds into the future. If a value is written to a column concurrently with dropping that column, the value's timestamp may be after the column drop timestamp. If this value is also flushed to an SSTable, the SSTable would be corrupted, because it considers the column missing after the drop timestamp and doesn't allow values for it. While this issue affects general tables, it especially impacts CDC tables because this scenario can occur when writing to a table with CDC preimage enabled while dropping a column from the base table. This happens even if the base mutation doesn't write to the dropped column, because CDC log mutations can generate values for a column even if the base mutation doesn't. For general tables, this issue can be avoided by simply not writing to a column while dropping it. We fix this for the more problematic case of CDC log tables by setting the column drop timestamp several seconds into the future, ensuring that writes concurrent with column drops are much less likely to have timestamps greater than the column drop timestamp. Fixes scylladb/scylladb#26340	2025-11-13 16:59:43 +01:00
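A rough Python sketch of the idea behind the two cdc commits above: push the drop timestamp into the future, then refuse to recreate the column until that timestamp has passed. The 5-second margin is an assumption (the messages say only "several seconds"), and both function names are hypothetical.

```python
from typing import Optional

# Assumed margin in microseconds; the commit message says "several seconds".
DROP_MARGIN_US = 5_000_000


def column_drop_timestamp(now_us: int) -> int:
    """Return the drop timestamp to record for a CDC log column dropped now."""
    return now_us + DROP_MARGIN_US


def check_can_recreate(create_ts_us: int, drop_ts_us: Optional[int]) -> None:
    """Fail if a column is recreated before its earlier drop timestamp passed."""
    if drop_ts_us is not None and create_ts_us <= drop_ts_us:
        wait_s = (drop_ts_us - create_ts_us) / 1_000_000
        raise ValueError(
            f"column was dropped too recently; retry in about {wait_s:.1f}s")
```

A write that races with the drop gets a timestamp close to `now_us`, which is below the recorded drop timestamp, so the flushed SSTable never holds a value that is newer than the drop.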
Michael Litvak	eefae4cc4e	migration_manager: pass timestamp to pre_create pass the write timestamp as parameter to the on_pre_create_column_families notification.	2025-11-13 16:59:43 +01:00
Piotr Dulikowski	7f482c39eb	Merge '[schema] Speculative retry rounding fix' from Dario Mirovic This patch series re-enables support for speculative retry values `0` and `100`. These values have been supported some time ago, before [schema: fix issue 21825: add validation for PERCENTILE values in speculative_retry configuration. #21879 ](https://github.com/scylladb/scylladb/pull/21879). When that PR prevented using invalid `101PERCENTILE` values, valid `100PERCENTILE` and `0PERCENTILE` value were prevented too. Reproduction steps from [[Bug]: drop schema and all tables after apply speculative_retry = '99.99PERCENTILE' #26369](https://github.com/scylladb/scylladb/issues/26369) are unable to reproduce the issue after the fix. A test is added to make sure the inclusive border values `0` and `100` are supported. Documentation is updated to give more information to the users. It now states that these border values are inclusive, and also that the precision, with automatic rounding, is 1 decimal digit. Fixes #26369 This is a bug fix. If at any time a client tries to use value >= 99.5 and < 100, the raft error will happen. Backport is needed. The code which introduced inconsistency is introduced in 2025.2, so no backporting to 2025.1. Closes scylladb/scylladb#26909 * github.com:scylladb/scylladb: test: cqlpy: add test case for non-numeric PERCENTILE value schema: speculative_retry: update exception type for sstring ops docs: cql: ddl.rst: update speculative-retry-options test: cqlpy: add test for valid speculative_retry values schema: speculative_retry: allow 0 and 100 PERCENTILE values	2025-11-13 15:27:45 +01:00
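To make the inclusive bounds concrete, here is a small Python sketch of validating a `speculative_retry` PERCENTILE value. It is not the ScyllaDB parser: `parse_percentile` is a hypothetical helper, and validating the range before rounding to one decimal digit is an assumption about ordering.

```python
def parse_percentile(option: str) -> float:
    """Parse a value such as '99.9PERCENTILE' and return it rounded to 1 decimal.

    Accepts 0 and 100 inclusive; rejects anything outside that range or any
    non-numeric prefix.
    """
    suffix = "PERCENTILE"
    text = option.strip().upper()
    if not text.endswith(suffix):
        raise ValueError(f"not a PERCENTILE value: {option!r}")
    number = text[: -len(suffix)]
    try:
        value = float(number)
    except ValueError:
        raise ValueError(f"non-numeric PERCENTILE value: {option!r}") from None
    # NaN fails this comparison as well, so it is rejected here.
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"PERCENTILE value out of range [0, 100]: {option!r}")
    return round(value, 1)
```

Under these assumptions `100PERCENTILE` and `0PERCENTILE` are accepted, `101PERCENTILE` is rejected, and `99.99PERCENTILE` rounds up to the border value 100.0.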
Petr Gusev	d3bd8c924d	topology_coordinator: drop unused exec_global_command overload	2025-11-13 14:19:03 +01:00
Petr Gusev	45d1302066	topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request This method is specific to topology requests -- node joining, replacing, decommissioning etc, everything that goes through topology::transition_state::write_both_read_old and raft_topology_cmd::command::stream_ranges. It shouldn't be used in other contexts -- to handle global topology requests (e.g. truncate table) or for tablets. Rename the method to make this more explicit.	2025-11-13 14:19:03 +01:00
Petr Gusev	bf8cc5358b	topology_state_machine: inline get_excluded_nodes The method is specific to topology_coordinator, which already contains a wrapper for it, so inline the topology method into it. Also, make the logic of the method more explicit and remove multiple transition_nodes lookups.	2025-11-13 14:18:46 +01:00
Taras Veretilnyk	e7ceb13c3b	boost/repair_test: add repair reader integrity verification test cases Adds test cases to verify that repair_reader correctly detects SSTable (both compressed and uncompressed) checksum mismatches. Digest mismatch verification is not possible as the repair reader may skip some sstable data, which automatically disables digest verification. Each test corrupts the Data component on disk and ensures the reader throws a malformed_sstable_exception with the expected error message.	2025-11-13 14:08:33 +01:00
Taras Veretilnyk	554ce17769	test/lib: allow to disable compression in random_mutation_generator Adds a compress flag to random_mutation_generator, allowing tests to disable compression in generated mutations. When set to compress::no, the schema builder uses no_compression() parameters.	2025-11-13 14:08:33 +01:00
Taras Veretilnyk	add60d7576	sstables: Skip checksum and digest reads for unlinked SSTables Add an _unlinked flag to track SSTable unlink state and check it in read_digest() and read_checksum() methods to skip file reads for unlinked SSTables, preventing potential file not found errors.	2025-11-13 14:08:26 +01:00
Michael Litvak	b925e047be	test: add mv write during node join test Add a test that reproduces the issue scylladb/scylladb#26976. The test adds a new node with delayed group0 apply, and does writes with MV updates right after the join completes on the coordinator and while the joining node's state is behind. The test fails before fixing the issue and passes after.	2025-11-13 12:24:32 +01:00
Michael Litvak	13d94576e5	topology_coordinator: include joining node in barrier Previously, only nodes in the 'normal' state and decommissioning nodes were included in the set of nodes participating in barrier and barrier_and_drain commands. Joining nodes are not included because they don't coordinate requests, given their cql port is closed. However, joining nodes may receive mutations from other nodes, for which they may generate and coordinate materialized view updates. If their group0 state is not synchronized it could cause lost view updates. For example: 1. On the topology coordinator, the join completes and the joining node becomes normal, but the joining node's state lags behind. Since it's not synchronized by the barrier, it could be in an old state such as `write_both_read_old`. 2. A normal node coordinates a write and sends it to the new node as the new replica. 3. The new node applies the base mutation but doesn't generate a view update for it, because it calculates the base-view pairing according to its own state and replication map, and determines that it doesn't participate in the base-view pairing. Therefore, since the joining node participates as a coordinator for view updates, it should be included in these barriers as well. This ensures that before the join completes, the joining node's state is `write_both_read_new`, where it does generate view updates. Fixes scylladb/scylladb#26976	2025-11-13 12:24:31 +01:00
Michał Chojnowski	346e0f64e2	replica/table: add a metric for hypothetical total file size without compression This patch adds a per-table metric `scylla_column_family_total_disk_space_before_compression`, which measures the hypothetical total size of sstables on disk, if Data.db was replaced with an uncompressed equivalent.	2025-11-13 11:28:19 +01:00
Dawid Mędrek	b357c8278f	test/cluster/test_maintenance_mode.py: Wait for initialization If we try to perform queries too early, before the call to `storage_service::start_maintenance_mode` has finished, we will fail with the following error: ``` ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index! ``` To avoid that, we should wait until initialization is complete.	2025-11-13 11:07:45 +01:00
Aleksandra Martyniuk	4d0de1126f	db: batchlog_manager: update _last_replay only if all batches were replayed Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk data resurrection. Update _last_replay only if all batches were replayed. Fixes: https://github.com/scylladb/scylladb/issues/24415.	2025-11-13 10:40:19 +01:00
Piotr Dulikowski	2e5eb92f21	Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema. The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error. We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table. When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well. The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit. 
Fixes https://github.com/scylladb/scylladb/issues/26405 backport not needed - enhancement Closes scylladb/scylladb#24960 * github.com:scylladb/scylladb: test: cdc: test cdc compatible schema cdc: use compatiable cdc schema db: schema_applier: create schema with pointer to CDC schema db: schema_applier: extract cdc tables schema: add pointer to CDC schema schema_registry: remove base_info from global_schema_ptr schema_registry: use extended_frozen_schema in schema load schema_registry: replace frozen_schema+base_info with extended_frozen_schema frozen_schema: extract info from schema_ptr in the constructor frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema	2025-11-13 10:11:54 +01:00
Pavel Emelyanov	f47f2db710	Merge 'Support local primary-replica-only for native restore' from Robert Bindar This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with: - `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only - `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only. - `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself) - `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense. The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope. Fixes #26584 Closes scylladb/scylladb#26609 * github.com:scylladb/scylladb: Add cluster tests for checking scoped primary_replica_only streaming Improve choice distribution for primary replica Refactor cluster/object_store/test_backup nodetool restore: add primary-replica-only option nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only Enable scoped primary replica only streaming Support primary_replica_only for native restore API	2025-11-13 12:11:18 +03:00
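The scope rules listed in the merge message above can be summarized with a small Python lookup. This is only a restatement of the message under a hypothetical name, `primary_replica_target`; the real API validation is not shown in this log.

```python
def primary_replica_target(scope: str) -> str:
    """Describe where a restoring node streams when primary_replica_only=true."""
    targets = {
        "all": "global primary replica",
        "dc": "primary replica in the restoring node's datacenter",
        "rack": "primary replica in the restoring node's rack",
    }
    if scope == "node":
        # The node always streams only to itself, so the flag makes no sense.
        raise ValueError("primary_replica_only is not allowed with scope=node")
    if scope not in targets:
        raise ValueError(f"unknown scope: {scope!r}")
    return targets[scope]
```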
Michał Chojnowski	1cfce430f1	replica/table: keep track of total pre-compression file size Every table and sstable set keeps track of the total file size of contained sstables. Due to a feature request, we also want to keep track of the hypothetical file size if Data files were uncompressed, to add a metric that shows the compression ratio of sstables. We achieve this by replacing the relevant `uint_64 bytes_on_disk` counters everywhere with a struct that contains both the actual (post-compression) size and the hypothetical pre-compression size. This patch isn't supposed to change any observable behavior. In the next patch, we will use these changes to add a new metric.	2025-11-13 00:49:57 +01:00
Tomasz Grabiec	10b893dc27	Merge 'load_stats: fix bug in migrate_tablet_size()' from Ferenc Szili `topology_coordinator::migrate_tablet_size()` was introduced in `10f07fb95a`. It has a bug where the has_tablet_size() lambda always returns false because of a bad comparison of iterators after a table and tablet search: ``` if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) { if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) { ``` This change also fixes a problem where the `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats. This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method. A version containing this bug has not been released yet, so no backport is needed. Closes scylladb/scylladb#26946 * github.com:scylladb/scylladb: load_stats: add test for migrate_tablet_size() load_stats: fix problem with tablet size migration	2025-11-12 23:48:37 +01:00
Nadav Har'El	5839574294	Merge 'cql3: Fix std::bad_cast when deserializing vectors of collections' from Karol Nowacki cql3: Fix std::bad_cast when deserializing vectors of collections This PR fixes a bug where attempting to INSERT a vector containing collections (e.g., `vector<set<int>,1>`) would fail. On the client side, this manifested as a `ServerError: std::bad_cast`. The cause was "type slicing" issue in the reserialize_value function. When retrieving the vector's element type, the result was being assigned by value (using auto) instead of by reference. This "sliced" the polymorphic abstract_type object, stripping it of its actual derived type information. As a result, a subsequent dynamic_cast would fail, even if the underlying type was correct. To prevent this entire class of bugs from happening again, I've made the polymorphic base class `abstract_type` explicitly uncopyable. Fixes: #26704 This fix needs to be backported as these releases are affected: `2025.4` , `2025.3`. Closes scylladb/scylladb#26740 * github.com:scylladb/scylladb: cql3: Make abstract_type explicitly noncopyable cql3: Fix std::bad_cast when deserializing vectors of collections	2025-11-13 00:24:25 +02:00
Petr Gusev	9fed80c4be	messaging_service: simplify and optimize ban_host We do one cross-shard call for all left+ignored nodes.	2025-11-12 12:27:44 +01:00
Petr Gusev	52cccc999e	storage_service: topology_state_load: extract topology variable It's inconvenient to always write the long expression _topology_state_machine._topology.	2025-11-12 12:27:44 +01:00
Petr Gusev	66063f202b	topology_coordinator: excluded_tablet_nodes -> ignored_nodes ignored_nodes is sufficient in these cases. excluded_tablet_nodes also includes left_nodes_rs, which are not needed here — global_token_metadata_barrier runs the barrier only on normal and transition nodes, not on left nodes.	2025-11-12 12:27:44 +01:00
Petr Gusev	82da83d0e5	topology_state_machine: add excluded_tablet_nodes field The topology_coordinator::is_excluded() creates a temporary hash map for each call. This is probably not a performance problem since left_nodes_rs contains only those left nodes that are referenced from tablet replicas, this happens temporarily while e.g. a replaced node is being rebuilt. On the other hand, why not just have a dedicated field in the topology_state_machine, then this code wouldn't look suspicious.	2025-11-12 12:27:43 +01:00
Gleb Natapov	e872f9cb4e	cleanup: Add RESTful API to allow reset cleanup needed flag Cleaning up a node using per keyspace/table interface does not reset cleanup needed flag in the topology. The assumption was that running cleanup on already clean node does nothing and completes quickly. But due to https://github.com/scylladb/scylladb/issues/12215 (which is closed as WONTFIX) this is not the case. This patch provides the ability to reset the flag in the topology if operator cleaned up the node manually already.	2025-11-12 10:56:57 +02:00
Nadav Har'El	4de88a7fdc	test/cqlpy: fix run script for materialized views on tablets Recently we enabled tablets by default, but it is necessary to enable rf_rack_valid_keyspaces if materialized views are to be used with tablets, and this option is not the default. We did add this option in test/pylib/scylla_cluster.py which is used by test.py, but we didn't add it to test/cqlpy/run.py, so the test/cqlpy/run script is no longer able to run tests with materialized views. So this patch adds the missing configuration to run.py. Fixes #26918 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26919	2025-11-12 11:56:21 +03:00
Karol Nowacki	77da4517d2	cql3: Make abstract_type explicitly noncopyable The polymorphic abstract_type class serves as an interface and should not be copied. To prevent accidental and unsafe copies, make it explicitly uncopyable.	2025-11-12 09:11:56 +01:00
Karol Nowacki	960fe3da60	cql3: Fix std::bad_cast when deserializing vectors of collections When deserializing a vector whose elements are collections (e.g., set, list), the operation raises a `std::bad_cast` exception. This was caused by type slicing due to an incorrect assignment of a polymorphic type by value instead of by reference. This resulted in a failed `dynamic_cast` even when the underlying type was correct.	2025-11-12 09:11:56 +01:00
Botond Dénes	6f6ee5581e	Merge 'encryption::kms_host: Add exponential backoff-retry for 503 errors' from Calle Wilund Refs #26822 AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503:s (service unavailable) for both ec2 meta and actual endpoint, doing exponential backoff. Note: we do _not_ retry forever. Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe. Closes scylladb/scylladb#26934 * github.com:scylladb/scylladb: encryption::kms_host: Add exponential backoff-retry for 503 errors encryption::kms_host: Include http error code in kms_error	2025-11-12 08:33:33 +02:00
Yaron Kaikov	3ade3d8f5b	auto-backport: Add support for JIRA issue references - Added support for JIRA issue references in PR body and commit messages - Supports both short format (PKG-92) and full URL format - Maintains existing GitHub issue reference support - JIRA pattern matches https://scylladb.atlassian.net/browse/{PROJECT-ID} - Allows backporting for PRs that reference JIRA issues with 'fixes' keyword Fixes: https://github.com/scylladb/scylladb/issues/26955 Closes scylladb/scylladb#26954	2025-11-12 08:15:06 +02:00
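As an illustration of the two JIRA reference formats named in the commit above, here is a Python sketch of a matching pattern. The regular expression is an assumption written for this example, not the one used by the auto-backport script, and `jira_refs` is a hypothetical helper.

```python
import re
from typing import List

# Matches "Fixes PKG-92", "Fixes: PKG-92", and the full browse URL form.
# The optional group consumes the URL prefix so only the issue key is captured.
JIRA_REF = re.compile(
    r"\b[Ff]ixes:?\s+"
    r"(?:https://scylladb\.atlassian\.net/browse/)?"
    r"([A-Z][A-Z0-9]+-\d+)\b"
)


def jira_refs(text: str) -> List[str]:
    """Return JIRA issue keys referenced with a 'fixes' keyword in the text."""
    return JIRA_REF.findall(text)
```

GitHub-style references such as `Fixes #26955` do not match this pattern; the commit says existing GitHub issue handling is kept as a separate path.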
Calle Wilund	d22e0acf0b	encryption::kms_host: Add exponential backoff-retry for 503 errors Refs #26822 AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503:s (service unavailable) for both ec2 meta and actual endpoint, doing exponential backoff. Note: we do _not_ retry forever. Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe. v2: * Use utils::exponential_backoff_retry	2025-11-11 21:02:32 +00:00
Calle Wilund	190e3666cb	encryption::kms_host: Include http error code in kms_error Keep track of actual HTTP failure.	2025-11-11 21:02:32 +00:00
Ferenc Szili	fcbc239413	load_stats: add test for migrate_tablet_size() This change adds tests which validate the functionality of load_stats::migrate_tablet_size()	2025-11-11 14:28:31 +01:00
Ferenc Szili	b77ea1b8e1	load_stats: fix problem with tablet size migration This patch fixes a bug with tablet size migration in load_stats. The has_tablet_size() lambda in topology_coordinator::migrate_tablet_size() was returning false in all cases due to an incorrect search iterator comparison after a table and tablet search. This change moves the load_stats migrate_tablet_size() functionality into a separate method of load_stats.	2025-11-11 14:26:09 +01:00
Yehuda Lebi	a05ebbbfbb	dist/docker: add configurable blocked-reactor-notify-ms parameter Add --blocked-reactor-notify-ms argument to allow overriding the default blocked reactor notification timeout value of 25 ms. This change provides users the flexibility to customize the reactor notification timeout as needed. Fixes: scylladb/scylla-enterprise#5525 Closes scylladb/scylladb#26892	2025-11-11 12:38:40 +02:00
Benny Halevy	a290505239	utils: stall_free: add dispose_gently dispose_gently consumes the object moved to it, clearing it gently before it's destroyed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26356	2025-11-11 12:20:18 +02:00
Yaron Kaikov	c601371b57	install-dependencies.sh: update node_exporter to 1.10.2 Update node exporter to solve CVE-2025-22871 [regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz ] Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-5 Closes scylladb/scylladb#26916	2025-11-11 11:36:13 +02:00
Nadav Har'El	b659dfcbe9	test/cqlpy: comment out Cassandra check that is no longer relevant In the test translated from Cassandra validation/operations/alter_test.py we had two lines in the beginning of an unrelated test that verified that CREATE KEYSPACE is not allowed without replication parameters. But starting recently, ScyllaDB does have defaults and does allow these CREATE KEYSPACE statements. So comment out these two test lines. We didn't notice that this test started to fail, because it was already marked xfail, because in the main part of this test, it reproduces a different issue! The annoying side effect of these no-longer-passing checks was that because the test expected a CREATE KEYSPACE to fail, it didn't bother to delete this keyspace when it finished, which causes test.py to report that there's a problem because some keyspaces still exist at the end of the test. Now that we fixed this problem, we no longer need to list this test in test/cqlpy/suite.yaml as a test that leaves behind undeleted keyspaces. Fixes #26292 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26341	2025-11-11 10:34:27 +02:00
Nikos Dragazis	56e5dfc14b	migration_manager: Add missing validations for schema extensions The migration manager offers some free functions to prepare mutations for a new/updated table/view. Most of them include a validation check for the schema extensions, but in the following ones it's missing: * `prepare_new_column_family_announcement` (overload with vector as out parameter) * `prepare_new_column_families_announcement` Presumably, this was just an omission. It's also not a very important one since the only extension having validation logic is the `encryption_schema_extension`, but none of these functions is connected to user queries where encryption options can be provided in the schema. User queries go through the other `prepare_new_column_family_announcement` overload, which does perform a validation check. Add validation in the missing places. Fixes #26470. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#26487	2025-11-11 10:08:58 +02:00
Botond Dénes	042303f0c9	Merge 'Alternator: enable tablets by default - depending on tablets_mode_for_new_keyspaces' from Nadav Har'El Before this series, Alternator's CreateTable operation defaults to creating a table replicated with vnodes, not tablets. The reasons for this default included missing support for LWT, Materialized Views, Alternator TTL and Alternator Streams if tablets are used. But today, all of these (except the still-experimental Alternator Streams) are now fully available with tablets, so we are finally ready to switch Alternator to use tablets by default in new tables. We will use the same configuration parameter that CQL uses, tablets_mode_for_new_keyspaces, to determine whether new keyspaces use tablets by default. If set to `enabled`, tablets are used by default on new tables. If set to `disabled`, tablets will not be used by default (i.e., vnodes will be used, as before). A third value, `enforced` is similar to `enabled` but forbids overriding the default to vnodes when creating a table. As before, the user can set a tag during the CreateTable operation to override the default choice of tablets or vnodes (unless in `enforced` mode). This tag is now named `system:initial_tablets` - whereas before this patch it was called `experimental:initial_tablets`. The rules stay the same as with the earlier, experimental:initial_tablets tag: when supplied with a numeric value, the table will use tablets. When supplied with something else (like a string "none"), the table will use vnodes. Fixes https://github.com/scylladb/scylladb/issues/22463 Backport to 2025.4, it's important not to delay phasing out vnodes. 
Closes scylladb/scylladb#26836 * github.com:scylladb/scylladb: test,alternator: use 3-rack clusters in tests alternator: improve error in tablets_mode_for_new_keyspaces=enforced config: make tablets_mode_for_new_keyspaces live-updatable alternator: improve comment about non-hidden system tags alternator: Fix test_ttl_expiration_streams() alternator: Fix test_scan_paging_missing_limit() alternator: Don't require vnodes for TTL tests alternator: Remove obsolete test from test_table.py alternator: Fix tag name to request vnodes alternator: Fix test name clash in test_tablets.py alternator: test_tablets.py handles new policy reg. tablets alternator: Update doc regarding tablets support alternator: Support `tablets_mode_for_new_keyspaces` config flag Fix incorrect hint for tablets_mode_for_new_keyspaces Fix comment for tablets_mode_for_new_keyspaces	2025-11-11 09:45:29 +02:00
Avi Kivity	bae2654b34	tools: dbuild: avoid `test -v` incompatibility with MacOS shell `test -v` isn't present on the MacOS shell. Since dbuild is intended as a compatibility bridge between the host environment and the build environment, don't use it there. Use ${var+text_if_set} expansion as a workaround. Fixes #26937 Closes scylladb/scylladb#26939	2025-11-11 09:43:14 +02:00
Nikos Dragazis	94c4f651ca	test/cqlpy: Test secondary index with short reads Add a test to check that paged secondary index queries behave correctly when pages are short. This is currently failing in Scylla, but passes in Cassandra 5, therefore marked as "xfailing". Refer to the test's docstring for more details. The bug is a regression introduced by commit `f6f18b1`. `test/cqlpy/run --release ...` shows that the test passes in 5.1 but fails in 5.2 onwards. Refs #25839. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com> Closes scylladb/scylladb#25843	2025-11-11 09:28:45 +02:00
Robert Bindar	a04ebb829c	Add cluster tests for checking scoped primary_replica_only streaming This commit adds tests checking various scenarios of restoring via load and stream with primary_replica_only and a scope specified. The tests check that in a few topologies, a mutation is replicated the correct number of times given primary_replica_only and that streaming happens according to the scope rule passed. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	817fdadd49	Improve choice distribution for primary replica I noticed during tests that `maybe_get_primary_replica` would not distribute the choice of primary replica uniformly, because `info.replicas` on some shards would have one order whilst on others it'd be ordered differently, thus making the function choose a node as primary replica multiple times when it clearly could've chosen a different node. This patch sorts the replica set before passing it through the scope filter. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	d4e43bd34c	Refactor cluster/object_store/test_backup This PR splits the support code from test_backup.py into multiple functions so less duplicated code is produced by new tests using it. It also makes it a bit easier to understand. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	c1b3fe30be	nodetool restore: add primary-replica-only option Add --primary-replica-only and update docs page for nodetool restore. The relationship with the scope parameter is: - scope=all primary_replica_only=true gets the global primary replica - scope=dc primary_replica_only=true gets the local primary replica - scope=rack primary_replica_only=true is like a noop, it gets the only replica in the rack (rf=#racks) - scope=node primary_replica_only=node is not allowed Fixes #26584 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	83aee954b4	nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only So far it was not allowed to pass a scope when using the primary_replica_only option, this patch enables it because the concepts are now combined so that: - scope=all primary_replica_only=true gets the global primary replica - scope=dc primary_replica_only=true gets the local primary replica - scope=rack primary_replica_only=true is like a noop, it gets the only replica in the rack (rf=#racks) - scope=node primary_replica_only=node is not allowed Fixes #26584 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	136b45d657	Enable scoped primary replica only streaming This patch removes the restriction for streaming to primary replica only within a scope. Node scope streaming to primary replica is disallowed. Fixes #26584 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:18:01 +02:00
Robert Bindar	965a16ce6f	Support primary_replica_only for native restore API Current native restore does not support primary_replica_only, it is hard-coded disabled and this may lead to data amplification issues. This patch extends the restore REST API to accept a primary_replica_only parameter and propagates it to sstables_loader so it gets correctly passed to load_and_stream. Fixes #26584 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2025-11-11 09:17:52 +02:00
Dawid Mędrek	394207fd69	test: Disable maintenance mode correctly in test_maintenance_mode.py Although setting the value of `maintenance_mode` to the string `"false"` disables maintenance mode, the testing framework misinterprets the value and thinks that it's actually enabled. As a result, it might try to connect to Scylla via the maintenance socket, which we don't want.	2025-11-10 19:22:06 +01:00
Dawid Mędrek	222eab45f8	test: Fix keyspace in test_maintenance_mode.py The keyspace used in the test is not necessarily called `ks`.	2025-11-10 19:21:58 +01:00
Dawid Mędrek	c0f7622d12	service/qos: Do not crash Scylla if auth_integration absent If the user connects to Scylla via the maintenance socket, it may happen that `auth_integration` has not been registered in the service level controller yet. One example is maintenance mode when that will never happen; another when the connection occurs before Scylla is fully initialized. To avoid unnecessary crashes, we add new branches if the passed user is absent or if it corresponds to the anonymous role. Since the role corresponding to a connection via the maintenance socket is the anonymous role, that solves the problem. In those cases, we completely circumvent any calls to `auth_integration` and handle them separately. The modified methods are: * `get_user_scheduling_group`, * `with_user_service_level`, * `describe_service_levels`. For the first two, the new behavior is in line with the previous implementation of those functions. The last behaves differently now, but since it's a soft error, crashing the node is not necessary anyway. We throw an exception instead, whose error message should give the user a hint of what might be wrong. The other uses of `auth_integration` within the service level controller are not problematic: * `find_effective_service_level`, * `find_cached_effective_service_level`. They take the name of a role as their argument. Since the anonymous role doesn't have a name, it's not possible to call them with it. Fixes scylladb/scylladb#26816	2025-11-10 19:21:36 +01:00
Yaron Kaikov	850ec2c2b0	Trigger scylla-ci directly from PR instead of scylla-ci-route job Refactoring scylla-ci to be triggered directly from each PR using GitHub action. This will allow us to skip triggering CI when PR commit message was updated (which will save us un-needed CI runs) Also we can remove `Scylla-CI-route` pipeline which route each PR to the proper CI job under the release (GitHub action will do it automatically), to reduce complexity Fixes: https://scylladb.atlassian.net/browse/PKG-69 Closes scylladb/scylladb#26799	2025-11-10 15:10:11 +02:00
Pavel Emelyanov	decf86b146	Merge 'Make AWS & Azure KMS boost testing use fixture + include Azure in pytests' from Calle Wilund * Adds test fixture for AWS KMS * Adds test fixture for Azure KMS * Adds key provider proxy for Azure to pytests (ported dtests) * Make test gather for boost tests handle suites * Fix GCP test snafu Fixes #26781 Fixes #26780 Fixes #26776 Fixes #26775 Closes scylladb/scylladb#26785 * github.com:scylladb/scylladb: gcp_object_storage_test: Re-enable parallelism. test::pylib: Add azure (mock) testing to EAR matrix test::boost::encryption_at_rest: Remove redundant azure test indent test::boost::encryption_at_rest: Move azure tests to use fixture test::lib: Add azure mock/real server fixture test::pylib::boost: Fix test gather to handle test suites utils::gcp::object_storage: Fix typo in semaphore init test::boost::encryption_at_rest_test: Remove redundant indent test::boost::test_encryption_at_rest: Move to AWS KMS fixture for kms test test::boost::test_encryption_at_rest: Reorder tests and helpers ent::encryption: Make text helper routines take std::string test::pylib::dockerized_service: Handle docker/podman bind error message test::lib::aws_kms_fixture: Add a fixture object to run mock AWS KMS test::lib::gcs_fixture: Only set port if running docker image + more retry	2025-11-10 14:35:05 +03:00
Michał Jadwiszczak	9345c33d27	service/storage_service: migrate staging sstables in view building worker during intra-node migration Use methods introduced in the previous commit and: - load staging sstables to the view building worker on the target shard, at the end of `streaming` stage - clear migrated staging sstables on source shard in `cleanup` stage This patch also removes skip mark in `test_staging_sstables_with_tablet_merge`. Fixes scylladb/scylladb#26244	2025-11-10 10:38:08 +01:00
Michał Jadwiszczak	4bc6361766	db/view/view_building_worker: support sstables intra-node migration We need to be able to load sstables on the target shard during intra-node tablet migration and to cleanup migrated sstables on the source shard.	2025-11-10 10:36:32 +01:00
Michał Jadwiszczak	c99231c4c2	db/view_building_worker: fix indent	2025-11-10 09:02:16 +01:00
Michał Jadwiszczak	2e8c096930	db/view/view_building_worker: don't organize staging sstables by last token There was a problem with staging sstables after tablet merge. Let's say there were 2 tablets and tablet 1 (lower last token) had a staging sstable. Then a tablet merge occurred, so there is only one tablet now (higher last token). But entries in `_staging_sstables`, which are grouped by last token, are never adjusted. Since there shouldn't be thousands of sstables, we can just hold a list of sstables per table and filter necessary entries when doing `process_staging` view building task.	2025-11-10 09:02:16 +01:00
Nadav Har'El	35f3a8d7db	docs/alternator: fix small mistake in compatibility.md docs/alternator/compatibility.md describes support for global (multi-DC) tables, and suggests that the CQL command "ALTER TABLE" should be used to change the replication of an Alternator table. But actually, the right command is "ALTER KEYSPACE", not "ALTER TABLE". So fix the document. Fixes #26737 Closes scylladb/scylladb#26872	2025-11-10 08:48:18 +03:00
Yauheni Khatsianevich	d3e62b15db	fix(test): minor typo fix, removing redundant param from logging Closes scylladb/scylladb#26901	2025-11-10 08:42:11 +03:00
Dario Mirovic	d364904ebe	test: dtest: audit_test.py: add AuditBackendComposite Add `AuditBackendComposite`, a test class which allows testing multiple audit outputs in a single run, implemented in `audit_composite_storage_helper` class. Add two more tests. `test_composite_audit_type_invalid` tests if an invalid audit mode among correct ones causes the same error as when it is the only specified audit mode. `test_composite_audit_empty_settings` tests if `'none'` audit mode, when specified along other audit modes, properly disables audit logging. Refs #26022	2025-11-10 00:31:34 +01:00
Dario Mirovic	a8ed607440	test: dtest: audit_test.py: group logs in dict per audit mode Before this patch audit test could process audit logs from a single audit output. This patch adds support for multiple audit outputs in the same run. The change is needed in order to test `audit_composite_storage_helper`, which can write to multiple audit outputs. Refs #26022	2025-11-10 00:31:34 +01:00
Dario Mirovic	afca230890	audit: write out to both table and syslog This patch adds support for multiple audit log outputs. If only one audit log output is enabled, the behavior does not change. If multiple audit log outputs are enabled, then the `audit_composite_storage_helper` class is used. It has a collection of `storage_helper` objects. Fixes #26022	2025-11-10 00:31:30 +01:00
Dario Mirovic	7ec9e23ee3	test: cqlpy: add test case for non-numeric PERCENTILE value Add test case for non-numeric PERCENTILE value, which raises an error different to the out-of-range invalid values. Regex in the test test_invalid_percentile_speculative_retry_values is expanded. Refs #26369	2025-11-09 13:59:36 +01:00
Dario Mirovic	85f059c148	schema: speculative_retry: update exception type for sstring ops Change speculative_retry::to_sstring and speculative_retry::from_sstring to throw exceptions::configuration_exception instead of std::invalid_argument. These errors can be triggered by CQL, so appropriate CQL exception should be used. Reference: https://github.com/scylladb/scylladb/issues/24748#issuecomment-3025213304 Refs #26369	2025-11-09 13:55:57 +01:00
Dario Mirovic	aba4c006ba	docs: cql: ddl.rst: update speculative-retry-options Clarify how the value of `XPERCENTILE` is handled: - Values 0 and 100 are supported - The percentile value is rounded to the nearest 0.1 (1 decimal place) Refs #26369	2025-11-09 13:23:29 +01:00
Dario Mirovic	5d1913a502	test: cqlpy: add test for valid speculative_retry values test_valid_percentile_speculative_retry_values is introduced to test that valid values for speculative_retry are properly accepted. Some of the values are moved from the test_invalid_percentile_speculative_retry_values test, because the previous commit added support for them. Refs #26369	2025-11-09 13:23:26 +01:00
Dario Mirovic	da2ac90bb6	schema: speculative_retry: allow 0 and 100 PERCENTILE values This patch allows specifying 0 and 100 PERCENTILE values in speculative_retry. It was possible to specify these values before #21825. #21825 prevented specifying invalid values, like -1 and 101, but also prevented using 0 and 100. On top of that, speculative_retry::to_sstring function did rounding when formatting the string, which introduced inconsistency. Fixes #26369	2025-11-09 12:26:27 +01:00
Nadav Har'El	65ed678109	test,alternator: use 3-rack clusters in tests With tablets enabled, we can't create an Alternator table on a three- node cluster with a single rack, since Scylla refuses RF=3 with just one rack and we get the error: An error occurred (InternalServerError) when calling the CreateTable operation: ... Replication factor 3 exceeds the number of racks (1) in dc datacenter1 So in test/cluster/test_alternator.py we need to use the incantation "auto_rack_dc='dc1'" every time that we create a three-node cluster. Before this patch, several tests in test/cluster/test_alternator.py failed on this error, with this patch all of them pass. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-09 12:52:29 +02:00
Nadav Har'El	c03081eb12	alternator: improve error in tablets_mode_for_new_keyspaces=enforced When in tablets_mode_for_new_keyspaces=enforced mode, Alternator is supposed to fail when CreateTable asks explicitly for vnodes. Before this patch, this error was an ugly "Internal Server Error" (an exception thrown from deep inside the implementation), this patch checks for this case in the right place, to generate a proper ValidationException with a proper error message. We also enable the test test_tablets_tag_vs_config which should have caught this error, but didn't because it was marked xfail because tablets_mode_for_new_keyspaces had not been live-updatable. Now that it is, we can enable the test. I also improved the test to be slightly faster (no need to change the configuration so many times) and also check the ordinary case - where the schema chooses neither vnodes nor tablets explicitly and we should just use the default. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-09 12:52:29 +02:00
Nadav Har'El	25439127c8	config: make tablets_mode_for_new_keyspaces live-updatable We have a configuration option "tablets_mode_for_new_keyspaces" which determines whether new keyspaces should use tablets or vnodes. For some reason, this configuration parameter was never marked live-updatable, so in this patch we add the flag. No other changes are needed - the existing code that uses this flag always uses it through the up-to-date configuration. In the previous patches we start to honor tablets_mode_for_new_keyspaces also in Alternator CreateTable, and we wanted to test this but couldn't do this in test/alternator because the option was not live-updatable. Now that it will be, we'll be able to test this feature in test/alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-09 12:52:29 +02:00
Nadav Har'El	b34f28dae2	alternator: improve comment about non-hidden system tags The previous patches added a somewhat misleading comment in front of system:initial_tablets, which this patch improves. That tag is NOT where Alternator "stores" table properties like the existing comment claimed. In fact, the whole point is that it's the opposite - Alternator never writes to this tag - it's a user-writable tag which Alternator reads, to configure the new table. And this is why it obviously can't be hidden from the user. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	eeb3a40afb	alternator: Fix test_ttl_expiration_streams() The test is now aware of the new name of the `system:initial_tablets` tag.	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	a659698c6d	alternator: Fix test_scan_paging_missing_limit() With tablets, the test began failing. The failure was correlated with the number of initial tablets, which when kept at default, equals 4 tablets per shard in release build and 2 tablets per shard in dev build. In this patch we split the test into two - one with more data in the table to check the original purpose of this test - that Scan doesn't return the entire table in one page if "Limit" is missing. The other test reproduces issue #10327 - that when the table is small, Scan's page size isn't strictly limited to 1MB as it is in DynamoDB. Experimentally, 8000 KB of data (compared to 6000 KB before this patch) is enough when we have up to 4 initial tablets per shard (so 8 initial tablets on a two-shard node as we typically run in tests). Original patch by Piotr Szymaniak <piotr.szymaniak@scylladb.com> modified by Nadav Har'El <nyh@scylladb.com>	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	345747775b	alternator: Don't require vnodes for TTL tests Since #23662 Alternator supports TTL with tablets too. Let's clear some leftovers causing Alternator to test TTL with vnodes instead of with what is default for Alternator (tablets or vnodes).	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	274d0b6d62	alternator: Remove obsolete test from test_table.py Since Alternator is capable of running with tablets according to the flag in config, remove the obsolete test that is making sure that Alternator runs with vnodes.	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	63897370cb	alternator: Fix tag name to request vnodes The tag was lately renamed from `experimental:initial_tablets` to `system:initial_tablets`. This commit fixes both the tests as well as the exceptions sent to the user instructing how to create table with vnodes.	2025-11-09 12:52:29 +02:00
Piotr Szymaniak	c7de7e76f4	alternator: Fix test name clash in test_tablets.py	2025-11-09 12:52:28 +02:00
Piotr Szymaniak	7466325028	alternator: test_tablets.py handles new policy reg. tablets Adjust the tests so they are in-line with the config flag `tablets_mode_for_new_keyspaces` that the Alternator learned to honour.	2025-11-09 12:52:28 +02:00
Piotr Szymaniak	35216d2f01	alternator: Update doc regarding tablets support Reflect honouring by Alternator the value of the config flag `tablets_mode_for_new_keyspaces`, as well as renaming of the tag `experimental:initial_tablets` into `system:initial_tablets`.	2025-11-09 12:52:28 +02:00
Piotr Szymaniak	376a2f2109	alternator: Support `tablets_mode_for_new_keyspaces` config flag Until now, tablets in Alternator were an experimental feature enabled only when a TAG "experimental:initial_tablets" was present when creating a table and associated with a numeric value. After this patch, Alternator honours the value of `tablets_mode_for_new_keyspaces` config flag. Each table can be overridden to use tablets or not by supplying a new TAG "system:initial_tablets". The rules stay the same as with the earlier, experimental tag: when supplied with a numeric value, the table will use tablets (as long as they are supported). When supplied with something else (like a string "none"), the table will use vnodes, provided that tablets are not `enforced` by the config flag. Fixes #22463	2025-11-09 12:52:17 +02:00
Piotr Szymaniak	af00b59930	Fix incorrect hint for tablets_mode_for_new_keyspaces	2025-11-09 10:49:46 +02:00
Piotr Szymaniak	403068cb3d	Fix comment for tablets_mode_for_new_keyspaces The comment was not listing all the 3 possible values correctly, despite an explanation just below covers all 3 values.	2025-11-09 10:49:46 +02:00
Botond Dénes	cdba3bebda	Merge 'Generalize directory checks in database_test's snapshot test cases' from Pavel Emelyanov Those test cases use lister::scan_dir() to validate the contents of snapshot directory of a table against this table's base directory. This PR generalizes the listing code making it shorter. Also, the snapshot_skip_flush_works case is missing the check for "schema.cql" file. Nothing is wrong with it, but the test is more accurate if checking it. Also, the snapshot_with_quarantine_works case tries to check if one set of names is sub-set of another using lengthy code. Using std::includes improves the test readability a lot. Also, the PR replaces lister::scan_dir() with directory_lister. The former is going to be removed some day (see also #26586) Improving existing working test, no backport is needed. Closes scylladb/scylladb#26693 * github.com:scylladb/scylladb: database_test: Simplify snapshot_with_quarantine_works() test database_test: Improve snapshot_skip_flush_works test database_test: Simplify snapshot_works() tests database_test: Use collect_files() to remove files database_test: Use collectz_files() to count files in directory database_test: Introduce collect_files() helper	2025-11-07 16:04:02 +02:00
Michał Chojnowski	b82c2aec96	sstables/trie: fix an assertion violation in bti_partition_index_writer_impl::write_last_key _last_key is a multi-fragment buffer. Some prefix of _last_key (up to _last_key_mismatch) is unneeded because it's already a part of the trie. Some suffix of _last_key (after needed_prefix) is unneeded because _last_key can be differentiated from its neighbors even without it. The job of write_last_key() is to find the middle fragments, (containing the range `[_last_key_mismatch, needed_prefix)`) trim the first and last of the middle fragments appropriately, and feed them to the trie writer. But there's an error in the current logic, in the case where `_last_key_mismatch` falls on a fragment boundary. To describe it with an example, if the key is fragmented like `aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`, then the intended output to the trie writer is `bbb|c`, but the actual output is `|bbb|c`. (I.e. the first fragment is empty). Technically the trie writer could handle empty fragments, but it has an assertion against them, because they are a questionable thing. Fix that. We also extend bti_index_test so that it's able to hit the assert violation (before the patch). The reason why it wasn't able to do that before the patch is that the violation requires decorated keys to differ on the _first_ byte of a partition key column, but the keys generated by the test only differed on the last byte of the column. (Because the test was using sequential integers to make the values more human-readable during debugging). So we modify the key generation to use random values that can differ on any position. Fixes scylladb/scylladb#26819 Closes scylladb/scylladb#26839	2025-11-07 11:25:07 +02:00
Abhinav Jha	ab0e0eab90	raft topology: skip non-idempotent steps in decommission path to avoid problems during races In the present scenario, there are issues in left_token_ring transition state execution in the decommissioning path. In case of concurrent mutation race conditions, we enter left_token_ring more than once, and apparently if we enter left_token_ring a second time, we try to barrier the decommissioned node, which at this point is no longer possible. That's what causes the errors. This PR resolves the issue by adding a check right at the start of left_token_ring to check if the first topology state update, which marks the request as done, is completed. In this case, it's confirmed that this is the second time the flow is entering left_token_ring and the steps preceding the request status update should be skipped. In such cases, all the remaining steps are skipped and the topology node status update (which threw an error in the previous trial) is executed directly. Node removal status from group0 is also checked and the remove operation is retried if it failed last time. Although these changes are done with regard to the decommission operation behavior in `left_token_ring` transition state, since the PR doesn't interfere with the core logic, it should not derail any rollback specific logic. The changes just prevent some non-idempotent operations from recurring in case of failures. The rest of the core logic remains intact. A test is also added to confirm the proper working of the same. Fixes: scylladb/scylladb#20865 Backport is not needed, since this is not a super critical bug fix. Closes scylladb/scylladb#26717	2025-11-07 10:07:49 +01:00
Ran Regev	aaf53e9c42	nodetool refresh primary-replica-only Fixes: #26440 1. Added description to primary-replica-only option 2. Fixed code text to better reflect the constraints checked in the code itself. namely: that both primary replica only and scope must be applied only if load and stream is applied too, and that they are mutually exclusive to each other. Note: when https://github.com/scylladb/scylladb/issues/26584 is implemented (with #26609) there will be a need to align the docs as well - namely, primary-replica-only and scope will no longer be mutually exclusive Signed-off-by: Ran Regev <ran.regev@scylladb.com> Closes scylladb/scylladb#26480	2025-11-07 10:59:27 +02:00
Avi Kivity	245173cc33	tools: toolchain: optimized_clang: remove unused variable CLANG_SUFFIX The variable was unused since `cae999c094` ("toolchain: change optimized clang install method to standard one"), and now causes the differential shellcheck continuous integration test to fail whenever it is changed. Remove it. Closes scylladb/scylladb#26796	2025-11-07 10:08:23 +02:00
Patryk Jędrzejczak	d6c64097ad	Merge 'storage_proxy: use gates to track write handlers destruction' from Petr Gusev In [#26408](https://github.com/scylladb/scylladb/pull/26408) a `write_handler_destroy_promise` class was introduced to wait for `abstract_write_response_handler` instances destruction. We strived to minimize the memory footprint of `abstract_write_response_handler`, with `write_handler_destroy_promise`-es we required only a single additional int. It turned out that in some cases a lot of write handlers can be scheduled for deletion at the same time, in such cases the vector can become big and cause 'oversized allocation' seastar warnings. Another concern with `write_handler_destroy_promise`-es [was that they were more complicated than it was worth](https://github.com/scylladb/scylladb/pull/26408#pullrequestreview-3361001103). In this commit we replace `write_handler_destroy_promise` with simple gates. One or more gates can be attached to an `abstract_write_response_handler` to wait for its destruction. We use `utils::small_vector` to store the attached gates. The limit 2 was chosen because we expect two gates at the same time in most cases. One is `storage_proxy::_write_handlers_gate`, which is used to wait for all handlers in `cancel_all_write_response_handlers`. Another one can be attached by a caller of `cancel_write_handlers`. Nothing stops several cancel_write_handlers from being called at the same time, but it should be rare. The `sizeof(utils::small_vector) == 40`, this is `40.0 / 488 * 100 ~ 8%` increase in `sizeof(abstract_write_response_handler)`, which seems acceptable. Fixes [scylladb/scylladb#26788](https://github.com/scylladb/scylladb/issues/26788) backport: need to backport to 2025.4 (LWT for tablets release) Closes scylladb/scylladb#26827 * https://github.com/scylladb/scylladb: storage_proxy: use coroutine::maybe_yield(); storage_proxy: use gates to track write handlers destruction	2025-11-06 10:17:04 +01:00
Nadav Har'El	b8da623574	Update tools/cqlsh submodule * tools/cqlsh f852b1f5...19445a5c (2): > Update scylla-driver version to 3.29.4 Update tools/cqlsh submodule for scylla-driver 3.29.4 The motivation for this update is to resolve a driver-side serialization bug that was blocking work on #26740. The bug affected vector<collection> types (e.g., vector<set<int>,1>) and is fixed in scylla-driver versions 3.29.2+. Refs #26704	2025-11-06 10:01:26 +02:00
Asias He	dbeca7c14d	repair: Add metric for time spent on tablet repair It is useful to check time spent on tablet repair. It can be used to compare incremental repair and non-incremental repair. The time does not include the time waiting for the tablet scheduler to schedule the tablet repair task. Fixes #26505 Closes scylladb/scylladb#26502	2025-11-06 10:00:20 +03:00
Dario Mirovic	c3a673d37f	audit: move storage helper creation from `audit::start` to `audit::audit` Extract storage helper creation into `create_storage_helper` function. Call this function from `audit::audit`. It will be called per shard inside `sharded<audit>::start` method. Refs #26022	2025-11-06 03:05:43 +01:00
Dario Mirovic	28c1c0f78d	audit: fix formatting in `audit::start_audit` Refs #26022	2025-11-06 03:05:17 +01:00
Dario Mirovic	549e6307ec	audit: unify `create_audit` and `start_audit` There is no need to have `create_audit` separate from `start_audit`. `create_audit` just stores the passed parameters, while `start_audit` does the actual initialization and startup work. Refs #26022	2025-11-06 03:05:06 +01:00
Calle Wilund	b0061e8c6a	gcp_object_storage_test: Re-enable parallelism. Re-enable parallel execution to get better logs. Note, this is somewhat wasteful, as we won't re-use test fixture here, but in the end, it is probably an improvement.	2025-11-05 15:07:26 +00:00
Wojciech Mitros	0a22ac3c9e	mv: don't mark the view as built if the reader produced no partitions When we build a materialized view we read the entire base table from start to end to generate all required view updates. If a view is created while another view is being built on the same base table, this is optimized - we start generating view updates for the new view from the base table rows that we're currently reading, and we read the missed initial range again after the previous view finishes building. The view building progress is only updated after generating view updates for some read partitions. However, there are scenarios where we'll generate no view updates for the entire read range. If this was not handled we could end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293 To handle this, we mark the view as built if the reader generated no partitions. However, this is not always the correct conclusion. Another scenario where the reader won't encounter any partitions is when view building is interrupted, and then we perform a reshard. In this scenario, we set the reader for all shards to the last unbuilt token for an existing partition before the reshard. However, this partition may not exist on a shard after reshard, and if there are also no partitions with higher tokens, the reader will generate no partitions even though it hasn't finished view building. Additionally, we already have a check that prevents infinite view building loops without taking the partitions generated by the reader into account. At the end of stream, before looping back to the start, we advance current_key to the end of the built range and check for built views in that range. This handles the case where the entire range is empty - the conditions for a built view are: 1. the "next_token" is no greater than "first_token" (the view building process looped back, so we've built all tokens above "first_token") 2.
the "current_token" is no less than "first_token" (after looping back, we've built all tokens below "first_token") If the range is empty, we'll pass these conditions on an empty range after advancing "current_key" to the end because: 1. after looping back, "next_token" will be set to `dht::minimum_token` 2. "current_key" will be set to `dht::ring_position::max()` In this patch we remove the check for partitions generated by the reader. This fixes the issue with resharding and it does not resurrect the issue with infinite view building that the check was introduced for. Fixes https://github.com/scylladb/scylladb/issues/26523 Closes scylladb/scylladb#26635	2025-11-05 17:02:32 +02:00
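The two "built view" conditions in the message above can be read as a simple predicate. The following is a minimal Python sketch, assuming tokens are modelled as plain integers; `view_is_built`, `MIN_TOKEN`, and `MAX_TOKEN` are hypothetical names used for illustration only and are not ScyllaDB identifiers.

```python
# Hypothetical integer stand-ins for dht::minimum_token and the token
# reached when current_key is advanced to dht::ring_position::max().
MIN_TOKEN = -(2**63)
MAX_TOKEN = 2**63 - 1


def view_is_built(first_token: int, next_token: int, current_token: int) -> bool:
    """Return True when both conditions from the commit message hold."""
    # Condition 1: the build looped back, so tokens above first_token are done.
    looped_back = next_token <= first_token
    # Condition 2: after looping back, tokens below first_token are done.
    covered_below = current_token >= first_token
    return looped_back and covered_below


# Empty range: after looping back next_token is the minimum token and
# current_key has been advanced to the maximum position.
print(view_is_built(first_token=42, next_token=MIN_TOKEN, current_token=MAX_TOKEN))
```

Under these assumptions an empty range satisfies both conditions without consulting how many partitions the reader produced, which is the argument the message makes for removing that check.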
Petr Gusev	5bda226ff6	storage_proxy: use coroutine::maybe_yield(); This is a small "while at it" refactoring -- better to use coroutine::maybe_yield with co_await-s.	2025-11-05 14:38:19 +01:00
Petr Gusev	4578304b76	storage_proxy: use gates to track write handlers destruction In #26408 a write_handler_destroy_promise class was introduced to wait for abstract_write_response_handler instances destruction. We strived to minimize the memory footprint of abstract_write_response_handler, with write_handler_destroy_promise-es we required only a single additional int. It turned out that in some cases a lot of write handlers can be scheduled for deletion at the same time, in such cases the vector<write_handler_destroy_promise> can become big and cause 'oversized allocation' seastar warnings. Another concern with write_handler_destroy_promise-es was that they were more complicated than it was worth. In this commit we replace write_handler_destroy_promise with simple gates. One or more gates can be attached to an abstract_write_response_handler to wait for its destruction. We use utils::small_vector<gate::holder, 2> to store the attached gates. The limit 2 was chosen because we expect two gates at the same time in most cases. One is storage_proxy::_write_handlers_gate, which is used to wait for all handlers in cancel_all_write_response_handlers. Another one can be attached by a caller of cancel_write_handlers. Nothing stops several cancel_write_handlers from being called at the same time, but it should be rare. The sizeof(utils::small_vector<gate::holder, 2>) == 40, this is 40.0 / 488 * 100 ~ 8% increase in sizeof(abstract_write_response_handler), which seems acceptable. Fixes scylladb/scylladb#26788	2025-11-05 14:37:52 +01:00
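The size estimate quoted in the message above can be reproduced directly. This is only a worked version of the arithmetic, using the two byte counts stated in the message; the variable names are illustrative.

```python
# Both sizes are taken from the commit message, in bytes.
small_vector_size = 40   # sizeof(utils::small_vector<gate::holder, 2>)
handler_size = 488       # the handler size used as the base in the message

increase_pct = small_vector_size / handler_size * 100
print(f"{increase_pct:.1f}%")
```

The result rounds to the "~ 8%" stated in the message.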
Nadav Har'El	8a07b41ae4	test/cqlpy: add test confirming page_size=0 disables paging In pull request #26384 a discussion started whether page_size=0 really disables paging, or maybe one needs page_size=-1 to truly disable paging. The reason for that discussion was commit `08c81427b` that started to use page_size=-1 for internal unpaged queries, and commit `76b31a3` that incorrectly claimed that page_size>=0 means paging is enabled. This patch introduces a test that confirms that with page_size=0, paging is truly disabled - including the size-based (1MB) paging. The new test is Scylla-only, because Cassandra is anyway missing the size-based page cutoff (see CASSANDRA-11745). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26742	2025-11-05 15:52:16 +03:00
Tomasz Grabiec	f8879d797d	tablet_allocator: Avoid load balancer failure when replacing the last node in a rack Introduced in `9ebdeb2`. The problem is specific to node replacing and rack-list RF. The culprit is in the part of the load balancer which determines rack's shard count. If we're replacing the last node, the rack will contain no normal nodes, and shards_per_rack will have no entry for the rack, on which the table still has replicas. This throws std::out_of_range and fails the tablet draining stage, and the node replace fails. No backport because the problem exists only on master. Fixes #26768 Closes scylladb/scylladb#26783	2025-11-05 15:49:51 +03:00
Avi Kivity	8e480110c2	dist: housekeeping: set python.multiprocessing fork mode to "fork" Python 3.14 changed the multiprocessing fork mode to "forkserver", presumably for good reasons. However, it conflicts with our relocatable Python system. "forkserver" forks and execs a Python process at startup, but it does this without supplying our relocated ld.so. The system ld.so detects a conflict and crashes. Fix this by switching back to "fork", which is sufficient for housekeeping's modest needs. Closes scylladb/scylladb#26831	2025-11-05 15:47:38 +03:00
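For readers unfamiliar with the start-method setting mentioned above, the standard library exposes it through `multiprocessing.set_start_method`. The sketch below shows one way to select "fork" explicitly; it is an illustration on a POSIX system and is not taken from the housekeeping script itself. The helper name `force_fork_start_method` is hypothetical.

```python
import multiprocessing


def force_fork_start_method() -> str:
    """Select the "fork" start method and return the method now in effect."""
    # force=True allows overriding a start method that was already chosen.
    # "fork" is only available on POSIX platforms.
    multiprocessing.set_start_method("fork", force=True)
    return multiprocessing.get_start_method()


if __name__ == "__main__":
    print(force_fork_start_method())
```

With "fork" the child is created from the running interpreter rather than by executing a fresh Python process, which is the behaviour the message relies on.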
Pavel Emelyanov	05d711f221	database_test: Simplify snapshot_with_quarantine_works() test The test collects Data files from table dir, then _all_ files from snapshot dir and then checks whether the former is the subset of the latter. Using std::includes over two sets makes the code much shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:35:28 +03:00
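The subset check described above (which the commit implements in C++ with std::includes over two sets) has a direct Python analogue. The sketch below assumes both directory listings are already available as sets of file names; `snapshot_contains_all` and the file names are hypothetical and chosen only for illustration.

```python
def snapshot_contains_all(table_files: set, snapshot_files: set) -> bool:
    """Return True when every table file also appears in the snapshot."""
    # For sets, "<=" is the subset test.
    return table_files <= snapshot_files


# Hypothetical listings: the snapshot holds the Data file plus extra files.
table_files = {"me-1-big-Data.db"}
snapshot_files = {"me-1-big-Data.db", "manifest.json", "schema.cql"}
print(snapshot_contains_all(table_files, snapshot_files))
```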
Pavel Emelyanov	c8492b3562	database_test: Improve snapshot_skip_flush_works test It has two inaccuracies. First, when checking the contents of table directory, it uses pre-populated expected list with "manifest.json" in it. Weird. Second, when checking the contents of snapshot directory it doesn't check if the "schema.cql" is there. It's always there, but if something breaks in the future it may go unnoticed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:35:26 +03:00
Pavel Emelyanov	5a25d74b12	database_test: Simplify snapshot_works() tests No functional changes here, just make use of the new lister to shorten the code. A small side effect -- if the test fails because contents of directories changes, it will print the exact difference in logs, not just that N files are missing/present. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:34:25 +03:00
Pavel Emelyanov	365044cdbb	database_test: Use collect_files() to remove files Some test cases remove files from table directory to perform some checks over the taken snapshots. Using collect_files() helper makes the code easier to read. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:34:24 +03:00
Pavel Emelyanov	e1f326d133	database_test: Use collect_files() to count files in directory Some test cases want to see that there are more than one file in a directory, so they can just re-use the new helper. Much shorter this way. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:32:58 +03:00
Pavel Emelyanov	60d1f78239	database_test: Introduce collect_files() helper It returns a set of files in a given directory. Will be used by all next patches. Implemented using directory_lister, not lister::scan_dir in order to help removing the latter one in the future. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-05 15:32:58 +03:00
Calle Wilund	6c6105e72e	test::pylib: Add azure (mock) testing to EAR matrix Fixes #26782 Adds a provider proxy for azure, using the existing mock server, now as a fixture.	2025-11-05 10:22:23 +00:00
Calle Wilund	b8a6b6dba9	test::boost::encryption_at_rest: Remove redundant azure test indent	2025-11-05 10:22:23 +00:00
Calle Wilund	10e591bd6b	test::boost::encryption_at_rest: Move azure tests to use fixture Fixes #26781 Makes the test independent of wrapping scripts. Note: retains the split into "real" and "mock" tests. For other tests, we either all mock, or allow the environment to select mock or real. Here we have them combined. More expensive, but on the other hand more thorough.	2025-11-05 10:22:22 +00:00
Calle Wilund	1d37873cba	test::lib: Add azure mock/real server fixture Wraps the real/mock azure server for test in a fixture. Note: retains the current test setup which explicitly runs some tests with "real" azure, if avail, and some always mock.	2025-11-05 10:22:22 +00:00
Calle Wilund	10041419dc	test::pylib::boost: Fix test gather to handle test suites Fixes #26775	2025-11-05 10:22:22 +00:00
Calle Wilund	565c701226	utils::gcp::object_storage: Fix typo in semaphore init Fixes #26776 Semaphore storage is ssize_t, not size_t.	2025-11-05 10:22:22 +00:00
Calle Wilund	2edf6cf325	test::boost::encryption_at_rest_test: Remove redundant indent Removed empty scope and reindents kms test using fixtures.	2025-11-05 10:22:22 +00:00
Calle Wilund	286a655bc0	test::boost::test_encryption_at_rest: Move to AWS KMS fixture for kms test Fixes #26780 Uses fake/real CI endpoint for AWS KMS tests, and moves these into a suite for sharing the mock server.	2025-11-05 10:22:22 +00:00
Calle Wilund	a1cc866f35	test::boost::test_encryption_at_rest: Reorder tests and helpers No code changes. Just reorders code to organize more by provider etc, prepping for fixtures and test suites.	2025-11-05 10:22:22 +00:00
Calle Wilund	af85b7f61b	ent::encryption: Make text helper routines take std::string Moving away from custom string type. Pure cosmetics.	2025-11-05 10:22:22 +00:00
Calle Wilund	1b0394762e	test::pylib::dockerized_service: Handle docker/podman bind error message If we run non-dbuild, docker/podman can/will cause first bind error, we should check these too.	2025-11-05 10:22:22 +00:00
Calle Wilund	0842b2ae55	test::lib::aws_kms_fixture: Add a fixture object to run mock AWS KMS Runs local-kms mock AWS KMS server unless overridden by env var. Allows tests to use real or fake AWS KMS endpoint and shared fixture for quicker execution.	2025-11-05 10:22:21 +00:00
Calle Wilund	98c060232e	test::lib::gcs_fixture: Only set port if running docker image + more retry Our connect can spuriously fail. Just retry.	2025-11-05 10:22:21 +00:00
Wojciech Mitros	977fa91e3d	view_building_coordinator: rollback tasks on the leaving tablet replica When a tablet migration is started, we abort the corresponding view building tasks (i.e. we change the state of those tasks to "ABORTED"). However, we don't change the host and shard of these tasks until the migration successfully completes. When for some reason we have to rollback the migration, that means the migration didn't finish and the aborted task still has the host and shard of the migration source. So when we recreate tasks that should no longer be aborted due to a rolled-back migration, we should look at the aborted tasks of the source (leaving) replica. But we don't do it and we look at the aborted tasks of the target replica. In this patch we adjust the rollback mechanism to recreate tasks for the migration source instead of destination. We also fix the test that should have detected this issue - the injection that the test was using didn't make us rollback, but we simply retried a stage of the tablet migration. By using one_shot=False and adding a second injection, we can now guarantee that the migration will eventually fail and we'll continue to the 'cleanup_target' and 'revert_migration' stages. Fixes https://github.com/scylladb/scylladb/issues/26691 Closes scylladb/scylladb#26825	2025-11-05 10:44:06 +01:00
Pavel Emelyanov	2cb98fd612	Merge 'api: storage_service: tasks: unify sync and async compaction APIs' from Aleksandra Martyniuk Currently, all apis that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Unify the handlers of synchronous and asynchronous cleanup, major compaction, and upgrade_sstables. Fixes: https://github.com/scylladb/scylladb/issues/26715. Requires backports to all live versions Closes scylladb/scylladb#26746 * github.com:scylladb/scylladb: api: storage_service: tasks: unify upgrade_sstable api: storage_service: tasks: force_keyspace_cleanup api: storage_service: tasks: unify force_keyspace_compaction	2025-11-05 10:47:14 +03:00
Pavel Emelyanov	59019bc9a9	Merge 'Alternator: allow warning on auth errors before enabling enforcement' from Nadav Har'El An Alternator user was recently "bit" when switching `alternator_enforce_authorization` from "false" to "true": After the configuration change, all application requests suddenly failed because unbeknownst to the user, their application used incorrect secret keys. This series introduces a solution for users who want to safely switch `alternator_enforce_authorization` from "false" to "true": Before switching from "false" to "true", the user can temporarily switch a new option, `alternator_warn_authorization`, to true. In this "warn" mode, authentication and authorization errors are counted in metrics (`scylla_alternator_authentication_failures` and `scylla_alternator_authorization_failures`) and logged as WARNings, but the user's application continues to work. The user can use these metrics or log messages to learn of errors in their application's setup, fix them, and only do the switch of `alternator_enforce_authorization` when the metrics or log messages show there are no more errors. The first patch is the implementation of the feature - the new configuration option, the metrics and the log messages, the second patch is a test for the new feature, and the third patch is documentation recommending how to use the warn mode and the associated metrics or log messages to safely switch `alternator_enforce_authorization` from false to true. Fixes #25308 This is a feature that users need, so it should probably be backported to live branches. Closes scylladb/scylladb#25457 * github.com:scylladb/scylladb: docs/alternator: explain alternator_warn_authorization test/alternator: tests for new auth failure metrics and log messages alternator: add alternator_warn_authorization config	2025-11-05 10:45:17 +03:00
Pavel Emelyanov	fc37518aff	test: Check file existence directly There's a test that checks if temporary-statistics file is gone at some point. It does it by listing the directory it expects the file to be in and then comparing the names met with the temp. stat. file name. It looks like a single file_exists() call is enough for that purpose. As a "sanity" check this patch adds a validation that non-temporary statistics file is there, all the more so this file is removed after the test. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26743	2025-11-04 19:37:55 +01:00
Avi Kivity	95700c5f7f	Merge 'Support counters with tablets' from Michael Litvak Support the counters feature in tablets keyspaces. The main change is to fix the counter update during tablets intranode migration. Counter cell is c = map<host_id, value>. A counter update is applied by doing read-modify-write on a leader replica to retrieve the current host's counter value and transform the mutation to contain the updated value for the host, then apply the mutation and replicate it to other hosts. the read-modify-write is protected against concurrent updates by locking the counter cell. When the counter is migrated between two shards, it's not enough to lock the counter on the read shard, because in the stage write_both_read_new the read shard is switched, and then we can have concurrent updates reach either the old or the new shard. In order to keep the counter update exclusive we lock both shards when in the stage write_both_read_new. Also, when applying the transformed mutation we need to respect write_both stages and apply the mutation on both shards. We change it to use `apply_on_shards` similarly to other methods in storage proxy. The change applies to both tablets and vnodes, they use the same implementation, but for vnodes the behavior should remain equivalent up to some small reordering of the code since it doesn't have intranode migration and reduces to single read shard = write shard. 
Fixes https://github.com/scylladb/scylladb/issues/18180 no backport - new feature Closes scylladb/scylladb#26636 * github.com:scylladb/scylladb: docs: counters now work with tablets pgo: enable counters with tablets test: enable counters tests with tablets test: add counters with tablets test cql3: remove warning when creating keyspace with tablets cql3: allow counters with tablets storage_proxy: lock all read shards for counter update storage_proxy: apply counter mutation on all write shards storage_proxy: move counter update coordination to storage proxy storage_proxy: refactor mutate_counter_on_leader replica/db: add counter update guard replica/db: split counter update helper functions	2025-11-03 22:28:10 +01:00
Raphael S. Carvalho	7f34366b9d	sstables_loader: Don't bypass synchronization with busy topology The patch `c543059f86` fixed the synchronization issue between tablet split and load-and-stream. The synchronization worked only with raft topology, and therefore was disabled with gossip. The check used storage_service::raft_topology_change_enabled(), but the topology kind is only available/set on shard 0, so the synchronization was bypassed when load-and-stream ran on any shard other than 0. The reason the reproducer didn't catch it is that it was restricted to a single CPU. It will now run with multiple CPUs and catch the problem observed. Fixes #22707 Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#26730	2025-11-03 18:10:08 +01:00
Michael Litvak	8555fd42df	docs: counters now work with tablets Counters are now supported in tablet-enabled keyspaces, so remove the documentation that listed counters as an unsupported feature and the note warning users about the limitation.	2025-11-03 16:04:37 +01:00
Michael Litvak	1337f4213f	pgo: enable counters with tablets Now that counters are supported with tablets, update the keyspace statement for counters to allow it to run with tablets.	2025-11-03 16:04:37 +01:00
Michael Litvak	1dbf53ca29	test: enable counters tests with tablets Enable all counters-related tests that were disabled for tablets because counters was not supported with tablets until now. Some tests were parametrized to run with both vnodes and tablets, and the tablets case was skipped, in order to not lose coverage. We change them to run with the default configuration since now counters is supported with both vnodes and tablets, and the implementation is the same, so there is no benefit in running them with both configurations.	2025-11-03 16:04:37 +01:00
Michael Litvak	a6c12ed1ef	test: add counters with tablets test add a new test for counters with tablets to test things that are specific to tablets. test counter updates that are concurrent with tablet internode and intranode migrations and verify it remains consistent and no updates are lost.	2025-11-03 16:04:37 +01:00
Michael Litvak	60ac13d75d	cql3: remove warning when creating keyspace with tablets When creating a keyspace with tablets, a warning is shown with all the unsupported features for tablets, which is only counters currently. Now that counters is also supported with tablets, we can remove this warning entirely.	2025-11-03 16:04:37 +01:00
Michael Litvak	9208b2f317	cql3: allow counters with tablets Now that counters work with tablets, allow to create a table with counters in a tablets-enabled keyspace, and remove the warning about counters not being supported when creating a keyspace with tablets. We allow to use counters with tablets only when all nodes are upgraded and support counters with tablets. We add a new feature flag to determine if this is the case. Fixes scylladb/scylladb#18180	2025-11-03 16:04:37 +01:00
Michael Litvak	296b116ae2	storage_proxy: lock all read shards for counter update Previously in a counter update we lock the read shard to protect the counter's read-modify-write against concurrent updates. This is not sufficient when the counter is migrated between different shards, because there is a stage where the read shard switches from the old shard to the new shard, and during that switch there can be concurrent counter updates on both shards. If each shard takes only its own lock, the operations will not be exclusive anymore, and this can cause lost counter updates. To fix this, we acquire the counter lock on both shards in the stage write_both_read_new, when both shards can serve reads. This guarantees that counter updates continue to be exclusive during intranode migration.	2025-11-03 16:04:35 +01:00
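The idea of holding the counter lock on every shard that can serve reads can be illustrated with ordinary thread locks. This is a loose analogy, not the storage_proxy code: Seastar shards are not threads, and `shard_locks`, `counter_update`, and the sorted acquisition order are all assumptions made for this sketch.

```python
import threading
from contextlib import ExitStack

# Hypothetical per-shard locks guarding one counter cell.
shard_locks = {0: threading.Lock(), 1: threading.Lock()}
counter = {"value": 0}


def counter_update(read_shards, delta):
    """Apply a read-modify-write while holding the lock of every read shard."""
    with ExitStack() as stack:
        # Acquire in a fixed order (an assumption of this sketch) so that
        # two concurrent updaters cannot deadlock on each other.
        for shard in sorted(read_shards):
            stack.enter_context(shard_locks[shard])
        current = counter["value"]          # read
        counter["value"] = current + delta  # modify and write


def worker(read_shards):
    for _ in range(1000):
        counter_update(read_shards, 1)


# Both updaters see two read shards, as in write_both_read_new.
threads = [threading.Thread(target=worker, args=({0, 1},)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter["value"])
```

If each updater instead took only the lock of "its own" shard, the two read-modify-write sequences could interleave, which is the lost-update scenario the message describes.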
Michael Litvak	de321218bc	storage_proxy: apply counter mutation on all write shards When applying a counter mutation, use apply_on_shards to apply the mutation on all write shards, similarly to the way other mutations are applied in the storage proxy. Previously the mutation was applied only on the current shard which is the read shard. This is needed to respect the write_both stages of intranode migration where we need to apply the mutation on both the old and the new shards.	2025-11-03 16:03:29 +01:00
Michael Litvak	c7e7a9e120	storage_proxy: move counter update coordination to storage proxy Refactor the counter update to split the functions and have them called by the storage proxy to prepare for a later change. Previously in mutate_counter the storage proxy calls the replica function apply_counter_update that does a few things: 1. checks that the operation can be done: check timeout, disk utilization 2. acquire counter locks 3. do read-modify-write and transform the counter mutation 4. apply the mutation in the replica In this commit we change it so that these functions are split and called from the storage proxy, so that we have better control from the storage proxy when we change it later to work across multiple shards. For example, we will want to acquire locks on multiple shards, transform it on one shard, and then apply the mutation on multiple shards. After the change it works as follows in storage proxy: 1. acquire counter locks 2. call replica prepare to check the operation and transform the mutation 3. call replica apply to apply the transformed mutation	2025-11-03 15:59:46 +01:00
Tomasz Grabiec	e878042987	Revert "Revert "tests(lwt): new test for LWT testing during tablet resize"" This reverts commit `6cb14c7793`. The issue causing the previous revert was fixed in `88765f627a`.	2025-11-03 10:38:00 +01:00
Michael Litvak	579031cfc8	storage_proxy: refactor mutate_counter_on_leader Slightly reorganize the mutate counter function to prepare it for a later change. Move the code that finds the read shard and invokes the rest of the function on the read shard to the caller function. This simplifies the function mutate_counter_on_leader_and_replicate which now runs on the read shard and will make it easier to extend.	2025-11-03 08:43:11 +01:00
Michael Litvak	7cc6b0d960	replica/db: add counter update guard Add a RAII guard for counter update that holds the counter locks and the table operation, and extract the creation of the guard to a separate function. This prepares it for a later change where we will want to obtain the guard externally from the storage proxy.	2025-11-03 08:43:11 +01:00
Michael Litvak	88fd9a34c4	replica/db: split counter update helper functions Split do_apply_counter_update to a few smaller and simpler functions to help prepare for a later change.	2025-11-03 08:43:11 +01:00
Avi Kivity	9b6ce030d0	sstables: remove quadratic (and possibly exponential) compile time in parse() parse() taking a list of elements is quadratic (during compile time) in that it generates recursive calls to itself, each time with one fewer parameter. The total size of the parameter lists in all these generated functions is quadratic in the initial parameter list size. It's also exponential if we ignore inlining limits, since each .then() call expands to two branches - a ready future branch and a non-ready future branch. If the compiler did not give up, we'd have 2^list_len branches. For sure the compiler does not do so indefinitely, but the effort getting there is wasted. Simplify by using a fold expression over the comma operator. Instead of passing the remaining parameter list in each step, we pass only the parameter we are processing now, making processing linear, and not generating unnecessary functions. It would be better expressed using pack expansion statements, but these are part of C++26. The largest offender is probably stats_metadata, with 21 elements. dev-mode sstables.o: text data bss dec hex filename 1760059 1312 7673 1769044 1afe54 sstables.o.before 1745533 1312 7673 1754518 1ac596 sstables.o.after We save about 15k of text with presumably a corresponding (small) decrease in compile time. Closes scylladb/scylladb#26735	2025-11-02 13:09:37 +01:00
Jenkins Promoter	cb30eb2e21	Update pgo profiles - aarch64	2025-11-01 05:23:52 +02:00
Jenkins Promoter	e3a0935482	Update pgo profiles - x86_64	2025-11-01 04:54:49 +02:00
Petr Gusev	88765f627a	paxos_state: get_replica_lock: remove shard check This check is incorrect: the current shard may be looking at the old version of tablets map: * an accept RPC comes to replica shard 0, which is already at write_both_read_new * the new shard is shard 1, so paxos_state::accept is called on shard 1 * shard 1 is still at "streaming" -> shards_ready_for_reads() returns old shard 0 Fixes scylladb/scylladb#26801 Closes scylladb/scylladb#26809	2025-10-31 21:37:39 +01:00
Avi Kivity	7a72155374	Merge 'Introduce nodetool excludenode' from Tomasz Grabiec If a node is dead and cannot be brought back, tablet migrations are stuck, until the node is explicitly marked as "permanently dead" / "ignored node" / "excluded" (name differs in different contexts). Currently, this is done during removenode and replace operations but it should be possible to only mark the node as dead, for the purpose of unblocking migrations or other topology operations, without doing the actual removenode, because full removal might be currently impossible, or not desirable due to lack of capacity or priorities. This patch introduces this kind of API: ``` nodetool excludenode <host-id> [ ... <host-id> ] ``` Having this kind of API is an improvement in user experience in several cases. For example, when we lose a rack, the only viable option for recovery is to run removenode with an extra --ignore-dead-nodes option. This removenode will fail in the tablet draining phase, as there is no live node in the rack to rebuild replicas in. This is confusing to the operator. But necessary before ALTER KEYSPACE can proceed in order to change replication options to drop the rack from RF. Having this API allows operators to have more unified procedures, where "nodetool excludenode" is always the first step of recovery, which unblocks further topology operations, both those which restore capacity, but also auto-scaling, tablet split/merge, load balancing, etc. Fixes #21281 The PR also changes "nodetool status" to show excluded nodes, they have 'X' in their status instead of 'D'. Closes scylladb/scylladb#26659 * github.com:scylladb/scylladb: nodetool: status: Show excluded nodes as having status 'X' test: py: Test scenario involving excludenode API nodetool: Introduce excludenode command	2025-10-31 22:14:57 +02:00
Avi Kivity	d458dd41c6	Merge 'Avoid input_/output_stream-s default initialization and move-assignment' from Pavel Emelyanov Recent seastar update deprecated in/out streams usage pattern when a stream is default constructed early and then move-assigned with the proper one (see scylladb/seastar#3051). This PR fixes a few places in Scylla that still use one. Adopting newer seastar API, no need to backport Closes scylladb/scylladb#26747 * github.com:scylladb/scylladb: commitlog: Remove unused work::r stream variable ec2_snitch: Fix indentation after previous patch ec2_snitch: Coroutinize the aws_api_call_once() sstable: Construct output_stream for data instantly test: Don't reuse on-stack input stream	2025-10-31 21:22:41 +02:00
Avi Kivity	adf9c426c2	Merge 'db/config: Change default SSTable compressor to LZ4WithDictsCompressor' from Nikos Dragazis `sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is LZ4Compressor (inherited from Cassandra). Make LZ4WithDictsCompressor the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios. If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using a listener callback. Fixes #26610. Closes scylladb/scylladb#26697 * github.com:scylladb/scylladb: test/cluster: Add test for default SSTable compressor db/config: Change default SSTable compressor to LZ4WithDictsCompressor db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl boost/cql_query_test: Get expected compressor from config	2025-10-31 21:15:18 +02:00
Lakshmi Narayanan Sreethar	3eb7193458	backlog_controller: compute backlog even when static shares are set The compaction manager backlog is exposed via metrics, but if static shares are set, the backlog is never calculated. As a result, there is no way to determine the backlog and if the static shares need adjustment. Fix that by calculating backlog even when static shares are set. Fixes #26287 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26778	2025-10-31 18:18:36 +02:00
Michał Hudobski	fd521cee6f	test: fix typo in vector_index test Unfortunately in https://github.com/scylladb/scylladb/pull/26508 a typo that changes the behavior of a test was introduced. This patch fixes the typo. Closes scylladb/scylladb#26803	2025-10-31 18:35:02 +03:00
Tomasz Grabiec	284c73d466	scripts: pull_github_pr.sh: Fix auth problem detection Before the patch, the script printed: parse error: Invalid numeric literal at line 2, column 0 Closes scylladb/scylladb#26818	2025-10-31 18:32:58 +03:00
Michael Litvak	e7dbccd59e	cdc: use chunked_vector instead of vector for stream ids use utils::chunked_vector instead of std::vector to store cdc stream sets for tablets. a cdc stream set usually represents all streams for a specific table and timestamp, and has a stream id per each tablet of the table. each stream id is represented by 16 bytes. thus the vector could require quite large contiguous allocations for a table that has many tablets. change it to chunked_vector to avoid large contiguous allocations. Fixes scylladb/scylladb#26791 Closes scylladb/scylladb#26792	2025-10-31 13:02:34 +01:00
Tomasz Grabiec	1c0d847281	Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations. This is the second part of the size based load balancing changes: - First part for tablet size collection via load_stats: #26035 - Second part reconcile load_stats: #26152 - The third part for load_sketch changes: #26153 - The fourth part which performs tablet load balancing based on tablet size: #26254 This is a new feature and backport is not needed. Closes scylladb/scylladb#26152 * github.com:scylladb/scylladb: load_balancer: load_stats reconcile after tablet migration and table resize load_stats: change data structure which contains tablet sizes	2025-10-31 09:58:25 +01:00
Tomasz Grabiec	2bd173da97	nodetool: status: Show excluded nodes as having status 'X' Example: $ build/dev/scylla nodetool status Datacenter: dc1 =============== Status=Up/Down/eXcluded \|/ State=Normal/Leaving/Joining/Moving -- Address Load Tokens Owns Host ID Rack UN 127.0.0.1 783.42 KB 1 ? 753cb7b0-1b90-4614-ae17-2cfe470f5104 rack1 XN 127.0.0.2 785.10 KB 1 ? 92ccdd23-5526-4863-844a-5c8e8906fa55 rack2 UN 127.0.0.3 708.91 KB 1 ? 781646ad-c85b-4d77-b7e3-8d50c34f1f17 rack3	2025-10-31 09:03:20 +01:00
Tomasz Grabiec	87492d3073	test: py: Test scenario involving excludenode API	2025-10-31 09:03:20 +01:00
Tomasz Grabiec	55ecd92feb	nodetool: Introduce excludenode command If a node is dead and cannot be brought back, tablet migrations are stuck, until the node is explicitly marked as "permanently dead" / "ignored node" / "excluded" (name differs in different contexts). Currently, this is done during removenode and replace operations but it should be possible to only mark the node as dead, for the purpose of unblocking migrations or other topology operations, without doing the actual removenode, because full removal might be currently impossible, or not desirable due to lack of capacity or priorities. This patch introduces this kind of API: nodetool excludenode <host-id> [ ... <host-id> ] Having this kind of API is an improvement in user experience in several cases. For example, when we lose a rack, the only viable option for recovery is to run removenode with an extra --ignore-dead-nodes option. This removenode will fail in the tablet draining phase, as there is no live node in the rack to rebuild replicas in. This is confusing to the operator. But necessary before ALTER KEYSPACE can proceed in order to change replication options to drop the rack from RF. Having this API allows operators to have more unified procedures, where "nodetool excludenode" is always the first step of recovery, which unblocks further topology operations, both those which restore capacity, but also auto-scaling, tablet split/merge, load balancing, etc. Fixes #21281	2025-10-31 09:03:20 +01:00
Avi Kivity	04a289cae6	Merge 'Auto expand to rack list' from Tomasz Grabiec We want to move towards rack-list based replication factor for tablets being the default mode, and in the future the only supported mode. This PR is a step towards that. We auto-expand numeric RF to rack list on keyspace creation and ALTER when rf_rack_valid_keyspaces option is enabled. The PR is mostly about adjusting tests. The main logic change is in the last patch, which modifies option post-processing in ks_prop_defs. Fixes #26397 Closes scylladb/scylladb#26692 * github.com:scylladb/scylladb: cql3: ks_prop_defs: Expand numeric RF to rack list locator: Move rack_list to topology.hh alternator: Do not set RF for zero-token DCs alternator: Switch keyspace creation to use ks_prop_defs test: alternator: Adjust for rack lists cql3: Move validation of invalid ALTER KEYSPACE earlier, to ks_prop_defs test: cqlpy: Mark tests using rack lists as scylla-only test: Switch to rack-list based RF test: Generalize tests to work with both numeric RF and rack lists test: cluster: test_zero_token_nodes_multidc: Adjust to rack list RF test: Prepare for handling errors specific to rack list path test: cluster: dtest: alternator: Force RF=1 in test_putitem_contention test: Create cluster with multiple racks in multi-dc setups test: boost: network_topology_strategy_test: Adjust to rack-list RF test: tablets: Adjust to rack list test: cluster: test_group0_schema_versioning: Use smaller RF to respect rf-rack-validness test: tablets_test: Convert test_per_shard_goal_mixed_dc_rf to be rack-valid test: object_store: test_backup: Adjust for rack lists test: cluster: tablets: Do not move tablet across racks in test_tablet_transition_sanity test: cluster: mv: Do not move tablets across racks test: cluster: util: Fix docstring for parse_replication_options() tablets, topology_coordinator: Skip tablet draining on replace	2025-10-30 21:54:08 +02:00
Avi Kivity	c0222e4d3c	Merge 'replica/table: do not stop major compaction when disabling auto compaction' from Lakshmi Narayanan Sreethar When auto compaction is disabled, all ongoing compactions, including major compactions, are stopped. However, major compactions should not be stopped, since the disable request applies only to regular auto compactions. This PR fixes the issue by tagging major compaction tasks with a newly introduced `compaction_type::Major` enum. Since `table::disable_auto_compaction()` already requests the compaction manager to stop only tasks of type `compaction_type::Compaction`, major compactions will no longer be stopped. Fixes #24501 PR improves how the compactions are stopped when a disable auto compaction request is executed. No need to backport Closes scylladb/scylladb#26288 * github.com:scylladb/scylladb: replica/table: do not stop major compaction when disabling auto compaction compaction/compaction_descriptor: introduce compaction_type::Major	2025-10-30 21:45:57 +02:00
Nikos Dragazis	a0bf932caa	test/cluster: Add test for default SSTable compressor The previous patch made the default compressor dependent on the SSTABLE_COMPRESSION_DICTS feature: * LZ4Compressor if the feature is disabled * LZ4WithDictsCompressor if the feature is enabled Add a test to verify that the cluster uses the right default in every case. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-10-30 15:53:54 +02:00
Nikos Dragazis	2fc812a1b9	db/config: Change default SSTable compressor to LZ4WithDictsCompressor `sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is `LZ4Compressor` (inherited from Cassandra). Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios. If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using a listener callback. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-10-30 15:53:49 +02:00
Pavel Emelyanov	395e275e03	Merge 'test/cluster/random_failures: Adjust to RF-rack-validity' from Dawid Mędrek We adjust the test to RF-rack-validity and then re-enable index random events, which requires the configuration option `rf_rack_valid_keyspaces` to be enabled. Fixes scylladb/scylladb#26422 Backport: I'd rather not backport these changes. They're almost a hack and pose too much risk for little gain. Closes scylladb/scylladb#26591 * github.com:scylladb/scylladb: test/cluster/random_failures: Re-enable index events test/cluster/random_failures: Enable rf_rack_valid_keyspaces test/cluster/random_failures: Adjust to RF-rack-validity	2025-10-30 15:39:38 +03:00
Aleksandra Martyniuk	fdd623e6bc	api: storage_service: tasks: unify upgrade_sstable Currently, all apis that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Unify the handlers of /storage_service/keyspace_upgrade_sstables/{keyspace} and /tasks/compaction/keyspace_upgrade_sstables/{keyspace}.	2025-10-30 11:42:48 +01:00
Aleksandra Martyniuk	044b001bb4	api: storage_service: tasks: force_keyspace_cleanup Currently, all apis that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Unify the handlers of /storage_service/keyspace_cleanup/{keyspace} and /tasks/compaction/keyspace_cleanup/{keyspace}.	2025-10-30 11:42:47 +01:00
Aleksandra Martyniuk	12dabdec66	api: storage_service: tasks: unify force_keyspace_compaction Currently, all apis that start a compaction have two versions: synchronous and asynchronous. They share most of the implementation, but some checks and params have diverged. Add consider_only_existing_data parameter to /tasks/compaction/keyspace_compaction/{keyspace}, to match the synchronous version of the api (/storage_service/keyspace_compaction/{keyspace}). Unify the handlers of both apis.	2025-10-30 11:33:17 +01:00
Tomasz Grabiec	6cb14c7793	Revert "tests(lwt): new test for LWT testing during tablet resize" This reverts commit `99dc31e71a`. The test is not stable due to #26801	2025-10-30 08:50:40 +01:00
Piotr Wieczorek	0398bc0056	test/alternator: Enable the tests failing because of #6918 The tests pass only with alternator_streams_strict_compatibility flag enabled, because of a suspected non-negligible performance impact (i.e. an additional entire-item comparison and type conversions). Refs https://github.com/scylladb/scylladb/issues/6918	2025-10-30 08:38:31 +01:00
Piotr Wieczorek	66ac66178b	alternator, cdc: Don't emit events for no-op removes Deletes that don't change the state of the database visible to the user (e.g. an attempt to delete a missing item) shouldn't produce a cdc log. This commit addresses this DynamoDB compatibility issue if the delete is a partition delete, a row delete, or a cell delete. It works under the assumption that the change was produced by Alternator. This means that it doesn't support range deletes, static row deletes, deletes of collection cells other than a map, etc. See also its parent commit, which introduces the methods that this commit extends. This commit handles the following cases: - `DeleteItem of nonexistent item: nothing`, - `BatchWriteItem.DeleteItem of nonexistent item: nothing`. Refs https://github.com/scylladb/scylladb/pull/26121	2025-10-30 08:38:30 +01:00
Piotr Wieczorek	a32e8091a9	alternator, cdc: Don't emit an event for equal items This commit adds a function that compares split mutations with the `row_state` that was selected as a preimage or propagated through cdc options by a caller. If the items are equal, the corresponding log row isn't generated. As a result, creating an item with BatchWriteItem, PutItem, or UpdateItem doesn't emit an INSERT/MODIFY event if an exactly identical item already exists. Comparing the items may be costly, so this logic is controlled by the `alternator_streams_strict_compatibility` flag. This commit handles the following cases: - `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and equal item: nothing`	2025-10-30 08:38:30 +01:00
Piotr Wieczorek	8c2f60f111	alternator/streams, cdc: Differentiate item replace and item update in CDC This commit improves compatibility with DynamoDB streams by changing the emitted events when creating/updating an item. Replace/update operations of an existing item emit a MODIFY, whereas replacing/updating a missing item results in an INSERT. If the state of the item doesn't change after applying the operation, no event is emitted. This commit handles the following cases: - `PutItem/UpdateItem/BatchWriteItem.PutItem of an existing and not equal item: MODIFY` - `PutItem/UpdateItem/BatchWriteItem.PutItem of a nonexistent item: INSERT` Refs https://github.com/scylladb/scylladb/issues/6918	2025-10-30 07:40:31 +01:00
Piotr Wieczorek	4f6aeb7b6b	alternator: Change the return type of rmw_operation_return Change the type from future<executor::request_return_type> to executor::request_return_type, because the method isn't async and one out of two callers unwraps the future immediately. This simplifies the code a little and probably saves a few instructions, since we suspect that moving a future<X> is more expensive than just moving X.	2025-10-30 07:40:31 +01:00
Piotr Wieczorek	ffdc8d49c7	config: Add alternator_streams_strict_compatibility flag With this flag enabled, Alternator Streams produces more accurate event types: - nop operations (i.e. replacing an item with an identical one, deleting a nonexistent item) don't produce an event, - updates of an existing item produce a MODIFY event, instead of INSERT, - etc. This flag affects the internal behaviour of some operations, i.e. Alternator may select a preimage and propagate it to CDC (as opposed to CDC making the request), or do extra item comparisons (i.e. compare the existing item with the new one). These operations may be costly, and users that don't use Streams won't need them. This flag is live-updatable. An operation reads this flag once, and uses its value for the entire operation.	2025-10-30 07:40:31 +01:00
Piotr Wieczorek	e3fde8087a	cdc: Don't split a row marker away from row cells CDC log table records a mutation as a sequence of log rows that record an atomic change (i.e. a row marker, tombstones, etc.), whereas a mutation in Alternator Streams always appears as a single log row. The type of operation is determined based on the type of the last log row in CDC. As a result, updates that create a row always appeared to Alternator Streams as an update (row marker + data), rather than an insert. This commit makes them a single log row. Its operation type is insert if it contains a row marker, and an update otherwise, which gives results consistent with DynamoDB Streams.	2025-10-30 07:40:31 +01:00
Tomasz Grabiec	28f6bdc99b	cql3: ks_prop_defs: Expand numeric RF to rack list Auto-expands numeric RF in CREATE/ALTER KEYSPACE statements for new DCs specified in the statement. Doesn't auto-expand existing options, as the rack choice may not be in line with current replica placement. This requires co-locating tablet replicas, and tracking of co-location state, which is not implemented yet. Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-29 23:32:59 +01:00
Tomasz Grabiec	35166809cb	locator: Move rack_list to topology.hh So that we can use it in locator/tablets.hh and avoid circular dependency between that header and abstract_replication_strategy.hh	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	f6dfea2fb1	alternator: Do not set RF for zero-token DCs That will fail with tablets because it won't be able to allocate replicas.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	21db21af7e	alternator: Switch keyspace creation to use ks_prop_defs So that we get the same validation and option post-processing as during regular keyspace creation. RF auto-expansion logic happens in ks_prop_defs, and we want that for tablets.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	7f66f67d95	test: alternator: Adjust for rack lists To achieve RF=3 with tablets and rf_rack_valid_keyspaces, we need 3 racks. So change the test to create 3 racks. Alternator was bypassing standard keyspace creation path, so it escaped validation. But this will change, and the test will stop working. Also, after auto-expansion of RF to rack list, not all of 4 nodes will host replicas. So need to adjust expectations.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	a88f70ce2c	cql3: Move validation of invalid ALTER KEYSPACE earlier, to ks_prop_defs Tests expect this failure in some scenarios, but later changes make us fail earlier due to topology constraints. As a rule, more general validation should come before more specific validation. So syntax validation before topology validation.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	8e69c65124	test: cqlpy: Mark tests using rack lists as scylla-only Those tests are intended to be also run against Cassandra, which doesn't support rack lists.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	ba53f41f59	test: Switch to rack-list based RF Have to do that before we enable auto-expansion of numeric RF to rack-lists, because those tests alter the replication factor, and altering from rack-list to numeric will not be allowed.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	d2e7d6fad2	test: Generalize tests to work with both numeric RF and rack lists	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	aa05f0fad0	test: cluster: test_zero_token_nodes_multidc: Adjust to rack list RF Two changes here: 1) Allocate nodes in dc2 in separate racks to make the test stronger - it invites bugs where RF==nr_racks succeeds despite there being zero-token nodes, and not simply fail due to rack count. 2) Due to auto-expansion to rack list, scylla throws in keyspace creation rather than table creation.	2025-10-29 23:32:58 +01:00
Benny Halevy	e8b9f13061	test: Prepare for handling errors specific to rack list path	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	255f429a80	test: cluster: dtest: alternator: Force RF=1 in test_putitem_contention With rf_rack_valid_keyspaces enabled, RF of alternator tables will be equal to the number of racks (in this test: nodes). Prior to that, if number of nodes is smaller than 3, alternator creates the keyspace with RF=1. Turns out, with RF=2 the test fails with write timeouts due to contention. Enforce RF=1 by creating the table with one node before adding the second node.	2025-10-29 23:32:58 +01:00
Tomasz Grabiec	40e7543361	test: Create cluster with multiple racks in multi-dc setups To allow auto-expansion of numeric RF to rack list. Otherwise, keyspace creation will be rejected if rf-rack-valid keyspaces are enforced.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	723622cf70	test: boost: network_topology_strategy_test: Adjust to rack-list RF	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	19d0beff38	test: tablets: Adjust to rack list test_decommission_rack_load_failure expects some tablets to land in the rack which only has the decommissioning node. Since the table uses RF=1, auto-expansion may choose the other rack and put all tablets there, and the expected failure will not happen. Force placement by using rack-list RF.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	7ccc2a3560	test: cluster: test_group0_schema_versioning: Use smaller RF to respect rf-rack-validness	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	0f38f7185c	test: tablets_test: Convert test_per_shard_goal_mixed_dc_rf to be rack-valid	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	5962498983	test: object_store: test_backup: Adjust for rack lists With rack lists, not all nodes in a rack will receive streams if RF=1. Adjust expectations.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	3b8a3823db	test: cluster: tablets: Do not move tablet across racks in test_tablet_transition_sanity Choose old_replica and new_replica so that they're both in rack r1. After later changes (rack list auto expansion), it's no longer guaranteed that the first replica will be on r1.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	5bf7112fe6	test: cluster: mv: Do not move tablets across racks It's illegal with rf-rack-valid keyspaces.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	e34548ccdb	test: cluster: util: Fix docstring for parse_replication_options() rack lists are now in replication_v2, which is also parsed with this function.	2025-10-29 23:32:57 +01:00
Tomasz Grabiec	288e75fe22	tablets, topology_coordinator: Skip tablet draining on replace Replace doesn't drain (rebuild) tablets during topology change. They are rebuilt afterwards when the replaced node is in "left" state and replacing node is in normal state. So there is no point in attempting to drain, as nothing will be drained. Not only that, doing so has a risk, because the load balancer is invoked on a transitional topology state in which we can end up with no normal nodes in a rack. That's the case if the replaced node was the last one in the rack. This tripped one of the algorithms which computes rack's shard count for the purpose of determining ideal tablet count; it was not prepared to find an empty rack to which a table is still replicated. That was fixed separately, but to avoid this, we better skip tablet draining here.	2025-10-29 23:32:57 +01:00
Taras Veretilnyk	c922256616	sstables: add overload of data_stream() to accept custom file_input_stream_options This patch introduces a new overload of 'sstable::data_stream()' that allows callers to provide their own 'file_input_stream_options'. This change will be useful in the next commit to enable integrity checking for file streaming.	2025-10-29 22:30:18 +01:00
Nikos Dragazis	96e727d7b9	db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl The option is a knob that allows rejecting dictionary-aware compressors in the validation stage of CREATE/ALTER statements, and in the validation of `sstable_compression_user_table_options`. It was introduced in `7d26d3c7cb` to allow the admins of Scylla Cloud to selectively enable it in certain clusters. For more details, check: https://github.com/scylladb/scylla-enterprise/issues/5435 As of this series, we want to start offering dictionary compression as the default option in all clusters, i.e., treat it as a generally available feature. This makes the knob redundant. Additionally, making dictionary compression the default choice in `sstable_compression_user_table_options` creates an awkward dependency with the knob (disabling the knob should cause `sstable_compression_user_table_options` to fall back to a non-dict compressor as default). That may not be very clear to the end user. For these reasons, mark the option as "Deprecated", remove all relevant tests, and adjust the business logic as if dictionary compression is always available. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-10-29 20:13:08 +02:00
Nadav Har'El	aa34f0b875	alternator: fix CDC events for TTL expiration In commit `a3ec6c7d1d` we supposedly implemented the feature of telling TTL expiration events from regular user-sent deletions. However, that implementation did not actually work at all... It had two bugs: 1. It created a null rjson::value() instead of an empty dictionary rjson::empty_object(), so GetRecords failed every time such a TTL expiration event was generated. 2. In the output, it used lowercase field names "type" and "principalId" instead of the uppercase "Type" and "PrincipalId". This is not the correct capitalization, and when boto3 receives such incorrect fields it silently deletes them and never passes them to the user's get_records() call. This patch fixes those two bugs, and importantly - enables a test for this feature. We did already have such a test but it was marked as "veryslow" so doesn't run in CI and apparently not even run once to check the new feature. This test is not actually very long on Alternator when the TTL period is set very low (as we do in our tests), so I replaced the "veryslow" marker by "waits_for_expiration". The latter marker means that the test is still very slow - as much as half an hour - on DynamoDB - but runs quickly on Scylla in our test setup, and enabled in CI by default. The enabled test failed badly before this patch (a server error during GetRecords), and passes with this patch. Also, the aforementioned commit forgot to remove the paragraph in Alternator's compatibility.md that claims we don't have that feature yet. So we do it now. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26633	2025-10-29 17:08:20 +01:00
Piotr Wieczorek	2812e67f47	cdc: Emit a preimage for non-clustered tables Until this patch, CDC didn't fetch a preimage for mutations containing only a partition tombstone. Therefore, single-row deletions in a table without a clustering key didn't include a preimage, which was inconsistent with single-row clustered deletions. This commit addresses this inconsistency. The second reason is compatibility with DynamoDB Streams, which doesn't support entire-partition deletes. Alternator uses partition tombstones for single-row deletions, though, and in these cases the 'OldImage' was missing from REMOVE records. Fixes https://github.com/scylladb/scylladb/issues/26382 Closes scylladb/scylladb#26578	2025-10-29 17:54:58 +02:00
Nadav Har'El	29ed1f3de7	Merge 'cql3: Refactor vector search select impl into a dedicated class' from Karol Nowacki cql3: Refactor vector search select impl into a dedicated class The motivation for this change is crash fixed in https://github.com/scylladb/scylladb/pull/25500. This commit refactors how ANN ordered select statements are handled to prevent a potential null pointer dereference and improve code organization. Previously, vector search selects were managed by `indexed_table_select_statement`, which unconditionally dereferenced a `view_ptr`. This assumption is invalid for vector search indexes where no view exists, creating a risk of crashes. To address this, the refactoring introduces the following changes: - A new `vector_indexed_table_select_statement` class is created to specifically handle ANN-ordered selects. This class operates without a view_ptr, resolving the null pointer risk. - The `indexed_table_select_statement` is renamed to `view_indexed_table_select_statement` to more accurately reflect its function with view-based indexes. - An assertion has been added to `indexed_table_select_statement` constructor to ensure view_ptr is not null, preventing similar issues in the future. Fixes: VECTOR-162 No backport is needed, as this is refactoring. Closes scylladb/scylladb#25798 * github.com:scylladb/scylladb: cql3: Rename indexed_table_select_statement cql3: Move vector search select to dedicated class	2025-10-29 17:49:24 +02:00
Lakshmi Narayanan Sreethar	7eac18229c	replica/table: do not stop major compaction when disabling auto compaction When auto compaction is disabled, all ongoing compactions, including major compactions, are stopped. However, major compactions should not be stopped, since the disable request applies only to regular auto compactions. This patch fixes the issue by tagging major compaction tasks with the newly introduced `compaction_type::Major`. Since `table::disable_auto_compaction()` already requests the compaction manager to stop only tasks of type `compaction_type::Compaction`, major compactions will no longer be stopped. Fixes #24501 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-29 19:22:07 +05:30
Lakshmi Narayanan Sreethar	4d442f48db	compaction/compaction_descriptor: introduce compaction_type::Major Introduce a new compaction_type enum : `Major`. This type will be used by the next patches to differentiate between major compaction and regular compaction (compaction_type::Compaction). Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-29 19:21:53 +05:30
Nikos Dragazis	d95ebe7058	boost/cql_query_test: Get expected compressor from config Since `5b6570be52`, the default SSTable compression algorithm for user tables is no longer hardcoded; it can be configured via the `sstable_compression_user_table_options.sstable_compression` option in scylla.yaml. Modify the `test_table_compression` test to get the expected value from the configuration. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-10-29 14:52:43 +02:00
Piotr Dulikowski	aba922ea65	Merge 'cdc: improve cdc metadata loading' from Michael Litvak when loading CDC streams metadata for tablets from the tables, read only new entries from the history table instead of reading all entries. This improves the CDC metadata reloading, making it more efficient and predictable. the CDC metadata is loaded as part of group0 reload whenever the internal CDC tables are modified. on tablet split / merge, we create a new CDC timestamp and streams by writing them to the cdc_streams_history table by group0 operation, and when it's applied we reload the in-memory CDC streams map by reading from the tables and constructing the updated map. Previously, on every update, we would read the entire cdc_streams_history entries for the changed table, constructing all its streams and creating a new map from scratch. We improve this now by reading only new entries from cdc_streams_history and append them to the existing map. we can do this because we only append new entries to cdc_streams_history with higher timestamp than all previous entries. This makes this reloading more efficient and predictable, because previously we would read a number of entries that depends on the number of tablets splits and merges, which increases over time and is unbounded, whereas now we read only a single stream set on each update. Fixes https://github.com/scylladb/scylladb/issues/26732 backport to 2025.4 where cdc with tablets is introduced Closes scylladb/scylladb#26160 * github.com:scylladb/scylladb: test: cdc: extend cdc with tablets tests cdc: improve cdc metadata loading	2025-10-29 11:07:48 +01:00
Michał Hudobski	46589bc64c	secondary_index: disallow multiple vector indexes on the same column We currently allow creating multiple vector indexes on one column. This doesn't make much sense as we do not support picking one when making ann queries. To make this less confusing and to make our behavior similar to Cassandra we disallow the creation of multiple vector indexes on one column. We also add a test that checks this behavior. Fixes: VECTOR-254 Fixes: #26672 Closes scylladb/scylladb#26508	2025-10-29 11:55:38 +02:00
Patryk Jędrzejczak	7304afb75a	Merge 'vnodes cleanup: renames and code comments fixes' from Petr Gusev This is a follow-up for https://github.com/scylladb/scylladb/pull/26315. Fixes several review comments that were left unresolved in the original PR. backport: not needed, this PR contains only renames and code comment fixes Closes scylladb/scylladb#26745 * https://github.com/scylladb/scylladb: test_automatic_cleanup: fix comment storage_proxy: remove stale comment storage_proxy: improve run_fenceable_write comment topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup	2025-10-29 10:38:27 +01:00
Nadav Har'El	492c664fbb	docs/alternator: explain alternator_warn_authorization The previous patches added the ability to set alternator_warn_authorization. In this patch we add to our documentation a recommendation that this setting be used as an intermediate step when wanting to change alternator_enforce_authorization from "false" to "true". We explain why this is useful and important. The new documentation is in docs/alternator/compatibility.md, where we previously explained the alternator_enforce_authorization configuration. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-29 11:16:29 +02:00
Nadav Har'El	2dbd1a85a3	test/alternator: tests for new auth failure metrics and log messages This patch adds to test_metrics.py tests that authentication and authorization errors increment, respectively, the new metrics scylla_alternator_authentication_failures scylla_alternator_authorization_failures This patch also adds in test_logs.py tests that verify that log messages are generated on different types of authentication/authorization failures. The tests also check how configuring alternator_enforce_authorization and alternator_warn_authorization changes these behaviors: * alternator_enforce_authorization determines whether an auth error will cause the request to fail, or the failure is counted but then ignored. * alternator_warn_authorization determines whether an auth error will cause a WARN-level log message to be generated (and the failure is also counted). * If both configuration flags are false, Alternator doesn't even attempt to check authentication or authorization - so errors aren't even counted. Because the new tests live-update the alternator_*_authorization configuration options, they also serve as a test that live-updating this option works correctly. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-29 11:16:29 +02:00
Nadav Har'El	51186b2f2c	alternator: add alternator_warn_authorization config Before this patch, the configuration alternator_enforce_authorization is a boolean: true means enforce authentication checks (i.e., each request is signed by a valid user) and authorization checks (the user who signed the request is allowed by RBAC to perform this request). This patch adds a second boolean configuration option, alternator_warn_authorization. When alternator_enforce_authorization is false but alternator_warn_authorization is true, authentication and authorization checks are performed as in enforce mode, but failures are ignored and counted in two new metrics: scylla_alternator_authentication_failures scylla_alternator_authorization_failures Additionally, each authentication or authorization error is logged as a WARN-level log message. Some users prefer those log messages over metrics, as the log messages contain additional information about the failure that can be useful - such as the address of the misconfigured client, or the username attempted in the request. All combinations of the two configuration options are allowed: * If just "enforce" is true, auth failures cause a request failure. The failures are counted, but not logged. * If both "enforce" and "warn" are true, auth failures cause a request failure. The failures are both counted and logged. * If just "warn" is true, auth failures are ignored (the request is allowed to complete) but are counted and logged. * If neither "enforce" nor "warn" is true, no authentication or authorization checks are done at all. So we don't know about failures, so naturally we don't count them and don't log them. This patch is fairly straightforward, doing mainly the following things: 1. Add an alternator_warn_authorization config parameter. 2. Make sure alternator_enforce_authorization is live-updatable (we'll use this in a test in the next patch). It "almost" was, but a typo prevented the live update from working properly. 3. 
Add the two new metrics, and increment them in every type of authentication or authorization error. Some code that needs to increment these new metrics didn't have access to the "stats" object, so we had to pass it around more. 4. Add log messages when alternator_warn_authorization is true. 5. If alternator_enforce_authorization is false, let the request proceed even when the auth check fails (after having counted and/or logged the auth error). A separate patch will follow and add documentation showing users how to use the new "warn" option to safely switch from non-enforcing to enforcing mode. Another patch will add tests for the new configuration options, new metrics and new log messages. Fixes #25308. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-29 11:16:26 +02:00
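The four combinations of `alternator_enforce_authorization` and `alternator_warn_authorization` listed in commit 51186b2f2c can be summarized as a small decision function. This is only an illustrative sketch in Python, not the Alternator implementation: `AuthOutcome` and `auth_failure_outcome` are hypothetical names, and the function models only what the commit message says happens to a request whose auth check would fail.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuthOutcome:
    """What happens to a request whose auth check would fail (hypothetical model)."""
    checked: bool        # was the auth check performed at all?
    request_fails: bool  # is the request rejected?
    counted: bool        # is a failure metric incremented?
    logged: bool         # is a WARN-level message written?


def auth_failure_outcome(enforce: bool, warn: bool) -> AuthOutcome:
    # Neither flag set: no check is done, so nothing is known, counted or logged.
    if not enforce and not warn:
        return AuthOutcome(checked=False, request_fails=False,
                           counted=False, logged=False)
    # Otherwise the check runs and failures are always counted;
    # "enforce" decides rejection, "warn" decides logging.
    return AuthOutcome(checked=True, request_fails=enforce,
                       counted=True, logged=warn)
```

Reading the table this way shows why "warn" alone is the safe intermediate step: failures become visible in metrics and logs while requests still succeed.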
Dawid Mędrek	48cbf6b37a	test/cluster/test_tablets: Migrate dtest We migrate `tablets_test.py::TestTablets::test_moving_tablets_replica_on_node` from dtests to the repository of Scylla. We divide the test into two steps to make testing easier and even possible with RF-rack-valid keyspaces being enforced. Closes scylladb/scylladb#26285	2025-10-29 11:09:48 +02:00
Karol Nowacki	9f1fd7f5a0	cql3: Rename indexed_table_select_statement To align with `vector_indexed_table_select_statement`, this commit renames `indexed_table_select_statement` to `view_indexed_table_select_statement` to clarify its usage with materialized views.	2025-10-29 08:37:25 +01:00
Karol Nowacki	357c0a8218	cql3: Move vector search select to dedicated class The execution of SELECT statements with ANN ordering (vector search) was previously implemented within `indexed_table_select_statement`. This was not ideal, as vector search logic is independent of secondary index selects. This resulted in unnecessary complexity because vector search queries don't use features like aggregates or paging. More importantly, `indexed_table_select_statement` assumed a non-null `view_schema` pointer, which doesn't hold for vector indexes (where `view_ptr` is null). This caused null pointer dereferences during ANN ordered selects, leading to crashes (VECTOR-179). Other parts of the class still dereference `view_schema` without null checks. Moving the vector search select logic out of `indexed_table_select_statement` simplifies the code and prevents these null pointer dereferences.	2025-10-29 08:37:21 +01:00
Taras Veretilnyk	e62ebdb967	table: enable integrity checks for streaming reader Previously, streaming readers only verified the checksum of compressed SSTables. This patch extends checks to also include the digest and the uncompressed checksum (CRC). These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest. If the reader range doesn't cover the full SSTable, the digest check is skipped.	2025-10-28 19:27:35 +01:00
Taras Veretilnyk	06e1b47ec6	table: Add integrity option to table::make_sstable_reader()	2025-10-28 19:27:35 +01:00
Taras Veretilnyk	deb8e32e86	sstables: Add integrity option to create_single_key_sstable_reader Added an sstables::integrity_check parameter to create_single_key_sstable_reader methods across its implementations. This allows callers to enable SSTable integrity checks during single-key reads.	2025-10-28 19:27:35 +01:00
Petr Gusev	b6bcd062de	test_automatic_cleanup: fix comment	2025-10-28 17:55:20 +01:00
Petr Gusev	d49be677d5	storage_proxy: remove stale comment	2025-10-28 17:55:20 +01:00
Petr Gusev	c60223f009	storage_proxy: improve run_fenceable_write comment	2025-10-28 17:55:20 +01:00
Petr Gusev	58d100a0cb	topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes	2025-10-28 17:55:20 +01:00
Petr Gusev	fa9dc71f30	storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed	2025-10-28 17:55:19 +01:00
Pavel Emelyanov	e99c8eee08	commitlog: Remove unused work::r stream variable Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:46:29 +03:00
Dawid Mędrek	5e03b01107	test/cluster: Add test_simulate_upgrade_legacy_to_raft_listener_registration We provide a reproducer test of the bug described in scylladb/scylladb#18049. It should fail before the fix introduced in scylladb/scylladb@7ea6e1ec0a, and it should succeed after it. Refs scylladb/scylladb#18049 Fixes scylladb/scylladb#18071 Closes scylladb/scylladb#26621	2025-10-28 17:32:15 +01:00
Pavel Emelyanov	92462e502f	ec2_snitch: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:31:08 +03:00
Pavel Emelyanov	7640ade04d	ec2_snitch: Coroutinize the aws_api_call_once() The method connects a socket, grabs in/out streams from it then writes HTTP request and reads+parses the response. For that it uses class variables for socket and streams, but there's no real need for that -- all three actually exist throughout the method "lifetime". To fix it, coroutinize the method. The same could be achieved by moving the connected socket and streams into do_with() context, but coroutine is better than that. (indentation is left broken) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:29:25 +03:00
Pavel Emelyanov	5d89816fed	sstable: Construct output_stream for data instantly This change constructs the local output_stream variable in the declaration statement with the help of the ternary operator, thus avoiding both the default-initialization and the move-assignment that depended on a standalone condition check. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:27:22 +03:00
Pavel Emelyanov	37b9cccc1c	test: Don't reuse on-stack input stream The test consists of several snippets, each creating an input_stream for some short operation and checking the result. Each snippet overwrites the local `input_stream in` variable with the new one. This change wraps each of those snippets into its own code block so that each has its own new `input_stream in` variable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-28 19:25:07 +03:00
Yauheni Khatsianevich	99dc31e71a	tests(lwt): new test for LWT testing during tablet resize – Workload: N workers perform CAS updates UPDATE … SET s{i}=new WHERE pk=? IF (∀j≠i: s{j}>=guard_j) AND s{i}=prev at CL=LOCAL_QUORUM / SERIAL=LOCAL_SERIAL. Non-apply without timeout is treated as contention; “uncertainty” timeouts are resolved via LOCAL_SERIAL read. - Enable balancing and increase min_tablet_count to force split, flush and lower min_tablet_count to merge. - “Uncertainty” timeouts (write timeout due to uncertainty) are resolved via a LOCAL_SERIAL read to determine whether the CAS actually applied. - Invariants: after the run, for every pk and column s{i}, the stored value equals the number of confirmed CAS by worker i (no lost or phantom updates) despite ongoing tablet moves. Closes scylladb/scylladb#26113	2025-10-28 16:48:57 +01:00
Petr Gusev	d300adc10c	storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup This cleanup is only for vnodes-based tables, reflect this in the function name.	2025-10-28 15:37:28 +01:00
Michael Litvak	4cc0a80b79	test: cdc: extend cdc with tablets tests extend and improve the tests of virtual tables for cdc with tablets. split the existing virtual tables test to one test that validates the virtual tables against the internal cdc tables, and triggering some tablet splits in order to create entries in the cdc_streams_history table, and add another test with basic validation of the virtual tables when there are multiple cdc tables.	2025-10-28 15:06:21 +01:00
Pavel Emelyanov	ae0136792b	utils: Make directory_lister use generator lister from seastar The directory_lister uses utils::lister under the hood which accepts a callback to put directory_entry-s in. The directory_lister's callback then puts the entries into a queue and its .get() method pops up entries from there to return to caller. This patch simplifies this code by switching the directory_lister to use experimental generator lister from seastar. With it, the entries to be returned from .get() are simply co_await-ed from calling the generator object (which co_yield-s them). As a result the directory_lister becomes smaller and drops the need for utils::lister. Since directory_lister was created as a replacement for that callback-based lister, the latter can be eventually removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26586	2025-10-28 15:20:20 +02:00
Pavel Emelyanov	948cefa5f9	test: Extend API consistency test with tokens_endpoint endpoint Recently (#26231) a test was added to check that several API endpoints, that return tokens and corresponding replica nodes, are consistent with tablet map. This patch adds one more API endpoint to the validation -- the /storage_service/tokens_endpoint one. The extension is pretty straightforward, but the new endpoint returns a single (primary) replica for a token, so the test check is slightly modified to account for that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26580	2025-10-28 15:18:09 +02:00
Dawid Mędrek	535d31b588	test/cluster/random_failures: Re-enable index events We've enabled the configuration option `rf_rack_valid_keyspaces`, so we can finally re-enable the events creating and dropping secondary indexes. Fixes scylladb/scylladb#26422	2025-10-28 14:17:14 +01:00
Dawid Mędrek	b4898e50bf	test/cluster/random_failures: Enable rf_rack_valid_keyspaces Now that the test has been adjusted to work with the configuration option, we enable it.	2025-10-28 14:17:09 +01:00
Dawid Mędrek	59b2a41c49	test/cluster/random_failures: Adjust to RF-rack-validity We adjust the test to work with the configuration option `rf_rack_valid_keyspaces` enabled. For that, we ensure that there is always at least one node in each of the three racks. This way, all keyspaces we create and manipulate will remain RF-rack-valid since they all use RF=3. ------------------------------------------------------------------------ To achieve that, we only need to adjust the following events: 1. `init_tablet_transfer` The event creates a new keyspace and table and manually migrates a tablet belonging to it. As long as we make sure the migration occurs within the same rack, there will be no problem. Since RF == #racks, each rack will have exactly one tablet replica, so we can migrate the tablet to an arbitrary node in the same rack. Note that there must exist a node that's not a replica. If there weren't such a node, the test wouldn't have worked before this commit because it's not possible to migrate a tablet from one node being its replica to another. In other words, we have a guarantee that there are at least 4 nodes in the cluster when we try to migrate a tablet replica. That said, we check it anyway. If there's no viable node to migrate the tablet replica to, we log that information and do nothing. That should be an acceptable solution. 2. `add_new_node` As long as we add a node to an existing rack, there's no way to violate the invariant imposed by the configuration option, so we pick a random rack out of the existing three and create a node in it. 3. `decommission_node` We need to ensure that the node we'll be trying to decommission is not the only one in its rack. Following pretty much the same reasoning as in `init_tablet_transfer`, we conclude there must be a rack with at least two nodes in it. Otherwise we'd end up having to migrate a tablet from one replica node to another, which is not possible. 
What's more, decommissioning a node is not possible if any node in the cluster is dead, so we can assume that `manager.running_servers` returns the whole cluster. 4. `remove_node` The same as `decommission_node`. Just note that although the node we choose to remove must first be stopped, no other node can be dead, so the whole cluster must be returned by `manager.running_servers`. ------------------------------------------------------------------------ There's one more important thing to note. The test may sometimes trigger a sequence of events where a new node is started, but, due to an error injection, its initialization is not completed. Among other things, the node may NOT have a host ID recognized by the rest of the nodes in the cluster, and operations like tablet migration will fail if they target it. Thankfully, there seems to be a way to avoid problems stemming from that. When a new node is added to the cluster, it should appear at the end of the list returned by `manager.running_servers`. This most likely stems from how dictionaries work in Python: "Keys and values are iterated over in insertion order." -- https://docs.python.org/3/library/stdtypes.html#dict-views and the fact that we keep track of running servers using a dictionary. Furthermore, we rely on the assumption that the test currently works correctly. Assume, to the contrary, that among the nodes taking part in the operations listed above, there is at most one node per rack that has its host ID recognized by the rest of the cluster. Note that only those nodes can store any tablets. Let's refer to the set of those nodes as X. Assume that we're dealing with tablet migration, decommissioning, or removing a node. Since those operations involve tablet migration, at least one tablet will need to be migrated from the node in question to another node in X. 
However, since X consists of at most three nodes, and one of them is losing its tablet, there is no viable target for the tablet, so the operation fails. Using those assumptions, an auxiliary function, `select_viable_rack`, was designed to carefully choose a correct rack, which we'll then pick nodes from to perform the topological operations. It's simple: we just find the first rack in the list that has at least two nodes in it. That should ensure that we perform an operation that doesn't lead to any unforeseen disaster. ------------------------------------------------------------------------ Since the test effectively becomes more complex due to more care for keeping the topology of the cluster valid, we extend the log messages to make them more helpful when debugging a failure.	2025-10-28 14:15:57 +01:00
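The rack selection rule described in commit 59b2a41c49 ("find the first rack in the list that has at least two nodes in it") can be sketched as follows. This is not the test's actual helper: the `Server` tuple and the choice to order racks by first appearance in the server list are assumptions made for illustration.

```python
from typing import NamedTuple, Optional, Sequence


class Server(NamedTuple):
    """Hypothetical stand-in for the server records the test manager returns."""
    server_id: int
    rack: str


def select_viable_rack(servers: Sequence[Server]) -> Optional[str]:
    """Return the first rack (in order of first appearance) with >= 2 nodes."""
    counts: dict[str, int] = {}
    for server in servers:
        counts[server.rack] = counts.get(server.rack, 0) + 1
    # dicts iterate in insertion order, so racks are visited in the order
    # in which their first node appears in `servers`.
    for rack, count in counts.items():
        if count >= 2:
            return rack
    return None  # no rack can spare a node; the caller should skip the event
```

Returning `None` mirrors the commit's stated fallback of logging the situation and doing nothing when no viable node exists.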
Nadav Har'El	87573197d4	test/alternator: reproducers for missing headers and request limit This patch adds reproducing tests in test/alternator for issue #23438, which is about missing checks for the length of headers and the URL in Alternator requests. These should be limited, because Seastar's HTTP server, which Scylla uses, reads them into memory so they can OOM Scylla. The tests demonstrate that DynamoDB enforces a 16 KB limit on the headers and the URL of the request, but Scylla doesn't (a code inspection suggests it does not in fact have any limit). The two tests pass on DynamoDB and currently xfail on Alternator. Refs #23438. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23442	2025-10-28 15:12:25 +02:00
Pavel Emelyanov	d9bfbeda9a	lister: Fix race between readdir and stat Sometimes file::list_directory() returns entries without type set. In that case lister calls file_type() on the entry name to get it. In case the call returns disengaged type, the code assumes that some error occurred and resolves into exception. That's not correct. The file_type() method returns disengaged type only if the file being inspected is missing (i.e. on ENOENT errno). But this can validly happen if a file is removed between readdir and stat. In that case it's not "some error happened", but the entry should just be skipped. If some error did happen, file_type() would resolve into an exceptional future on its own. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26595	2025-10-28 15:10:22 +02:00
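The readdir/stat race fixed in commit d9bfbeda9a is not specific to Seastar. A minimal Python analogue, assuming a hypothetical `list_entries` helper with an injectable `lstat` function (so the race can be simulated), shows the same rule: a missing file is skipped, while any other error still propagates.

```python
import os
import stat
from typing import Callable, Iterator, Tuple


def list_entries(
    path: str,
    lstat: Callable[[str], os.stat_result] = os.lstat,
) -> Iterator[Tuple[str, str]]:
    """Yield (name, kind) for each entry, tolerating concurrent removals."""
    for name in os.listdir(path):            # the "readdir" step
        full = os.path.join(path, name)
        try:
            st = lstat(full)                 # the "stat" step
        except FileNotFoundError:
            # ENOENT: the entry vanished between readdir and stat.
            # This is not an error, so skip the entry.
            continue
        # Any other OSError (EACCES, EIO, ...) is a real error and propagates.
        yield name, "dir" if stat.S_ISDIR(st.st_mode) else "file"
```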
Botond Dénes	ac618a53f4	Merge 'db: repair: do not update repair_time if batchlog replay failed' from Aleksandra Martyniuk Currently, batchlog replay is considered successful even if all batches fail to be sent (they are replayed later). However, repair requires all batches to be sent successfully. Currently, if batchlog isn't cleared, the repair never learns of it and updates the repair_time anyway. If GC mode is set to "repair", this means that the tombstones written before the repair_time (minus propagation_delay) can be GC'd while not all batches were replayed. Consider a scenario: - Table t has a row with (pk=1, v=0); - There is an entry in the batchlog that sets (pk=1, v=1) in table t; - The row with pk=1 is deleted from table t; - Table t is repaired: - batchlog replay fails; - repair_time is updated; - propagation_delay seconds pass and the tombstone of pk=1 is GC'd; - batchlog is replayed and (pk=1, v=1) inserted - data resurrection! Do not update repair_time if sending any batch fails. The data is still repaired. For tablet repair the repair runs, but at the end the exception is passed to topology coordinator. Thanks to that the repair_time isn't updated. The repair request isn't removed either, so the repair will need to rerun. Apart from that, a batch is removed from the batchlog if its version is invalid or unknown. The condition on which we consider a batch too fresh to replay is updated to consider propagation_delay. 
Fixes: https://github.com/scylladb/scylladb/issues/24415 Data resurrection fix; needs backport to all versions Closes scylladb/scylladb#26319 * github.com:scylladb/scylladb: db: fix indentation test: add reproducer for data resurrection repair: fail tablet repair if any batch wasn't sent successfully db/batchlog_manager: fix making decision to skip batch replay db: repair: throw if replay fails db/batchlog_manager: delete batch with incorrect or unknown version db/batchlog_manager: coroutinize replay_all_failed_batches	2025-10-28 14:52:59 +02:00
Botond Dénes	f3cec5f11a	Merge 'index: Set tombstone_gc when creating underlying view' from Dawid Mędrek Before this commit, when the underlying materialized view was created, it didn't have the property `tombstone_gc` set to any value. We fix the bug in this PR. Implementation strategy: 1. Move code responsible for producing the schema of a secondary index to the file that handles `CREATE INDEX`. 2. Set the property when creating the view. 3. Add reproducer tests. Fixes scylladb/scylladb#26542 Backport: we can discuss it. Closes scylladb/scylladb#26543 * github.com:scylladb/scylladb: index: Set tombstone_gc when creating secondary index index: Make `create_view_for_index` method of `create_index_statement` index: Move code for creating MV of secondary index to cql3 db, cql3: Move creation of underlying MV for index	2025-10-28 14:42:42 +02:00
Nadav Har'El	c3593462a4	alternator: improve protection against oversized requests Following DynamoDB, Alternator also places a 16 MB limit on the size of a request. Such a limit is necessary to avoid running out of memory - because the AWS message authentication protocol requires reading the entire request into memory before its signature can be verified. Our implementation for this limit used Seastar's HTTP server's content_length_limit feature. However, this Seastar feature is incomplete - it only works when the request uses the Content-Length header, and doesn't do anything if the request doesn't have a Content-Length (it may use chunked encoding, or have no length at all). So malicious users can cause Scylla to OOM by sending a huge request without a Content-Length. So in this patch we stop using the incomplete Seastar feature, and implement the length limit in Scylla in a way that works correctly with or without Content-Length: We read from the input stream and if we go over 16MB, we generate an error. Because we dropped Seastar's protection against a long Content-Length, we also need to fix a piece of code which used Content-Length to reserve some semaphore units to prevent reading many large requests in parallel. We fix two problems in the code: 1. If Content-Length is over the limit, we shouldn't attempt to reserve semaphore units - this should just be a Payload Too Large error. 2. If Content-Length is missing, the existing code did nothing and had a TODO that we should. In this patch we implement what was suggested in that TODO: We temporarily reserve the whole 16 MB limit, and after reading the actual request, we return part of the reservation according to the real request size. That last fix is important, because typically the largest requests will be BatchWriteItem where a well-written client would want to use chunked encoding, not Content-Length, to avoid materializing the entire request up-front. 
For such clients, the memory use semaphore did nothing, and now it does the right thing. Note that this patch does not solve the problem #12166 that existed with Seastar's length-limiting implementation but still exists in the new in-Scylla length-limiting implementation: The fact we send an error response in the middle of the request and then close the connection, while the client continues to send the request, can lead to an RST being sent by the server kernel. Usually this will be fine - well-written client libraries will be able to read the response before the RST. But even with a well-written library in some rare timings the client may get the RST before the response, and will miss the response, and get an empty or partial response or "connection reset by peer". This issue existed before this patch, and still exists, but is probably of minor impact. Fixes #8196 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#23434	2025-10-28 15:24:46 +03:00
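The two mechanisms described in commit c3593462a4 (a length limit that works with or without Content-Length, and a memory reservation that is trimmed after the real size is known) can be sketched in Python. All names here are hypothetical, the chunk iterable stands in for the HTTP input stream, and the one-unit-per-byte reservation is an assumption; the real code may size its semaphore units differently.

```python
from typing import Iterable, Optional

MAX_REQUEST_SIZE = 16 * 1024 * 1024  # the 16 MB limit named in the commit


class PayloadTooLarge(Exception):
    """Stands in for an HTTP 413 Payload Too Large response."""


def units_to_reserve(content_length: Optional[int],
                     limit: int = MAX_REQUEST_SIZE) -> int:
    """How much memory to reserve before reading the body."""
    if content_length is None:
        return limit                  # unknown size: reserve the whole limit
    if content_length > limit:
        raise PayloadTooLarge()       # reject without reserving anything
    return content_length


def read_body(chunks: Iterable[bytes], limit: int = MAX_REQUEST_SIZE) -> bytes:
    """Read the body chunk by chunk, failing as soon as the limit is exceeded."""
    body = bytearray()
    for chunk in chunks:
        if len(body) + len(chunk) > limit:
            raise PayloadTooLarge()
        body += chunk
    return bytes(body)


def units_to_return(reserved: int, actual_size: int) -> int:
    """Part of the reservation to give back once the real size is known."""
    return max(reserved - actual_size, 0)
```

For a chunked request of 1 MB, this model reserves 16 MB up front and returns 15 MB after the body has been read, which is the behavior the commit describes for clients that do not send Content-Length.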
Lakshmi Narayanan Sreethar	64c1ec99e0	cmake: link crypto lib to utils The utils library requires OpenSSL's libcrypto for cryptographic operations and without linking libcrypto, builds fail with undefined symbol errors. Fix that by linking `crypto` to `utils` library when compiled with cmake. The build files generated with configure.py already have `crypto` lib linked, so they do not have this issue. Fix #26705 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26707	2025-10-28 14:11:03 +02:00
Ferenc Szili	10f07fb95a	load_balancer: load_stats reconcile after tablet migration and table resize This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to issue migrations which improve load balance.	2025-10-28 12:12:09 +01:00
Anna Stuchlik	6fa342fb18	doc: add OS support for version 2025.4 Fixes https://github.com/scylladb/scylladb/issues/26450 Closes scylladb/scylladb#26616	2025-10-28 13:29:40 +03:00
Radosław Cybulski	ea6b22f461	Add max trace size output configuration variable In #24031 users complained that the trace message is truncated: it is no longer JSON-parsable and the table name might not be part of the output. This patch enables users to configure the maximum size of the trace message. For users who want the `table` name but don't care about message size, #26634 will help. - add configuration variable `alternator_max_users_query_size_in_trace_output` with default value of 4096 (4 times old default value). - modify `truncated_content_view` function to use new configuration variable for truncation limit - update `truncated_content_view` to consistently truncate at given size, previously truncation would also happen when data arrived in more than one chunk - update `truncated_content_view` to better handle truncated value (limit number of copies) - fix `scylla_config_read` call - a call to `query` for a nonexistent configuration name returns an empty (but present) `Items` array, which would raise an array access exception a few lines below. - add test Refs #26634 Refs #24031 Closes scylladb/scylladb#26618	2025-10-28 13:29:15 +03:00
Pavel Emelyanov	ac1d709709	Merge 'tablet_sstable_streamer: defer SSTable unlinking until fully streamed' from Taras Veretilnyk When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces unlink() with mark_for_deletion() deferring actual unlinking till sstable::close_files. test_tablets2::test_tablet_load_and_stream was enhanced to also verify that SSTables are removed after being streamed. Fixes #26606 Backport is not required, although it is a bug fix, but it isn't visible. This is more of a preparatory fix for https://github.com/scylladb/scylladb/pull/26444. Closes scylladb/scylladb#26622 * github.com:scylladb/scylladb: test_tablets2: verify SSTable cleanup after tablet load and stream tablet_sstable_streamer: replace unlink() call with mark_for_deletion()	2025-10-28 13:25:40 +03:00
Patryk Jędrzejczak	5321720853	test: test_raft_recovery_stuck: reconnect driver after rolling restarts It turns out that #21477 wasn't sufficient to fix the issue. The driver may still decide to reconnect the connection after `rolling_restart` returns. One possible explanation is that the driver sometimes handles the DOWN notification after all nodes consider each other UP. Reconnecting the driver after restarting nodes seems to be a reliable workaround that many tests use. We also use it here. Fixes #19959 Closes scylladb/scylladb#26638	2025-10-28 13:24:23 +03:00
Anna Stuchlik	bd5b966208	doc: add --list-active-releases to Web Installer Fixes https://github.com/scylladb/scylladb/issues/26688 V2 of https://github.com/scylladb/scylladb/pull/26687 Closes scylladb/scylladb#26689	2025-10-28 13:21:57 +03:00
Pavel Emelyanov	54a117b19d	Merge 'retry_strategy: Switch to using seastar's retry_strategy (take two)' from Ernest Zaslavsky With the recent introduction of retry_strategy to Seastar, the pure virtual class previously defined in ScyllaDB is now redundant. This change allows us to streamline our codebase by directly inheriting from Seastar’s implementation, eliminating duplication in ScyllaDB. Although this update is purely a refactoring effort and does not introduce functional changes, it should be backported to 2025.3 and 2025.4; otherwise future backports of bugfixes/improvements related to `s3_client` will be nearly impossible ref: https://github.com/scylladb/seastar/issues/2803 depends on: https://github.com/scylladb/seastar/pull/2960 Closes scylladb/scylladb#25801 * github.com:scylladb/scylladb: s3_client: remove unnecessary `co_await` in `make_request` s3 cleanup: remove obsolete retry-related classes s3_client: remove unused `filler_exception` s3_client: fix indentation s3_client: simplify chunked download error handling using `make_request` s3_client: reformat `make_request` functions for readability s3_client: eliminate duplication in `make_request` by using overload s3_client: reformat `make_request` function declarations for readability s3_client: reorder `make_request` and helper declarations s3_client: add `make_request` override with custom retry and error handler s3_client: migrate s3_client to Seastar HTTP client s3_client: fix crash in `copy_s3_object` due to dangling stream s3_client: coroutinize `copy_s3_object` response callback aws_error: handle missing `unexpected_status_error` case s3_creds: use Seastar HTTP client with retry strategy retry_strategy: add exponential backoff to `default_aws_retry_strategy` retry_strategy: introduce Seastar-based retry strategy retry_strategy: update CMake and configure.py for new strategy retry_strategy: rename `default_retry_strategy` to `default_aws_retry_strategy` retry_strategy: fix include 
retry_strategy: Copied utils/s3/retry_strategy.hh to utils/s3/default_aws_retry_strategy.hh retry_strategy: Copied utils/s3/retry_strategy.cc to utils/s3/default_aws_retry_strategy.cc	2025-10-28 13:08:42 +03:00
Patryk Jędrzejczak	820c8e7bc4	Merge 'LWT: use shards_ready_for_reads for replica locks' from Petr Gusev When a tablet is migrated between shards on the same node, during the write_both_read_new state we begin switching reads to the new shard. Until the corresponding global barrier completes, some requests may still use write_both_read_old erm, while others already use the write_both_read_new erm. To ensure mutual exclusion between these two types of requests, we must acquire locks on both the old and new shards. Once the global barrier completes, no requests remain on the old shard, so we can safely switch to acquiring locks only on the new shard. The idea came from the similar locking problem in the [counters for tablets PR](https://github.com/scylladb/scylladb/pull/26636#discussion_r2463932395). Fixes scylladb/scylladb#26727 backport: need to backport to 2025.4 Closes scylladb/scylladb#26719 * https://github.com/scylladb/scylladb: paxos_state: use shards_ready_for_reads paxos_state: inline shards_for_writes into get_replica_lock	2025-10-28 10:37:53 +01:00
Avi Kivity	d81796cae3	Merge 'Limit concurrent view updates from all sources' from Wojciech Mitros Before this patch, when a base table has many materialized views, each write to this table can start up to 128 view updates in parallel. With high client write concurrency, the actual concurrency of writes executed on the node may grow unexpectedly, which can lead to higher latency and higher memory usage compared to a sequential approach. In this patch we add a per-shard, per-service-level semaphore which limits the number of concurrent view updates processed on the shard in this service level to a constant value. We take one unit from the semaphore for each local view update write, and release it when it finishes. The remote view updates do not take units from the semaphore because they don't consume nearly as much processing power and they are limited by another semaphore based on their memory usage. Fixes https://github.com/scylladb/scylladb/issues/25341 Closes scylladb/scylladb#25456 * github.com:scylladb/scylladb: mv: limit concurrent view updates from all sources database: rename _view_update_concurrency_sem to _view_update_memory_sem	2025-10-28 11:13:24 +02:00
Aleksandra Martyniuk	910cd0918b	locator: use get_primary_replica for get_primary_endpoints Currently, tablet_sstable_streamer::get_primary_endpoints is out of sync with tablet_map::get_primary_replica. The get_primary_replica optimizes the choice of the replica so that the work is fairly distributes among nodes. Meanwhile, get_primary_endpoints always chooses the first replica. Use get_primary_replica for get_primary_endpoints. Fixes: https://github.com/scylladb/scylladb/issues/21883. Closes scylladb/scylladb#26385	2025-10-28 09:56:08 +02:00
Michael Litvak	8743422241	cdc: improve cdc metadata loading when loading CDC streams metadata for tablets from the tables, read only new entries from the history table instead of reading all entries. This improves the CDC metadata reloading, making it more efficient and predictable. the CDC metadata is loaded as part of group0 reload whenever the internal CDC tables are modified. on tablet split / merge, we create a new CDC timestamp and streams by writing them to the cdc_streams_history table by group0 operation, and when it's applied we reload the in-memory CDC streams map by reading from the tables and constructing the updated map. Previously, on every update, we would read the entire cdc_streams_history entries for the changed table, constructing all its streams and creating a new map from scratch. We improve this now by reading only new entries from cdc_streams_history and append them to the existing map. we can do this because we only append new entries to cdc_streams_history with higher timestamp than all previous entries. This makes this reloading more efficient and predictable, because previously we would read a number of entries that depends on the number of tablets splits and merges, which increases over time and is unbounded, whereas now we read only a single stream set on each update. Fixes scylladb/scylladb#26732	2025-10-28 08:54:09 +01:00
Wojciech Mitros	f07a86d16e	mv: limit concurrent view updates from all sources Before this patch, when a base table has many materialized views, each write to this table can start up to 128 view updates in parallel. With high client write concurrency, the actual concurrency of writes executed on the node may grow unexpectedly, which can lead to higher latency and higher memory usage compared to a sequential approach. In this patch we add a per-shard, per-service-level semaphore which limits the number of concurrent view updates processed on the shard in this service level to a constant value. We take one unit from the semaphore for each local view update write and release it when the write finishes. The remote view updates do not take units from the semaphore because they don't consume nearly as much processing power and they are limited by another semaphore based on their memory usage. The effect of this patch can also be observed when writing to a base table with a large number of materialized views, like in the materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent dtest. In that test, if we perform a full scan in parallel to a write workload with a concurrency of 100 to a table with 100 views, the scan would sometimes time out because it would effectively get 1/10000 of cpu. With this patch, the cpu concurrency of view updates was limited to 128 (we ran both writes and scan in the same service level), and the scan no longer timed out. Fixes https://github.com/scylladb/scylladb/issues/25341	2025-10-27 18:55:41 +01:00
Pavel Emelyanov	81f598225e	error_injection: Add template parameter default in release mode The std::optional<T> inject_parameter(...) method is a template, and in dev/debug modes this parameter is defaulted to std::string_view, but for release mode it's not. This patch makes it symmetrical. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26706	2025-10-27 16:39:22 +01:00
Taras Veretilnyk	1361ae7a0a	test_tablets2: verify SSTable cleanup after tablet load and stream Modify existing test_tablet_load_and_stream testcase to verify that SSTable files are properly deleted from the upload directory after streaming.	2025-10-27 16:36:08 +01:00
Taras Veretilnyk	517a4dc4df	tablet_sstable_streamer: replace unlink() call with mark_for_deletion() When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces the unlink() call with mark_for_deletion().	2025-10-27 16:30:05 +01:00
Petr Gusev	478f7f545a	paxos_state: use shards_ready_for_reads Acquiring locks on both shards for the entire tablet migration period is redundant. In most cases, locking only the old shard or only the new shard is sufficient. Using shards_ready_for_reads reduces the situations in which we need to lock both shards to: * intra-node migrations only * only during the write_both_read_new state Once the global barrier completes in the write_both_read_new state, no requests remain on the old shard, so we can safely acquire locks only on the new shard. Fixes scylladb/scylladb#26727	2025-10-27 16:22:28 +01:00
Piotr Dulikowski	fd966ec10d	Merge 'cdc: garbage collect CDC streams for tablets' from Michael Litvak introduce helper functions that can be used for garbage collecting old cdc streams for tablets-based keyspaces. add a background fiber to the topology coordinator that runs periodically and checks for old CDC streams for tablets keyspaces that can be garbage collected. the garbage collection works by finding the newest cdc timestamp that has been closed for more than the configured cdc TTL, and removing all information from the cdc internal tables about cdc timestamps and streams up to this timestamp. in general it should be safe to remove information about these streams because they are closed for more than TTL, therefore all rows that were written to these streams with the configured TTL should be dead. the exception is if the TTL is altered to a smaller value, and then we may remove information about streams that still have live rows that were written with the longer ttl. Fixes https://github.com/scylladb/scylladb/issues/26669 Closes scylladb/scylladb#26410 * github.com:scylladb/scylladb: cdc: garbage collect CDC streams periodically cdc: helpers for garbage collecting old streams for tablets	2025-10-27 16:16:55 +01:00
Michał Hudobski	541b52cdbf	cql: fail with a better error when null vector is passed to ann query Currently when a null vector is passed to an ANN query we fail with a quite confusing error ("NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.0.0.1:9042 datacenter1>: <Error from server: code=0000 [Server error] message="to_bytes() called on raw value that is null">})"). This patch fixes that by throwing an InvalidRequestException with an appropriate message instead. We also add a test case that validates this behavior. Fixes: VECTOR-257 Closes scylladb/scylladb#26510	2025-10-27 16:09:08 +02:00
Botond Dénes	417270b726	Merge 'Port dtest EAR tests to test.py/pytest in scylla CI' from Calle Wilund Fixes #26641 * Adds shared abstraction for dockerized mock services for our pytests (not using python docker, due to both library and podman) * Adds test fixtures for our key providers (except GCS KMS, for which we have no mock server) to do local testing * Ports (and prunes and sharpens) the test cases from dtest::encryption_at_rest_test to our pytest. * Shares the KMIP mock between boost test and pytest and speeds up boost test shutdown. When merged, the dtest counterpart can be decommissioned. Closes scylladb/scylladb#26642 * github.com:scylladb/scylladb: test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper test::cluster::test_encryption: Port dtest EAR tests test::cluster::conftest: Add key_provider fixture test::pylib::encryption_provider: Port dtest encryption provider classes test::pylib::dockerized_service: Add helper for running docker/podman test::pylib::kmip_wrapper: Modify to be usable by pytest fixtures test::boost::kmip_wrapper: Move python script for PyKMIP to pylib	2025-10-27 15:42:52 +02:00
Patryk Jędrzejczak	e1c3f666c9	Merge 'vnode cleanup: add missing barriers and fix race conditions' from Petr Gusev Problems addressed by this PR * Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up. * Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations. * `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details). * Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`: * The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited. * Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between. * It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them. * It also waited for writes with versions greater than the current `fence_version`, which is unnecessary. 
Fixes scylladb/scylladb#26150 backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting Closes scylladb/scylladb#26315 * https://github.com/scylladb/scylladb: test_automatic_cleanup: add test_cleanup_waits_for_stale_writes test_fencing: fix due to new version increment test_automatic_cleanup: clean it up storage_proxy: wait for closing sessions in sstable cleanup fiber storage_proxy: rename await_pending_writes -> await_stale_pending_writes storage_proxy: use run_fenceable_write storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check storage_proxy: introduce run_fenceable_write storage_proxy: move update_fence_version from shared_token_metadata storage_proxy: fix start_write() operation scope in apply_locally storage_proxy: move post fence check into handle_write storage_proxy: move fencing into mutate_counter_on_leader_and_replicate storage_proxy::handle_read: add fence check before get_schema storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber sstable_cleanup_fiber: use coroutine::parallel_for_each storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock topology_coordinator: barrier before cleanup topology_coordinator: small start_cleanup refactoring global_token_metadata_barrier: add fenced flag	2025-10-27 12:35:13 +01:00
Petr Gusev	5ab2db9613	paxos_state: inline shards_for_writes into get_replica_lock No need to have two functions since both callers of get_replica_lock() use shards_for_writes() to compute the shards where the locks must be acquired. Also while at it, inline the acquire() lambda in get_replica_lock() and replace it with a loop over shards. This makes the code more straightforward.	2025-10-27 11:12:29 +01:00
Michael Litvak	6109cb66be	cdc: garbage collect CDC streams periodically add a background fiber to the topology coordinator that runs periodically and checks for old CDC streams for tablets keyspaces that can be garbage collected.	2025-10-26 11:01:20 +01:00
Michael Litvak	440caeabcb	cdc: helpers for garbage collecting old streams for tablets introduce helper functions that can be used for garbage collecting old cdc streams for tablets-based keyspaces. - get_new_base_for_gc: finds a new base timestamp given a TTL, such that all older timestamps and streams can be removed. - get_cdc_stream_gc_mutations: given new base timestamp and streams, builds mutations that update the internal cdc tables and remove the older streams. - garbage_collect_cdc_streams_for_table: combines the two functions above to find a new base and build mutations to update it for a specific table - garbage_collect_cdc_streams: builds gc mutations for all cdc tables	2025-10-26 11:01:20 +01:00
Avi Kivity	b843d8bc8b	Merge 'scylla-sstable: add cql support to write operation' from Botond Dénes In theory, scylla-sstable write is an awesome and flexible tool to generate sstables with arbitrary content. This is convenient for tests and could come clutch in a disaster scenario, where the content of certain system tables needs to be manually re-created, system tables that are not writable directly via CQL. In practice, in its current form this operation is so convoluted to use that even its own author shuns it. This is because the JSON specification of the sstable content is the same as that of the scylla-sstable dump-data: containing every single piece of information on the mutation content. Where this is an advantage for dump-data, allowing users to inspect the data in its entirety -- it is a huge disadvantage for write, because all these details have to be filled in, down to the last timestamp, to generate an sstable. On top of that, the tool doesn't even support any of the more advanced data types, like collections, UDF and counters. This PR proposes a new way of generating sstables: based on the success of scylla-sstable query, it introduces CQL support for scylla-sstable write. The content of the sstable can now be specified via standard INSERT, UPDATE and DELETE statements, which are applied to a memtable, then flushed into the sstable. To avoid boundless memory consumption, the memtable is flushed every time it reaches 1MiB in size, consequently the command can generate multiple output sstables. The new CQL input-format is made default, this is safe as nobody is using this command anyway. Hopefully this PR will change that. Fixes: https://github.com/scylladb/scylladb/issues/26506 New feature, no backport. 
Closes scylladb/scylladb#26515 * github.com:scylladb/scylladb: test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql replica/mutation_dump: add support for virtual tables tools/scylla-sstable: print_query_results_json(): handle empty value buffer tools/scylla-sstable: add cql support to write operation tools/scylla-sstable: write_operation(): fix indentation tools/scylla-sstable: write_operation(): prepare for a new input-format tools/scylla-sstable: generalize query_operation_validate_query() tools/scylla-sstable: move query_operation_validate_query() tools/scylla-sstable: extract schema transformation from query operation replica/table: add virtual write hook to the other apply() overload too	2025-10-24 23:32:40 +03:00
Avi Kivity	997b52440e	Merge 'replica/mutation_dump: include empty/dead partitions in the scan results' from Botond Dénes `select * from mutation_fragment()` queries don't return partitions which are completely empty or only contain tombstones which are all garbage collectible. This is because the underlying `mutation_dump` mechanism has a separate query to discover partitions for scans. This query is a regular mutation scan, which is subject to query compaction and garbage collection. Disable the query compaction for mutation queries executed on behalf of mutation fragment queries, so all data is visible in the result, even that which is fully garbage collectible. Fixes scylladb/scylladb#23707. Scans for mutation-fragment are very rare, so a backport is not necessary. We can backport on-demand. Closes scylladb/scylladb#26227 * github.com:scylladb/scylladb: replica/mutation_dump: multi_range_partition_generator: disable garbage-collection replica: add tombstone_gc_enabled parameter to mutation query methods mutation/mutation_compactor: remove _can_gc member tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc	2025-10-24 23:26:16 +03:00
Patryk Jędrzejczak	5ae1aba107	test: unskip test_raft_recovery_entry_loss The issue has been fixed in #26612. Closes scylladb/scylladb#26614	2025-10-24 21:23:41 +03:00
Ferenc Szili	b4ca12b39a	load_stats: change data structure which contains tablet sizes This patch changes the tablet size map in load_stats. Previously, this data structure was: std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes; and is changed into: std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes; This allows for improved performance of tablet size reconciliation.	2025-10-24 14:37:00 +02:00
Andrzej Jackowski	8642629e8e	test: add test_anonymous_user to test_raft_service_levels The primary goal of this test is to reproduce scylladb/scylladb#26040 so the fix (`278019c328`) can be backported to older branches. Scenario: connect via CQL as an anonymous user and verify that the `sl:default` scheduling group is used. Before the fix for #26040 `main` scheduling group was incorrectly used instead of `sl:default`. Control connections may legitimately use `sl:driver`, so the test accepts those occurrences while still asserting that regular anonymous queries use `sl:default`. This adds explicit coverage on master. After scylladb#24411 was implemented, some other tests started to fail when scylladb#26040 was unfixed. However, none of the tests asserted this exact behavior. Refs: scylladb/scylladb#26040 Refs: scylladb/scylladb#26581 Closes scylladb/scylladb#26589	2025-10-24 12:23:34 +02:00
Ernest Zaslavsky	e8ce49dadf	s3_client: remove unnecessary `co_await` in `make_request` Eliminates a redundant `co_await` by directly returning the `future`, simplifying the control flow without affecting behavior.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	71ea973ae4	s3 cleanup: remove obsolete retry-related classes Delete `default_retry_strategy` and `retryable_http_client`, no longer used in `s3_client` after recent refactors.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	d44bbb1b10	s3_client: remove unused `filler_exception` Eliminate the now-obsolete `filler_exception`, rendered redundant by earlier refactors that streamlined error handling in the S3 client.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	d3c6338de6	s3_client: fix indentation Fix indentation in background download fiber in `chunked_download_source`	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	47704deb1e	s3_client: simplify chunked download error handling using `make_request` Refactor `chunked_download_source` to eliminate redundant exception handling by leveraging the new `make_request` override with custom retry strategy. This streamlines the download fiber logic, improving readability and maintainability.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	2bc9b205b6	s3_client: reformat `make_request` functions for readability Reformats `make_request` functions with long argument lists to improve readability and comply with formatting guidelines.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	bf39412f4a	s3_client: eliminate duplication in `make_request` by using overload Removes redundant code in the `make_request` function by invoking the appropriate overload, simplifying logic and improving maintainability.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	695e70834e	s3_client: reformat `make_request` function declarations for readability Reformats the `make_request` function declarations to improve readability due to the large number of arguments. This aligns with our formatting guidelines and makes the code easier to maintain.	2025-10-23 15:58:11 +03:00
Ernest Zaslavsky	9f01c1f3ff	s3_client: reorder `make_request` and helper declarations Performs minor reordering of helper functor declarations in the header file to improve readability and maintain logical grouping.	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	3d51124cb0	s3_client: add `make_request` override with custom retry and error handler Introduce an override for `make_request` in `s3_client` to support custom retry strategies and error handlers, enabling flexibility beyond the default client behavior and improving control over request handling	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	bdb3979456	s3_client: migrate s3_client to Seastar HTTP client Eliminate use of `retryable_http_client` in `s3_client` and adopt Seastar's native HTTP client.	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	2025760e75	s3_client: fix crash in `copy_s3_object` due to dangling stream In the `copy_part` method, move the `input_stream<char>` argument into a local variable before use. Failing to do so can lead to a SIGSEGV or trigger an abort under address sanitizer.	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	0983c791e9	s3_client: coroutinize `copy_s3_object` response callback coroutinize `copy_s3_object` response callback for a bugfix in the following commit to prevent failing on dangling stream	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	237217c798	aws_error: handle missing `unexpected_status_error` case Add a missing `case` clause to the `switch` statement to correctly handle scenarios where `unexpected_status_error` is thrown. This fixes overlooked error handling and improves robustness.	2025-10-23 15:58:10 +03:00
Ernest Zaslavsky	4f6384b1a0	s3_creds: use Seastar HTTP client with retry strategy In AWS credentials providers, replace `retryable_http_client` with Seastar's native HTTP client. Integrate the newly added `default_aws_retry_strategy` to handle retries more efficiently and reduce dependency on external retry logic.	2025-10-23 15:58:07 +03:00
Ernest Zaslavsky	3851ee58d7	retry_strategy: add exponential backoff to `default_aws_retry_strategy` Add exponential backoff to `default_aws_retry_strategy` and call it to `sleep` before returning `true`, no-op in case of non-retryable error	2025-10-23 15:49:34 +03:00
Ernest Zaslavsky	524737a579	retry_strategy: introduce Seastar-based retry strategy Add a new class derived from Seastar's `default_retry_strategy`. Relocate the `should_retry` implementation from Scylla's `default_retry_strategy` into the new class to centralize and standardize retry behavior.	2025-10-23 15:49:34 +03:00
Ernest Zaslavsky	51aadd0ab3	retry_strategy: update CMake and configure.py for new strategy Include `default_aws_retry_strategy` in the build system by updating CMake and `configure.py` to ensure it is properly compiled and linked.	2025-10-23 15:49:34 +03:00
Ernest Zaslavsky	5d65b47a15	retry_strategy: rename `default_retry_strategy` to `default_aws_retry_strategy` Renames the `default_retry_strategy` class to `default_aws_retry_strategy` to clarify its association with the S3 client implementation. This avoids confusion with the unrelated `seastar::default_retry_strategy` class.	2025-10-23 15:49:34 +03:00
Ernest Zaslavsky	cc200ced67	retry_strategy: fix include Fix header inclusion in "newly" created file	2025-10-23 15:49:34 +03:00
Ernest Zaslavsky	d679fd514c	retry_strategy: Copied utils/s3/retry_strategy.hh to utils/s3/default_aws_retry_strategy.hh	2025-10-23 15:49:34 +03:00
Ernest Zaslavsky	7cd4be4c49	retry_strategy: Copied utils/s3/retry_strategy.cc to utils/s3/default_aws_retry_strategy.cc	2025-10-23 15:49:34 +03:00
Aleksandra Martyniuk	6fc43f27d0	db: fix indentation	2025-10-23 10:39:43 +02:00
Aleksandra Martyniuk	1935268a87	test: add reproducer for data resurrection Add a reproducer to check that the repair_time isn't updated if the batchlog replay fails. If repair_time was updated, tombstones could be GC'd before the batchlog is replayed. The replay could later cause the data resurrection.	2025-10-23 10:39:43 +02:00
Aleksandra Martyniuk	d436233209	repair: fail tablet repair if any batch wasn't sent successfully If any batch replay failed, we cannot update repair_time as we risk the data resurrection. If replay of any batch needs to be retried, run the whole repair but fail at the very end, so that the repair_time for it won't be updated.	2025-10-23 10:39:42 +02:00
Aleksandra Martyniuk	e1b2180092	db/batchlog_manager: fix making decision to skip batch replay Currently, we skip batch replay if less than batch_log_timeout passed from the moment the batch was written. batch_log_timeout value can be configured. If it is large, the batch won't be replayed for a long time. If the tombstone is GC'd before the batch is replayed, then we risk data resurrection. To ensure safety we can skip only the batches that won't be GC'd. In this patch we skip replay of the batches for which: now() < written_at + min(timeout, propagation_delay) repair_time is set as a start of batchlog replay, so at the moment of the check we will have: repair_time <= now() So we know that: repair_time < written_at + propagation_delay With this condition we are sure that GC won't happen.	2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk	7f20b66eff	db: repair: throw if replay fails Return a flag determining whether all the batches were sent successfully in batchlog_manager::replay_all_failed_batches (batches skipped due to being too fresh are not counted). Throw in repair_flush_hints_batchlog_handler if not all batches were replayed, to ensure that repair_time isn't updated.	2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk	904183734f	db/batchlog_manager: delete batch with incorrect or unknown version batchlog_manager::replay_all_failed_batches skips batches that have unknown or incorrect version. Next round will process these batches again. Such batches will probably be skipped every time, so there is no point in keeping them. Even if at some point the version becomes correct, we should not replay the batch - it might be old and this may lead to data resurrection.	2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk	502b03dbc6	db/batchlog_manager: coroutinize replay_all_failed_batches	2025-10-23 10:38:31 +02:00
Ernest Zaslavsky	abd3abc044	cmake: fix the seastar API level Fix the build to make it compile when using CMake by defining the right Seastar API level Closes scylladb/scylladb#26690	2025-10-23 11:20:20 +03:00
Botond Dénes	f8b0142983	Merge 'Add --drop-unfixable-sstables flag for scrub in segregate mode' from Taras Veretilnyk This PR introduces support for a new scrub option: `--drop-unfixable-sstables`, which enables the dropping of corrupted SSTables during scrub only in segregate mode. The patch includes implementation, validation, and a set of tests to ensure correct behavior and error handling. Fixes #19060 Backport is not required, it is a new feature Closes scylladb/scylladb#26579 * github.com:scylladb/scylladb: sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option test/nodetool: add scrub drop-unfixable-sstables option testcase scrub: add support for dropping unfixable sstables in segregate mode	2025-10-23 11:06:19 +03:00
Wojciech Mitros	c0d0f8f85b	database: rename _view_update_concurrency_sem to _view_update_memory_sem In the following commit, we'll introduce a new semaphore for view updates that limits their concurrency by view update count. To avoid confusion, we rename the existing semaphore that tracks the memory used by concurrent view updates and related objects accordingly.	2025-10-23 10:00:15 +02:00
Tomasz Grabiec	564cebd0e6	Merge 'tablet_metadata_guard: fix split/merge handling' from Petr Gusev The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the `tablet_id` field (`_tablet`), which means the guard can no longer correctly protect ongoing operations from tablet migrations. The problem is specific to LWT, since `tablet_metadata_guard` is used mostly for heavy topology operations, which are mutually exclusive with split and merge. The guard was used for LWT as an optimization -- we don't need to block topology operations or migrations of unrelated tablets. In the future, we could use the guard for regular reads/writes as well (via the `token_metadata_guard` wrapper). Fixes [scylladb/scylladb#26437](https://github.com/scylladb/scylladb/issues/26437) backports: need to backport to 2025.4 since the bug is relevant to LWT over tablets. Closes scylladb/scylladb#26619 * github.com:scylladb/scylladb: test_tablets_lwt: add test_tablets_merge_waits_for_lwt test.py: add universalasync_typed_wrap tablet_metadata_guard: fix split/merge handling tablet_metadata_guard: add debug logs paxos_state: shards_for_writes: improve the error message storage_service: barrier_and_drain – change log level to info topology_coordinator: fix log message	2025-10-22 20:56:21 +02:00
Taras Veretilnyk	60334c6481	sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option Added a new test case, sstable_scrub_segregate_mode_drop_unfixable_sstables_test, which verifies that when the drop-unfixable-sstables flag is enabled in segregate mode, corrupted SSTables are correctly dropped.	2025-10-22 17:16:55 +02:00
Taras Veretilnyk	11874755a3	test/nodetool: add scrub drop-unfixable-sstables option testcase This patch introduces the test_scrub_drop_unfixable_sstables_option testcase, which verifies that the correct request is generated when the --drop-unfixable-sstables flag is used. It also validates that an error is thrown if the drop-unfixable-sstables flag is enabled and mode is not set to SEGREGATE.	2025-10-22 17:16:55 +02:00
Taras Veretilnyk	42da7f1eb6	scrub: add support for dropping unfixable sstables in segregate mode This patch adds a new flag `drop-unfixable-sstables` to the scrub operation in segregate mode, so that SSTables that cannot be fixed during scrub are dropped automatically. It also includes API support for the 'drop_unfixable_sstables' parameter and validation to ensure this flag is not enabled in modes other than segregate.	2025-10-22 17:16:49 +02:00
Radosław Cybulski	621e88ce52	Fix spelling errors Closes scylladb/scylladb#26652	2025-10-22 16:46:31 +02:00
Petr Gusev	22271b9fe7	test_automatic_cleanup: add test_cleanup_waits_for_stale_writes	2025-10-22 16:31:43 +02:00
Petr Gusev	d1fc111dd7	test_fencing: fix due to new version increment Topology version is now bumped when a node finishes bootstrapping. As a result, fence_version == version - 1, and decrementing version in the test no longer triggers a stale topology exception. Fix: run cleanup_all to invoke the global barrier, which synchronizes fence_version := version on all nodes.	2025-10-22 16:31:43 +02:00
Petr Gusev	5bdeb4ec66	test_automatic_cleanup: clean it up Remove redundant imports and variables. Extract cleanup_all function. Add logs. Remove pytest.mark.prepare_3_racks_cluster -- the test doesn't actually need a 3 node cluster, one initial node is enough.	2025-10-22 16:31:43 +02:00
Petr Gusev	f34126aacf	storage_proxy: wait for closing sessions in sstable cleanup fiber Ensure that no stale streaming or repair sessions are active before proceeding with the cleanup.	2025-10-22 16:31:43 +02:00
Petr Gusev	7e2959a1bf	storage_proxy: rename await_pending_writes -> await_stale_pending_writes	2025-10-22 16:31:43 +02:00
Petr Gusev	1dd05f4404	storage_proxy: use run_fenceable_write Switch local write code sites from start_write() to run_fenceable_write().	2025-10-22 16:31:43 +02:00
Petr Gusev	d56495fd9c	storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check All mutation_holder::apply_locally() implementations now do the same post fence check. In this commit we hoist this check up to abstract_write_response_handler::apply_locally().	2025-10-22 16:31:43 +02:00
Petr Gusev	24f8962938	storage_proxy: introduce run_fenceable_write This function is intended to replace start_write() in subsequent commits. It provides the following benefits: * Remove duplication: All start_write() call sites must run the fence check after the operation completes. run_fenceable_write() encapsulates this pattern. * Fix a race: To ensure no new stale write operations occur during cleanup, a fence check before start_write() was previously used. However, yields in several code paths between the check and start_write() made it non-atomic, allowing a stale operation to slip in if the fence_version was updated in between. * Optimize waiting: We do not need to wait for all operations—only for vnode-based, non-local tables with versions smaller than the current fence_version.	2025-10-22 16:31:43 +02:00
Petr Gusev	c5f447224a	storage_proxy: move update_fence_version from shared_token_metadata Future commits will extend update_fence_version, and it is simpler to do so if the function resides in storage_proxy. Additionally, fence_version is the only field this function accesses, and it is used solely within storage_proxy, making this change natural on its own.	2025-10-22 16:31:43 +02:00
Petr Gusev	659c5912e0	storage_proxy: fix start_write() operation scope in apply_locally The operation must be held during the local write. Before this commit, its scope ended after returning from apply_locally(), so it did not actually provide any protection.	2025-10-22 16:31:43 +02:00
Petr Gusev	27915befac	storage_proxy: move post fence check into handle_write handle_write() is invoked from receive_mutation_handler() and handle_paxos_learn(), and both previously performed a fence check in apply_fn. This commit hoists the fence check into handle_write() to reduce code duplication. Additionally, move start_write() after get_schema_for_write(), since there is no need to hold the operation while querying the schema.	2025-10-22 16:31:43 +02:00
Petr Gusev	41077138bf	storage_proxy: move fencing into mutate_counter_on_leader_and_replicate As noted in the code comments, start_write() does not need to be held during counter replication; it is required only while performing local storage modifications. Move the start_write() call and the fence check down to mutate_counter_on_leader_and_replicate(). Additionally, mutate_counters_on_leader() is updated to check for possible stale_topology_exception() and properly package them in the resulting exception_variant structure.	2025-10-22 16:31:43 +02:00
Petr Gusev	a6208b2d67	storage_proxy::handle_read: add fence check before get_schema Avoid querying the schema for outdated requests by adding a fence check at the start of handle_read.	2025-10-22 16:31:43 +02:00
Petr Gusev	263cbef68e	storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber The function applies only to vnode-based tables. Rename it to vnodes_cleanup_fiber for clarity.	2025-10-22 16:31:42 +02:00
Petr Gusev	03aa856da3	sstable_cleanup_fiber: use coroutine::parallel_for_each A refactoring commit -- no need to allocate a dedicated std::vector<future<>>.	2025-10-22 16:31:42 +02:00
Petr Gusev	4a781b67b5	storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock The flush_all_tables() call ensures that no obsolete, cleanup-eligible writes remain in the commitlog. This does not need to run under the group0 lock, so move it outside. Also, run await_pending_writes() before flush_all_tables(), since pending writes may include data that must be cleaned up. Finally, add more detailed info-level logs to trace the stages of the cleanup procedure.	2025-10-22 16:31:42 +02:00
Petr Gusev	a54ebe890b	topology_coordinator: barrier before cleanup Cleanup needs a barrier to make sure that no request coordinators are sending requests to old replicas/ranges that we're going to cleanup. For example, during node bootstrap, the cleanup process on replicas must be protected against coordinators running write_both_read_new and sending requests to old ranges. We run a barrier to ensure that most data-plane requests with the old topology finish before cleanup starts. At the same time, we do not want to block cleanup if the barrier fails on some replicas. Once the fence is committed to group0, we can safely proceed, since any late request with the old topology will be fenced out on the replica. The test for this case is added in a separate commit "test_automatic_cleanup: add test_cleanup_waits_for_stale_writes"	2025-10-22 16:31:42 +02:00
Petr Gusev	1b791dacde	topology_coordinator: small start_cleanup refactoring Rename start_cleanup -> start_vnodes_cleanup for clarity. Pass topology_request and server_id in start_vnodes_cleanup, we will need them for better logging later.	2025-10-22 16:31:42 +02:00
Petr Gusev	d53e24812f	global_token_metadata_barrier: add fenced flag Cleanup needs a barrier. For example, during node bootstrap, the cleanup process on replicas must be protected against coordinators running write_both_read_new and sending requests to old ranges. We run a barrier to ensure that most data-plane requests with the old topology finish before cleanup starts. At the same time, we do not want to block cleanup if the barrier fails on some replicas. Once the fence is committed to group0, we can safely proceed, since any late request with the old topology will be fenced out on the replica. To support this, introduce a "fenced" flag. The client can pass a pointer to a bool, which will be set to true after the new fenced_version is committed.	2025-10-22 16:31:42 +02:00
Calle Wilund	c4427f6d4f	test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper Use the shared logic of DockerizedServer to provide the fake-gcs-server docker helper.	2025-10-22 14:06:30 +00:00
Calle Wilund	1aa8014f8f	test::cluster::test_encryption: Port dtest EAR tests Moves the tests, reexamined and simplified, to unit tests instead of dtest.	2025-10-22 14:06:30 +00:00
Asias He	5f1febf545	repair: Remove the regular mode name in the tablet repair api The patch `e34deb72f9` (repair: Rename incremental mode name) missed one place that references the removed regular mode name. Fixes #26503 Closes scylladb/scylladb#26660	2025-10-22 16:55:55 +03:00
Botond Dénes	1c7f1f16c8	Merge 'raft topology: fix group0 tombstone GC in the Raft-based recovery procedure' from Patryk Jędrzejczak Group0 tombstone GC considers only the current group 0 members while computing the group 0 tombstone GC time. It's not enough because in the Raft-based recovery procedure, there can be nodes that haven't joined the current group 0 yet, but they have belonged to a different group 0 and thus have a non-empty group 0 state ID. The current code can cause a data resurrection in group 0 tables. We fix this issue in this PR and add a regression test. This issue was uncovered by `test_raft_recovery_entry_loss`, which became flaky recently. We skipped this test for now. We will unskip it in a following PR because it's skipped only on master, while we want to backport this PR. Fixes #26534 This PR contains an important bugfix, so we should backport it to all branches with the Raft-based recovery procedure (2025.2 and newer). Closes scylladb/scylladb#26612 * github.com:scylladb/scylladb: test: test group0 tombstone GC in the Raft-based recovery procedure group0_state_id_handler: remove unused group0_server_accessor group0_state_id_handler: consider state IDs of all non-ignored topology members	2025-10-22 16:40:11 +03:00
Ernest Zaslavsky	a09ec56e3d	cmake: fix `s3_test` linkage Fix missing `s3_test` executable linkage with `scylla_encryption` Closes scylladb/scylladb#26655	2025-10-22 14:14:43 +03:00
Anna Stuchlik	9c0ff7c46b	doc: add support for Debian 12 Fixes https://github.com/scylladb/scylladb/issues/26640 Closes scylladb/scylladb#26668	2025-10-22 14:09:13 +03:00
Calle Wilund	93e335f861	test::cluster::conftest: Add key_provider fixture Iterates test functions across all mockable providers and provides a key provider instance handling EAR setup.	2025-10-22 10:53:02 +00:00
Calle Wilund	6406879092	test::pylib::encryption_provider: Port dtest encryption provider classes Adds virtual interface for running scylla with EAR and various providers we can do mock for. Note: GCP KMS not implemented.	2025-10-22 10:53:02 +00:00
Petr Gusev	03d6829783	test_tablets_lwt: add test_tablets_merge_waits_for_lwt	2025-10-22 11:33:20 +02:00
Petr Gusev	33e9ea4a0f	test.py: add universalasync_typed_wrap The universalasync.wrap function doesn't preserve the type information, which confuses the VS Code Pylance plugin and makes code navigation hard. In this commit we fix the problem by adding a typed wrapper around universalasync.wrap. Fixes: scylladb/scylladb#26639	2025-10-22 11:32:37 +02:00
Petr Gusev	b23f2a2425	tablet_metadata_guard: fix split/merge handling The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the tablet_id field (_tablet), which means the guard can no longer correctly protect ongoing operations from tablet migrations. Fixes scylladb/scylladb#26437	2025-10-22 11:32:37 +02:00
Petr Gusev	ec6fba35aa	tablet_metadata_guard: add debug logs	2025-10-22 11:32:37 +02:00
Petr Gusev	64ba427b85	paxos_state: shards_for_writes: improve the error message Add the current token and tablet info, remove 'this_shard_id' since it's always written by the logging infrastructure.	2025-10-22 11:32:37 +02:00
Petr Gusev	6f4558ed4b	storage_service: barrier_and_drain – change log level to info Debugging global barrier issues is difficult without these logs. Since barriers do not occur frequently, increasing the log level should not produce excessive output.	2025-10-22 11:32:37 +02:00
Petr Gusev	e1667afa50	topology_coordinator: fix log message	2025-10-22 11:32:37 +02:00
Nadav Har'El	895d89a1b7	Update seastar submodule Among other things, the merge includes the patch "http: add "Connection: close" header to final server response.". This Fixes #26298: A missing response header meant that a test's client code sometimes didn't notice that the server closed the connection (since the client didn't need to use the connection again), which made one test flaky. * seastar bd74b3fa...63900e03 (6): > Merge 'Rework output_stream::slow_write()' from Pavel Emelyanov output_stream: Fix indentation of the slow_write() method output_stream: Remove pointless else output_stream: Replace std::swap with std::exchange output_stream: Unify some code-paths of slow_write() > Merge 'Deprecate in/out streams move-assignment operator' from Pavel Emelyanov iostream: Deprecate input/output stream default constructor and move-assignment operator test: Sub-split test-cases test: Don't reuse output_stream in file demo test: Keep input_/output_stream as optional util: Construct file_data_source in with_file_input_stream() websocket: Construct in/out in initializer list rpc: Wrap socket and buffers > scripts/perftune.py: detect corrupted NUMA topology information > Merge 'memory, smp: support more than 256 shards' from Avi Kivity reactor, smp: allocate smp queues across all shards memory: increase maximum shard count memory: make cpu_id_shift and related mask dynamic resource, memory: move memory limit calculation to memory.cc resource: don't error if --overprovisioned and asking for more vcpus than available > Merge 'Update perf_test text output, make columns selectable' from Travis Downs perf_tests: enhance text output perf_test_tests: add some check_output tests	2025-10-22 11:26:40 +03:00
Nadav Har'El	7c9f5ef59e	Merge 'alternator/executor: instantly mark view as built when creating it with base table' from Michał Jadwiszczak `CreateTable` request creates GSI/LSI together with the base table, the base table is empty and we don't need to actually build the view. In tablet-based keyspaces we can simply skip creating view building tasks and mark the view build status as SUCCESS on all nodes. Then, the view building worker on each node will mark the view as built in `system.built_views` (`view_building_worker::update_built_views()`). Vnode-based keyspaces will use the "old" logic of view builder, which will process the view and mark it as built. Fixes scylladb/scylladb#26615 This fix should be backported to 2025.4. Closes scylladb/scylladb#26657 * github.com:scylladb/scylladb: test/alternator/test_tablets: add test for GSI backfill with tablets test/alternator/test_tablets: add reproducer for GSI with tablets alternator/executor: instantly mark view as built when creating it with base table	2025-10-22 10:44:28 +03:00
Calle Wilund	31cc1160b4	test::pylib::dockerized_service: Add helper for running docker/podman While there is a docker interface for python, need to deal with the docker-in-docker issues etc. This uses pure subprocess and stream parse. Meant to provide enough flexibility for all our docker mock server needs.	2025-10-21 23:26:50 +00:00
Avi Kivity	ab488fbb3f	Merge 'Switch to seastar API level 9 (no more packet-s in output_stream/data_sink API)' from Pavel Emelyanov Other than patching Scylla sinks to implement new data_sink_impl::put(std::span<temporary_buffer>) overload, the PR changes transport write_response() method to stop using output_stream::write(scattered_message) because it's also gone. Using newer seastar API, no need to backport Closes scylladb/scylladb#26592 * github.com:scylladb/scylladb: code: Fix indentation after previous patch code: Switch to seastar API level 9 transport: Open-code invoke_with_counting into counting_data_sink::put transport: Don't use scattered_message utils: Implement memory_data_sink::put(net::packet)	2025-10-22 01:51:43 +03:00
Michał Jadwiszczak	34503f43a1	test/alternator/test_tablets: add test for GSI backfill with tablets The test should pass without the fix for scylladb/scylladb#26615, because the `executor::update_table()` uses `service::prepare_new_view_announcement()`, which creates view building tasks for the view. But it's better to add this test.	2025-10-22 00:34:49 +02:00
Michał Jadwiszczak	bdab455cbb	test/alternator/test_tablets: add reproducer for GSI with tablets	2025-10-22 00:34:10 +02:00
Andrei Chekun	24d17c3ce5	test.py: rewrite the wait_for_first_completed Rewrite wait_for_first_completed to return only the first completed task, with a guarantee that all cancelled and finished tasks are awaited. Use wait_for_first_completed to avoid falsely passing tests in the future and issues like #26148. Use gather_safely to await tasks and remove the warning that a coroutine was not awaited. Closes scylladb/scylladb#26435	2025-10-22 01:13:43 +03:00
Takuya ASADA	eb30594a60	dist: detect corrupted NUMA topology information Some environments have corrupted NUMA topology information, such as some instance types on AWS EC2 with specific Linux kernel images. In such environments, we cannot get HW information correctly from hwloc, so we cannot proceed with optimization in perftune. To avoid causing a script error, check the NUMA topology information and skip running perftune if the information is corrupted. Related scylladb/seastar#2925 Closes scylladb/scylladb#26344	2025-10-22 01:11:14 +03:00
Michał Jadwiszczak	8fbf122277	alternator/executor: instantly mark view as built when creating it with base table `CreateTable` request creates GSI/LSI together with the base table, the base table is empty and we don't need to actually build the view. In tablet-based keyspaces we can simply skip creating view building tasks and mark the view build status as SUCCESS on all nodes. Then, the view building worker on each node will mark the view as built in `system.built_views` (`view_building_worker::update_built_views()`). Vnode-based keyspaces will use the "old" logic of view builder, which will process the view and mark it as built. Fixes scylladb/scylladb#26615	2025-10-22 00:05:40 +02:00
Avi Kivity	029513bee9	Merge 'storage_proxy: wait for write handlers destruction' from Petr Gusev `shared_ptr<abstract_write_response_handler>` instances are captured in the `lmutate` and `rmutate` lambdas of `send_to_live_endpoints()`. As a result, an `abstract_write_response_handler` object may outlive its removal from the `storage_proxy::_response_handlers` map -> `cancel_all_write_response_handlers()` doesn't actually wait for requests completion -> `sp::drain_on_shutdown()` doesn't guarantee all requests are drained -> `sp::stop_remote()` completes too early and `paxos_store` is destroyed while LWT local writes might still be in progress. In this PR we introduce a `write_handler_destroy_promise` to wait for such pending instances in `cancel_write_handlers()` and `cancel_all_write_response_handlers()` to prevent the `use-after-free`. A better long-term solution might be to replace `shared_ptr` with `unique_ptr` for `abstract_write_response_handler` and use a separate gate to track the `lmutate/rmutate` lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in `~abstract_write_response_handler`. Fixes scylladb/scylladb#26355 backport: need to be backported to 2025.4 since #26355 is reproduced on LWT over tablets Closes scylladb/scylladb#26408 * github.com:scylladb/scylladb: test_tablets_lwt: add test_lwt_shutdown storage_proxy: wait for write handler destruction storage_proxy: coroutinize cancel_write_handlers storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler	2025-10-22 00:02:08 +03:00
Michał Hudobski	5c957e83cb	vector_search: remove dependence on cql3 This patch removes the dependence of vector search module on the cql3 module by moving the contents of cql3/type_json.hh to types/json_utils.hh and removing the usage of cql3 primary_key object in vector_store_client. We also make the needed adjustments to files that were previously using the aforementioned type_json.hh file. This fixes the circular dependency cql3 <-> vector_search. Closes scylladb/scylladb#26482	2025-10-21 17:41:55 +03:00
Michael Litvak	35711a4400	test: cdc: test cdc compatible schema Add a simple test verifying our changes for the compatible CDC schema. The test checks we can write to a table with CDC enabled after ALTER and after node restart.	2025-10-21 14:14:34 +02:00
Michael Litvak	448e14a3b7	cdc: use compatible cdc schema In the CDC log transformer, when augmenting a base mutation, use the CDC log schema that is compatible with the base schema, if set. Now that the base schema has a pointer to its CDC schema, we can use it instead of getting the current schema from the db, which may not be compatible with the base schema. The compatible CDC schema may not be set if the cluster is not using raft mode for schema. In this case, we maintain the previous behavior.	2025-10-21 14:14:33 +02:00
Michael Litvak	6e2513c4d2	db: schema_applier: create schema with pointer to CDC schema When creating a schema for a non-CDC table in the schema_applier, find its CDC schema that we created previously in the same operation, if any, and create the schema with a pointer to the CDC schema. We use the fact that for a base table with CDC enabled, its CDC schema is created or altered together in the same group0 operation. Similarly, in schema_tables, when creating table schemas from the schema tables, first create all schemas that don't have CDC enabled, then create schemas that have CDC enabled by extending them with the pointer to the CDC schema that we created before. There are few additional cases where we create schemas that we need to consider how to handle. When loading a schema from schema tables in the schema_loader we decide not to set the CDC schema, because this schema is mostly used for tools and it's not used for generating CDC mutations. When transporting a schema by RPC in the migration manager, we don't transport its CDC schema, and we always set it to null. Because we use raft we expect this shouldn't have any effect, because the schema is synchronized through raft and not through the RPC.	2025-10-21 14:13:43 +02:00
Michael Litvak	4fe13c04a9	db: schema_applier: extract cdc tables Previously in the schema applier we have two maps of schema_mutations, for tables and for views. Now create another map for CDC tables by extracting them from the non-views tables map. We maintain the previous behavior by applying each operation that's done on the tables map, to the CDC map as well. Later we will want to handle CDC and non-CDC tables differently. We want to be able to create all CDC schemas first, so when we create the non-CDC tables we can create them with a pointer to their CDC schemas.	2025-10-21 14:13:43 +02:00
Michael Litvak	ac96e40f13	schema: add pointer to CDC schema Add to the schema object a member that points to the CDC schema object that is compatible with this schema, if any. The compatible CDC schema is created and altered with its base schema in the same group0 operation. When generating CDC log mutations for some base mutation we want them to be created using a compatible schema that has a CDC column corresponding to each base column. This change will allow us to find the right CDC schema given a base mutation. We also update the relevant structures in the schema registry that are related to learning about schemas and transporting schemas across shards or nodes. When transporting a schema as frozen_schema, we need to transport the frozen cdc schema as well, and set it again when unfreezing and reconstructing the schema. When adding a schema to the registry, we need to ensure its CDC schema is added to the registry as well. Currently we always set the CDC schema to nullptr and maintain the previous behavior. We will change it in a later commit. Until then, we mark all places where CDC schema is passed clearly so we don't forget it.	2025-10-21 14:13:43 +02:00
Michael Litvak	60f5c93249	schema_registry: remove base_info from global_schema_ptr Remove the _base_info member from global_schema_ptr, and use the base_info we have stored in the schema registry entry instead. Currently when constructing a global_schema_ptr from a schema_ptr it extracts and stores the base_info from the schema_ptr. Later it uses it to reconstruct the schema_ptr, together with the frozen schema from the schema registry entry. But we can use the base_info that is already stored in the schema registry entry.	2025-10-21 14:13:43 +02:00
Michael Litvak	085abef05d	schema_registry: use extended_frozen_schema in schema load Change the schema loader type in the schema_registry to return an extended_frozen_schema instead of view_schema_and_base_info, and remove view_schema_and_base_info which is not used anymore. The casting between them is trivial.	2025-10-21 14:13:43 +02:00
Michael Litvak	8c7c1db14b	schema_registry: replace frozen_schema+base_info with extended_frozen_schema The schema_registry_entry holds a frozen_schema and a base_info. The base_info is extracted from the schema_ptr on load of a schema_ptr, and it is used when unfreezing the schema. But this is exactly what extended_frozen_schema is doing, so we can just store an object of this type in the schema_registry_entry. This makes the code simpler because the schema registry doesn't need to be aware of the base_info.	2025-10-21 14:13:43 +02:00
Michael Litvak	278801b2a6	frozen_schema: extract info from schema_ptr in the constructor Currently we construct a frozen schema with base info in few places, and the caller is responsible for constructing the frozen schema and extracting the base info if it's a view table. We change it to make it simpler and remove the burden from the caller. The caller can simply pass the schema_ptr, and the constructor for extended_frozen_schema will construct the frozen schema and extract the additional info it needs. This will make it easier to add additional fields, and reduces code duplication. We also make temporary castings between extended_frozen_schema and view_schema_and_base_info for the transition, which are trivial, until they are combined to a single type.	2025-10-21 14:13:42 +02:00
Michael Litvak	154d5c40c8	frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema This commit starts a series of refactoring commits of the frozen_schema to reduce duplication and make it easier to extend. Currently there are two essentially identical types, frozen_schema_with_base_info and view_schema_and_base_info in the schema_registry that hold a frozen_schema together with a base_info for view schemas. Their role is to pass around a frozen schema together with additional info that is extracted from the schema and passed around with it when transporting it across shards or nodes, and is needed for reconstructing it, and it is not part of the schema mutations. Our goal is to combine them to a single type that we will call extended_frozen_schema.	2025-10-21 14:13:42 +02:00
Emil Maskovsky	cf93820c0a	test/cluster: fix missing await in test_group0_tombstone_gc The recursive call to alter_system_schema() was missing the await keyword, which meant the coroutine was never actually executed and the test wasn't doing what it was supposed to do. Not backporting: Test fix only. Closes scylladb/scylladb#26623	2025-10-21 11:22:39 +02:00
Calle Wilund	91db8583f8	test::pylib::kmip_wrapper: Modify to be usable by pytest fixtures Add `serve` impl that does not mess with signals, and shutdown that does not mess with threads. Also speed up standalone shutdown to make boost tests less slow.	2025-10-21 09:01:55 +00:00
Calle Wilund	772bd856e2	test::boost::kmip_wrapper: Move python script for PyKMIP to pylib Prepare for re-use in python tests as well as boost ones.	2025-10-21 09:01:54 +00:00
Avi Kivity	0ed178a01e	build: disable the -fextend-variable-liveness clang option In clang 21, the -fextend-variable-liveness option was made default [1] with -Og. It helps reduce "optimized out" problems while debugging. However, it conflicts [2] with coroutines. To prevent problems during the upgrade to Clang 21, disable the option. [1] `36af7345df` [2] https://github.com/llvm/llvm-project/issues/163007 Closes scylladb/scylladb#26573	2025-10-21 10:47:34 +03:00
Botond Dénes	fbceb8c16b	Merge 's3_client: handle failures which require http::request updating' from Ernest Zaslavsky Apply two main changes to the s3_client error handling 1. Add a loop to s3_client's `make_request` for the case when the retry strategy will not help since the request itself has to be updated. For example, authentication token expiration or timestamp on the request header 2. Refine the way we handle exceptions in the `chunked_download_source` background fiber, now we carry the original `exception_ptr` and also we wrap EVERY exception in `filler_exception` to prevent retry strategy trying to retry the request altogether Fixes: https://github.com/scylladb/scylladb/issues/26483 Should be ported back to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions Closes scylladb/scylladb#26527 * github.com:scylladb/scylladb: s3_client: tune logging level s3_client: add logging s3_client: improve exception handling for chunked downloads s3_client: fix indentation s3_client: add max for client level retries s3_client: remove `s3_retry_strategy` s3_client: support high-level request retries s3_client: just reformat `make_request` s3_client: unify `make_request` implementation	2025-10-21 10:40:38 +03:00
Botond Dénes	c543059f86	Merge 'Synchronize tablet split and load-and-stream' from Raphael Raph Carvalho Load-and-stream is broken when running concurrently with the finalization step of tablet split. Consider this: 1) split starts 2) split finalization executes barrier and succeeds 3) load-and-stream runs now, starts writing sstable (pre-split) 4) split finalization publishes changes to tablet metadata 5) load-and-stream finishes writing sstable 6) sstable cannot be loaded since it spans two tablets. Two possible fixes (maybe both): 1) load-and-stream waits for topology to quiesce 2) perform split compaction on sstable that spans both sibling tablets This patch implements # 1. By waiting for topology to quiesce, we guarantee that load-and-stream only starts when there's no chance coordinator is handling some topology operation like split finalization. Fixes https://github.com/scylladb/scylladb/issues/26455. Closes scylladb/scylladb#26456 * github.com:scylladb/scylladb: test: Add reproducer for l-a-s and split synchronization issue sstables_loader: Synchronize tablet split and load-and-stream	2025-10-21 09:43:38 +03:00
Tomasz Grabiec	ba692d1805	schema_tables: Keep "replication" column backwards-compatible by expanding rack lists to numeric RF In `380f243986` we added support for rack lists in replication options. Drivers which are not prepared to parse that (as of now, all of them), will not create metadata object for that keyspace. This breaks, for example, the "copy to/from" cqlsh command. Potentially other things too. To fix that, keep the "replication" column in the old format, and store numeric RF there, which corresponds to the number of replicas. Accurate options in the new format are put in "replication_v2". We set replication_v2 in the schema only when it differs from the old "replication" so that the new column is not set during upgrade, otherwise downgrade would fail. Partition tombstone is added to ensure that pre-alter replication_v2 value is deleted on alters which change replication to a value which is the same as the post-alter "replication" value. Fixes #26415 Closes scylladb/scylladb#26429	2025-10-21 09:11:25 +03:00
Tomasz Grabiec	e4e79be295	Merge 'tablet_allocator: allow merges in base tables if rf-rack-valid=true' from Piotr Dulikowski Tablet merge of base tables is only safe if there is at most one replica in each rack. For more details on why it is the case please see scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on, this condition is satisfied, so allow it in that case. Fixes: scylladb/scylladb#26273 Marked for backport to 2025.4 as MVs are getting un-experimentaled there. Closes scylladb/scylladb#26278 * github.com:scylladb/scylladb: test: mv: add a test for tablet merge tablet_allocator, tests: remove allow_tablet_merge_with_views injection tablet_allocator: allow merges in base tables if rf-rack-valid=true	2025-10-21 00:18:30 +02:00
Raphael S. Carvalho	4654cdc6fd	test: Add reproducer for l-a-s and split synchronization issue Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-10-20 19:17:25 -03:00
Raphael S. Carvalho	3abc66da5a	sstables_loader: Synchronize tablet split and load-and-stream Load-and-stream is broken when running concurrently with the finalization step of tablet split. Consider this: 1) split starts 2) split finalization executes barrier and succeeds 3) load-and-stream runs now, starts writing sstable (pre-split) 4) split finalization publishes changes to tablet metadata 5) load-and-stream finishes writing sstable 6) sstable cannot be loaded since it spans two tablets. Two possible fixes (maybe both): 1) load-and-stream waits for topology to quiesce 2) perform split compaction on sstable that spans both sibling tablets This patch implements #1. By waiting for topology to quiesce, we guarantee that load-and-stream only starts when there's no chance coordinator is handling some topology operation like split finalization. Fixes #26455. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-10-20 19:17:22 -03:00
Piotr Dulikowski	f76917956c	view_building_worker: access tablet map through erm on sstable discovery Currently, the data returned by `database::get_tables_metadata()` and `database::get_token_metadata()` may not be consistent. Specifically, the tables metadata may contain some tablet-based tables before their tablet maps appear in the token metadata. This is going to be fixed after issue scylladb/scylladb#24414 is closed, but for the time being work around it by accessing the token metadata via `table`->effective_replication_map() - that token metadata is guaranteed to have the tablet map of the `table`. Fixes: scylladb/scylladb#26403 Closes scylladb/scylladb#26588	2025-10-21 00:14:39 +02:00
Petr Gusev	8925f31596	test_tablets_lwt: add test_lwt_shutdown	2025-10-20 20:16:09 +02:00
Petr Gusev	bbcf3f6eff	storage_proxy: wait for write handler destruction shared_ptr<abstract_write_response_handler> instances are captured in the lmutate/rmutate lambdas of send_to_live_endpoints(). As a result, an abstract_write_response_handler object may outlive its removal from the _response_handlers map. We use write_handler_destroy_promise to wait for such pending instances in cancel_write_handlers() and cancel_all_write_response_handlers() to prevent use-after-free. A better long-term solution might be to replace shared_ptr with unique_ptr for abstract_write_response_handler and use a separate gate to track the lmutate/rmutate lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in ~abstract_write_response_handler. Fixes scylladb/scylladb#26355	2025-10-20 20:10:42 +02:00
Petr Gusev	b269f78fa6	storage_proxy: coroutinize cancel_write_handlers The cancel_write_handlers() method was assumed to be called in a thread context, likely because it was first used from gossiper events, where a thread context already existed. Later, this method was reused in abort_view_writes() and abort_batch_writes(), where threads are created on the fly and appear redundant. The drain_on_shutdown() method also used a thread, justified by some "delicate lifetime issues", but it is unclear what that actually means. It seems that a straightforward co_await should work just fine.	2025-10-20 19:49:02 +02:00
Petr Gusev	bf2ac7ee8b	storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler A strong pointer was held for the duration of thread::yield(), preventing abstract_write_response_handler destruction and possibly delaying the sending of timeout or error responses to the client. This commit removes the strong pointer. Instead, we compute the next iterator before calling timeout_cb(), so if the handler is destroyed inside timeout_cb(), we already have a valid next iterator.	2025-10-20 19:49:02 +02:00
Piotr Wieczorek	a3ec6c7d1d	alternator/streams: Support userIdentity field for TTL deletions UserIdentity is a map of two fields in GetRecords responses, which always has the same value. It may be missing, or contain a constant object with value `{"type": "Service", "principalId": "dynamodb.amazonaws.com"}`. Currently, the latter is set only for `REMOVE`s triggered by TTL. This commit introduces two new CDC operation types: `service_row_delete` and `service_partition_delete`, emitted in place of `row_delete` and `partition_delete`. Alternator Streams treats them as regular `REMOVE`s, but in addition adds the `userIdentity` field to the record. This change may break existing Scylla libraries for reading raw CDC tables, but we doubt that anybody has this use case. Refs https://github.com/scylladb/scylladb/pull/26149 Refs https://github.com/scylladb/scylladb/pull/26121 Fixes https://github.com/scylladb/scylladb/issues/11523 Closes scylladb/scylladb#26460	2025-10-20 17:15:59 +02:00
Nadav Har'El	eb06ace944	Merge 'auth: implement vector store authorization' from Michał Hudobski This patch implements the changes required by the Vector Store authorization, as described in https://scylladb.atlassian.net/wiki/spaces/RND/pages/107085899/Vector+Store+Authentication+And+Authorization+To+ScyllaDB, that is: - adding a new permission VECTOR_SEARCH_INDEXING, grantable only on ALL KEYSPACES - allowing users with that permission to perform SELECT queries, but only on tables with a vector index - increasing the number of scheduling groups by one to allow users to create a service level for a vector store user - adjusting the tests and documentation These changes are needed, as the vector indexes are managed by the external service, Vector Store, which needs to read the tables to create the indexes in its memory. We would like to limit the privileges of that service to a minimum to maintain the principle of least privilege, therefore a new permission, one that allows the SELECTs conditional on the existence of a vector_index on the table. Fixes: VECTOR-201 Backport reasoning: Backport to 2025.4 required as this can make upgrading clusters more difficult if we add it in 2026.1. As for now Scylla Cloud requires version 2025.4 to enable vector search and permission is set by orchestrator so there is no chance that someone will try to add this permission during upgrade. In 2026.1 it will be more difficult. Closes scylladb/scylladb#25976 * github.com:scylladb/scylladb: docs: adjust docs for VS auth changes test: add tests for VECTOR_SEARCH_INDEXING permission cql: allow VECTOR_SEARCH_INDEXING users to select auth: add possibilty to check for any permission in set auth: add a new permission VECTOR_SEARCH_INDEXING	2025-10-20 17:32:00 +03:00
Ernest Zaslavsky	fdd0d66f6e	s3_client: tune logging level Change all logging related to errors in `chunked_download_source` background download fiber to `info` to make it visible right away in logs.	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	4497325cd6	s3_client: add logging Add logging for the case when we encounter expired credentials. This shouldn't happen, but just in case.	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	1d34657b14	s3_client: improve exception handling for chunked downloads Refactor the wrapping exception used in `chunked_download_source` to prevent the retry strategy from reattempting failed requests. The new implementation preserves the original `exception_ptr`, making the root cause clearer and easier to diagnose.	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	58a1cff3db	s3_client: fix indentation Reformat `client::make_request` to fix the indentation of `if` block	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	43acc0d9b9	s3_client: add max for client level retries To prevent the client from retrying indefinitely on time skew and authentication errors, add `max_attempts` to `client::make_request`	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	116823a6bc	s3_client: remove `s3_retry_strategy` It never worked as intended, so the credentials handling is moving to the same place where we handle time skew, since we have to reauthenticate the request	2025-10-20 17:12:59 +03:00
Ernest Zaslavsky	185d5cd0c6	s3_client: support high-level request retries Add an option to retry S3 requests at the highest level, including reinitializing headers and reauthenticating. This addresses cases where retrying the same request fails, such as when the S3 server rejects a timestamp older than 15 minutes.	2025-10-20 17:12:59 +03:00
Dawid Mędrek	7e201eea1a	index: Set tombstone_gc when creating secondary index Before this commit, when the underlying materialized view was created, it didn't have the property `tombstone_gc` set to any value. That was a bug and we fix it now. Two reproducer tests are added for validation. They reproduce the problem and don't pass before this commit. Fixes scylladb/scylladb#26542	2025-10-20 14:04:45 +02:00
Dawid Mędrek	e294b80615	index: Make `create_view_for_index` method of `create_index_statement`	2025-10-20 14:04:16 +02:00
Dawid Mędrek	fe00485491	index: Move code for creating MV of secondary index to cql3 We move the code responsible for creating the schema for the underlying materialized view of a secondary index from `index/` to `cql3/` so that it's close to that responsible for performing `CREATE INDEX`. That's in line with how other CQL statements are designed. Note that the moved method is still a method of `secondary_index_manager`. We'll make it a method of `create_index_statement` in the following commit.	2025-10-20 14:04:11 +02:00
Dawid Mędrek	20761b5f13	db, cql3: Move creation of underlying MV for index The main goal of this patch is to give more control over the creation of the underlying view on an index to `create_index_statement.cc`. That goal is in line with how the other statements are executed: the schema is built in the cql3 module and only the ready schema_ptr is passed further. That should also make the code cleaner and easier to understand. There are a few important things to note here: * A call to `service::prepare_new_view_announcement` appears out of nowhere. Aside from some validation checks and logging, that function does pretty much the same as the pre-existing code we remove: a. It creates Raft mutations based on the passed `view_ptr`. b. It creates Raft mutations responsible for view building tasks. c. It notifies about a new column family. * We seemingly get rid of the code that creates view building tasks. That's not true: we still do that via `service::prepare_new_view_announcement`. That should explain why the change doesn't remove any relevant logic. On the other hand, it might be more difficult to explain why moving the code is correct. I'll touch on it below. Before that, it may also be important to highlight that this commit only affects the logic responsible for creating an index. There should be no effect on any other part of how Scylla behaves. --- Proving the correctness of the solution would take quite a lot of space, so I'll only summarize it. It relies on a few things: 1. Two schema changes cannot happen in one operation. We allow for more but only when those changes are dependent on each other and when the additional ones are internal for Scylla, e.g. creating an index leads to creating the underlying materialized view. 2. There are no entities or components that rely on indexes. 3. Each index is uniquely defined by the keyspace it belongs to and the name of the index. 4. 
There is a bijection between rows in `system_schema.indexes` and the currently existing indexes. 5. The name of an unnamed index depends on the name of the base table and the names of the indexed columns. The name of an unnamed index may have a number attached to it, but that number only depends on the state of the schema at the time of creation of the index, and it never changes later on. There are no other things the name of an unnamed index depends on. 6. Scylla doesn't allow for changing any column in the base table that has an index depending on it. Based on that, we conclude that every existing index has exactly one entry in `system_schema.indexes`, and the primary key of that entry never changes. The columns of `system_schema.indexes` that are not part of the primary key are: `kind` and `options`. Both values are only decided at the time of creation of an index, and currently there's no way to modify them. That implies that there are only two events when an entry in the system table can change: when creating an index and when dropping an index. --- When we consider the previous place of the logic that this commit moves to `cql3/statements/create_index_statement.cc`, it works like this: 1. We compare the sets of indexes defined on a specific table (in the form of a structure called `index_metadata`) before and after an operation. 2. We divide the entries into three sets: those present in both sets and those present in only one of them. 3. We handle each of those three sets separately. The structure `index_metadata` is a reflection of entries in `system_schema.indexes`. It stores one more parameter -- `local` -- but its value depends on the other values of an entry, so we can ignore it in this reasoning. Because an index cannot be modified -- it can only be created or dropped -- there are at most two non-empty sets: the set of new indexes and the set of dropped indexes. 
Those sets are only non-empty during an operation like `CREATE INDEX`, `DROP INDEX`, `DROP TABLE (base table)`, `DROP KEYSPACE`. Note that it's impossible to drop an index by dropping the underlying materialized view -- Scylla doesn't allow for that. However, the code in `migration_manager.cc` we call (`prepare_column_family_update_announcement`) and the code that we call in `schema_tables.cc` (`make_update_table_mutations`) is only triggered by updates related to the base table. In the context of `DROP TABLE` or `DROP KEYSPACE`, we'd call `prepare_column_family_drop_announcement` instead. In other words, we're only concerned with `CREATE INDEX` and `DROP INDEX`. --- A conclusion from this reasoning is that we only need to consider those two situations when talking about correctness of this change. The impact of this commit is that we may have potentially reordered mutations in the resulting vector that will be applied to the Raft log. The only mutations we may have reordered are the mutations responsible for creating the underlying view and the mutations responsible for updating columns in the base table. It's clear then that this commit brings no change at all: we only give `cql3/statements/create_index_statement.cc` more control over creating the underlying view. --- We leave a remnant of the code in `db/schema_tables.cc` responsible for dropping an index along with its underlying view. It would require changing a bit more of the logic, and we don't need it for the rest of this sequence of changes. Refs scylladb/scylladb#16454	2025-10-20 14:04:06 +02:00
Łukasz Paszkowski	7ec369b900	database: Log message after critical_disk_utilization mode is set This is a follow-up of the previous fix: https://github.com/scylladb/scylladb/pull/26030 The test test_user_writes_rejection starts a 3-node cluster and creates a large file on one of the nodes, to trigger the out-of-space prevention mechanism, which should reject writes on that node. It waits for the log message 'Setting critical disk utilization mode: true' and then executes a write expecting the node to reject it. Currently, the message is logged before the `_critical_disk_utilization` variable is actually updated. This causes the test to fail sporadically if it runs quickly enough. The fix splits the logging into two steps: 1. "Asked to set critical disk utilization mode" - logged before any action 2. "Set critical disk utilization mode" - logged after `_critical_disk_utilization` has been updated The tests are updated to wait for the second message. Fixes https://github.com/scylladb/scylladb/issues/26004 Closes scylladb/scylladb#26392	2025-10-20 13:24:10 +03:00
Asias He	33bc1669c4	repair: Fix uuid and nodes_down order in the log Fixes #26536 Closes scylladb/scylladb#26547	2025-10-20 13:21:59 +03:00
Pavel Emelyanov	44ed3bbb7c	Merge 'RFC: Initial GCP storage backend for scylla (sstables + backup)' from Calle Wilund Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage. Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers. This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend. Similarly with storage_options. Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc). Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends. Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake. Fixes #25359 Fixes #26453 Closes scylladb/scylladb#26186 * github.com:scylladb/scylladb: docs::dev::object_storage: Add some initial info on GS storage docs/dev: Add mention of (nested) docker usage in testing.md sstables::object_storage_client: Forward memory limit semaphore to GS instance utils::gcp::object_storage: Add optional memory limits to up/download sstables::object_storage_client: Add multi-upload support for GS utils::gcp::storage: Add merge objects operation test_backup/test_basic: Make tests multiplex both s3 and gs backends test::cluster::conftest: Add support for multiple object storage backends boost::gcs_storage_test: reindent boost::gcs_storage_test: Convert to use fixture tests::boost: Add GS object storage cases to mirror S3 ones tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env sstables::object_storage_client: Add google storage implementation test_services: Allow 
testing with GS object storage parameters utils::gcp::gcp_credentials: Add option to create uninitialized credentials utils::gcp::object_storage: Make create_download_source return seekable_data_source utils::gcp::object_storage: Add defensive copies of string_view params utils::gcp::object_storage: Add missing retry backoff increate utils::gcp::object_storage: Add timestamp to object listing utils::gcp::object_storage: Add paging support to list_objects object_storage_client: Add object_name wrapper type utils::gcp::object_storage: Add optional abort_source utils::rest::client: Add abort_source support sstables: Use object_storage_client for remote storage sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial) s3::upload_progress: Promote to general util type storage_options: Abstract s3 to "object_storage" and add gs as option sstables::file_io_extension: Change "creator" callback to just data_source utils::io-wrappers: Add ranged data_source utils::io-wrappers: Add file wrapper type for seekable_source utils::seekable_source: Add a seekable IO source type object_storage_endpoint_param: Add gs storage as option config: break out object_storage_endpoint_param preparing for multi storage	2025-10-20 13:14:53 +03:00
Patryk Jędrzejczak	c57f097630	test: test group0 tombstone GC in the Raft-based recovery procedure We add a regression test for the bug fixed in the previous commits.	2025-10-20 12:05:11 +02:00
Patryk Jędrzejczak	6b2e003994	group0_state_id_handler: remove unused group0_server_accessor It became unused in the previous commit.	2025-10-20 12:05:11 +02:00
Patryk Jędrzejczak	1d09b9c8d0	group0_state_id_handler: consider state IDs of all non-ignored topology members It's not enough to consider only the current group 0 members. In the Raft-based recovery procedure, there can be nodes that haven't joined the current group 0 yet, but they have belonged to a different group 0 and thus have a non-empty group 0 state ID. We fix this issue in this commit by considering topology members instead. We don't consider ignored nodes as an optimization. When some nodes are dead, the group 0 state ID handler won't have to wait until all these nodes leave the cluster. It will only have to wait until all these nodes are ignored, which happens at the beginning of the first removenode/replace. As a result, tombstones of group 0 tables will be purged much sooner. We don't rename the `group0_members` variable to keep the change minimal. There seems to be no precise and succinct name for the used set of nodes anyway. We use `std::ranges::join_view` in one place because: - `std::ranges::concat` will become available in C++26, - `boost::range::join` is not a good option, as there is an ongoing effort to minimize external dependencies in Scylla.	2025-10-20 12:05:07 +02:00
Avi Kivity	87c0adb2fe	gdb: simplify and future-proof looking up coroutine frame type llvm recently updated [1] their coroutine debugging instructions. They now recommend looking up the variable __coro_frame in the coroutine function rather than constructing the name of the coroutine frame type from the ramp function plus __coro_frame_ty. Since the latter method no longer works with Clang 21 (I did not check why), and since the former method is blessed as being more compatible, switch to the recommended method. Since it works with both Clang 20 and Clang 21, it future proofs the script. [1] `6e784afcb5` Closes scylladb/scylladb#26590	2025-10-20 12:38:53 +03:00
Botond Dénes	1ab697693f	Merge 'compaction/twcs: fix use after free issues' from Lakshmi Narayanan Sreethar The `compaction_strategy_state` class holds strategy specific state via a `std::variant` containing different state types. When a compaction strategy performs compaction, it retrieves a reference to its state from the `compaction_strategy_state` object. If the table's compaction strategy is ALTERed while a compaction is in progress, the `compaction_strategy_state` object gets replaced, destroying the old state. This leaves the ongoing compaction holding a dangling reference, resulting in a use after free. Fix this by using `seastar::shared_ptr` for the state variant alternatives(`leveled_compaction_strategy_state_ptr` and `time_window_compaction_strategy_state_ptr`). The compaction strategies now hold a copy of the shared_ptr, ensuring the state remains valid for the duration of the compaction even if the strategy is altered. The `compaction_strategy_state` itself is still passed by reference and only the variant alternatives use shared_ptrs. This allows ongoing compactions to retain ownership of the state independently of the wrapper's lifetime. The method `maybe_wait_for_sstable_count_reduction()`, when retrieving the list of sstables for a possible compaction, holds a reference to the compaction strategy. If the strategy is updated during execution, it can cause a use after free issue. To prevent this, hold a copy of the compaction strategy so it isn’t yanked away during the method’s execution. Fixes #25913 Issue probably started after `9d3755f276`, so backport to 2025.4 Closes scylladb/scylladb#26593 * github.com:scylladb/scylladb: compaction: fix use after free when strategy is altered during compaction compaction/twcs: pass compaction_strategy_state to internal methods compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction	2025-10-20 10:45:47 +03:00
Ernest Zaslavsky	db1ca8d011	s3_client: just reformat `make_request` Just reformat previously changed methods to improve readability	2025-10-20 10:44:37 +03:00
Israel Fruchter	986e8d0052	Update tools/cqlsh submodule (v6.0.27) * tools/cqlsh ff3f572...f852b1f5 (2): > Add LZ4 as a required package - so ScyllaDB Python driver could use LZ4 compression > github actions: replace macos-13 with macos-15-intel Closes scylladb/scylladb#26608	2025-10-20 10:03:31 +03:00
Michael Litvak	b808d84d63	storage_service: improve colocated repair error to show table names When requesting repair for tablets of a colocated table, the request fails with an error. Improve the error message to show the table names instead of table IDs, because the table names are more useful for users. Fixes scylladb/scylladb#26567 Closes scylladb/scylladb#26568	2025-10-20 10:03:31 +03:00
Piotr Dulikowski	70b0cfb13e	Merge 'test: cluster: Replica exceptions tests' from Dario Mirovic This patch series introduces several tests that check the number of exceptions that happen during various replica operations. The goal is to have a set of tests that can catch situations where the number of exceptions per operation increases. It makes exception throw regressions easier to catch. The tests cover the apply counter update and apply functionalities in the database layer. There are more paths that can be checked, like various semaphore wait timeouts located deeper in the code. This set of tests does not cover all code paths. Fixes #18164 This is an improvement. No backport needed. Closes scylladb/scylladb#25992 * github.com:scylladb/scylladb: test: cluster: test replica write timeout database: parameterize apply_counter_update_delay_5s injector value test: cluster: test replica exceptions - test rate limit exceptions	2025-10-20 10:03:31 +03:00
Piotr Dulikowski	a716fab125	Merge 'alternator/metrics: Log operation sizes to histograms' from Piotr Wieczorek This PR adds operation per-table histograms to Alternator with item sizes involved in an operation, for each of the operations: `GetItem`, `PutItem`, `DeleteItem`, `UpdateItem`, `BatchGetItem`, `BatchWriteItem`. If read-before-write wasn't performed (i.e. it was not needed by the operation and the flag `alternator_force_read_before_write` was disabled), then we log sizes of the items that are in the request. Also, `UpdateItem` logs the maximum of the update size and the existing item size. We'll change it in a next PR. Fixes: #25143 Closes scylladb/scylladb#25529 * github.com:scylladb/scylladb: alternator: Add UpdateItem and BatchWriteItem response size metrics alternator: Add PutItem and DeleteItem response size metrics alternator: Add BatchGetItem response size metrics alternator: Add GetItem response size metrics alternator/test: Add more context to test_metrics.py asserts	2025-10-20 10:03:31 +03:00
Lakshmi Narayanan Sreethar	18c071c94b	compaction: fix use after free when strategy is altered during compaction The `compaction_strategy_state` class holds strategy specific state via a `std::variant` containing different state types. When a compaction strategy performs compaction, it retrieves a reference to its state from the `compaction_strategy_state` object. If the table's compaction strategy is ALTERed while a compaction is in progress, the `compaction_strategy_state` object gets replaced, destroying the old state. This leaves the ongoing compaction holding a dangling reference, resulting in a use after free. Fix this by using `seastar::shared_ptr` for the state variant alternatives(`leveled_compaction_strategy_state_ptr` and `time_window_compaction_strategy_state_ptr`). The compaction strategies now hold a copy of the shared_ptr, ensuring the state remains valid for the duration of the compaction even if the strategy is altered. The `compaction_strategy_state` itself is still passed by reference and only the variant alternatives use shared_ptrs. This allows ongoing compactions to retain ownership of the state independently of the wrapper's lifetime. Fixes #25913 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-17 22:57:05 +05:30
Lakshmi Narayanan Sreethar	35159e5b02	compaction/twcs: pass compaction_strategy_state to internal methods During TWCS compaction, multiple methods independently fetch the compaction_strategy_state using get_state(). This can lead to inconsistencies if the compaction strategy is ALTERed while the compaction is in progress. This patch fixes a part of this issue by passing down the state to the lower level methods as parameters instead of fetching it repeatedly. Refs #25913 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-17 21:26:30 +05:30
Lakshmi Narayanan Sreethar	1cd43bce0e	compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction The method `maybe_wait_for_sstable_count_reduction()`, when retrieving the list of sstables for a possible compaction, holds a reference to the compaction strategy. If the strategy is updated during execution, it can cause a use after free issue. To prevent this, hold a copy of the compaction strategy so it isn’t yanked away during the method’s execution. Refs #26546 Refs #25913 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-10-17 21:26:30 +05:30
Dario Mirovic	1d93f342f9	test: cluster: test replica write timeout This patch introduces test `test_replica_database_apply_timeout`. It tests timeout on database write. The test uses error injection that returns timeout error if the injection `database_apply_force_timeout` is enabled. Refs #18164	2025-10-17 11:52:11 +02:00
Dario Mirovic	ff88fe2d76	database: parameterize apply_counter_update_delay_5s injector value Parameterize `apply_counter_update_delay_5s` injector value. Instead of sleeping 5s when the injection is active, read parameter value that specifies sleep duration. To reflect these changes, it is renamed to `apply_counter_update_delay_ms` and the sleep duration is specified in milliseconds. Refs #18164	2025-10-17 11:52:10 +02:00
Dario Mirovic	7dc0ff2152	test: cluster: test replica exceptions - test rate limit exceptions This patch introduces two tests for `replica::rate_limit_exception`. One test is for write/apply limit, the other one for read/query limit. The tests check the number of rate limit errors reported and the number of cpp exceptions reported. If somebody adds an exception throw on the rate limit paths, this test will catch it and fail. Refs #18164	2025-10-17 10:54:43 +02:00
Botond Dénes	01bcafbe24	Merge 'test: make various improvements in the recovery procedure tests' from Patryk Jędrzejczak This PR contains various improvements in the recovery procedure tests, mostly `test_raft_recovery_user_data`: - decreasing the running time, - some simplifications, - making sure group 0 majority is lost when expected. These are not critical test changes, so no need to backport. Closes scylladb/scylladb#26442 * github.com:scylladb/scylladb: test: assert that majority is lost in some tests of the recovery procedure test: rest_client: add timeout support for read_barrier test: test_raft_recovery_user_data: lose majority when killing one dc test: test_raft_recovery_user_data: shutdown driver sessions test: test_raft_recovery_user_data: use a separate driver connection for the write workload test: test_raft_recovery_user_data: send ALTER KEYSPACE to any node test: test_raft_recovery_user_data: bring failure_detector_timeout_in_ms back to 20 s test: test_raft_recovery_user_data: speed up replace operations test: stop/start servers concurrently in the recovery procedure tests	2025-10-17 10:54:05 +03:00
Piotr Wieczorek	a2b9d7eed5	alternator: Split `update_item_operation::apply` into smaller methods This is a minor refactoring aimed at reducing cognitive complexity of `update_item_operation::apply`. The logic remains unchanged. Closes scylladb/scylladb#25887	2025-10-17 09:51:05 +02:00
Taras Veretilnyk	d9be2ea69b	docs: improve nodetool getendpoints documentation Clarified and expanded the documentation for the nodetool getendpoints command, including detailed explanations of the --key and --key-components options. Added examples demonstrating usage with simple and composite partition keys. Closes scylladb/scylladb#26529	2025-10-17 10:40:54 +03:00
Pawel Pery	10208c83ca	vector_search: fix flaky dns_refresh_aborted test The test proceeds like this: - run a long dns refresh process - request to resolve the hostname with a short abort_source timer - the result should be an empty list, because the request was aborted The test sometimes finishes the long dns refresh before the abort_source fires, and the result list is not empty. There are two issues. First, as.reset() changes the abort_source timeout. The patch adds a get() method to the abort_source_timeout class, so there is no change in the abort_source timeout. Second, a sleep may not be reliable. The patch changes the long sleep inside a dns refresh lambda into condition_variable handling, to properly signal the end of the dns refresh process. Fixes: #26561 Fixes: VECTOR-268 It needs to be backported to 2025.4 Closes scylladb/scylladb#26566	2025-10-17 09:33:17 +02:00
Pavel Emelyanov	7d0722ba5c	code: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:26:50 +03:00
Pavel Emelyanov	a88a36f5b5	code: Switch to seastar API level 9 In the new API the biggest change is to implement the only data_sink_impl::put(span<temporary_buffer>) overload. Encrypted file impl and sstables compress sink use fallback_put() helper that generates a chain of continuations each holding a buffer. The counting_data_sink in transport had mostly been patched to correct implementation by the previous patch, the change here is to replace vector argument with span one. Most other sinks just re-implement their put(vector<temporary_buffer>) overload by iterating over span and non-preemptively grabbing buffers from it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:26:50 +03:00
Pavel Emelyanov	9ece535b5e	transport: Open-code invoke_with_counting into counting_data_sink::put The former helper is implemented like this: future<> invoke_with_counting(fn) { if (not_needed) return fn(); return futurize_invoke(something).then([fn] { return fn() }).finally(something_else); } and all put() overloads are like future<> put(arg) { return invoke_with_counting([this, arg] { return lower_sink.put(arg); }); } The problem is that with seastar API level 9, the put() overload will have to move the passed buffers into stable storage before preempting. In its current implementation, when counting is needed the invoke_with_counting will link lower_sink.put() invocation to the futurize_invoke(something) future. Although "something" is non-preempting, and futurize_invoke() on it returns a ready future, in debug mode ready_future.then() does preempt, and the API level 9 put() contract will be violated. To facilitate the switch to the new API level, this patch rewrites one of the put() overloads to look like future<> put(arg) { if (not_needed) { return lower_sink.put(arg); } something; return lower_sink.put(arg).finally(something_else); } Other put()-s will be removed by the next patch anyway, but this put() will be patched and will call lower_sink.put() without preemption. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:26:30 +03:00
Pavel Emelyanov	068d788084	transport: Don't use scattered_message The API to put scattered_message into output_stream() is gone in seastar API level 9, transport is the only place in Scylla that still uses it. The change is to put the response as a sequence of temporary_buffer-s. This preserves the zero-copy-ness of the reply, but needs few things to care about. First, the response header frame needs to be put as zero-copy buffer too. Despite output_stream() supports semi-mixed mode, where z.c. buffers can follow the buffered writes, it won't apply here. The socket is flushed() in batched mode, so even if the first reply populates the stream with data and flushes it, the next response may happen to start putting the header frame before delayed flush took place. Second, because socket is flushed in batch-flush poller, the temporary buffers that are put into it must hold the foreigh_ptr with the response object. With scattered message this was implemented with the help of a delter that was attached to the message, now the deleter is shared between all buffers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:17:08 +03:00
Pavel Emelyanov	d9808fafdb	utils: Implement memory_data_sink::put(net::packet) It's going to be removed by next-after-next patch, but the next one needs this overload implemented properly, so here it is. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-17 10:17:08 +03:00
Tomasz Grabiec	c4a87453a2	Merge 'Add experimental feature flag for strongly consistent tables and extend keyspace creation syntax to allow specifying consistency mode.' from Gleb Natapov The series adds an experimental flag for strongly consistent tables and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace. Closes scylladb/scylladb#26116 * github.com:scylladb/scylladb: schema: Allow configuring consistency setting for a keyspace db: experimental consistent-tablets option	2025-10-16 21:48:06 +02:00
Tomasz Grabiec	e6c427953e	Merge 'schema_applier: unify handling of token_metadata during schema change' from Marcin Maliszkiewicz This patchset improves the atomicity and clarity of schema application in the presence of token metadata updates during schema changes. The primary focus is to ensure that changes to tablet metadata are applied atomically as part of the schema commit phase, rather than being replicated to all cores afterward, which previously violated atomicity guarantees. Key changes: - Introduced pending_token_metadata to unify handling of new and existing metadata. - Split token metadata replication into prepare and commit steps. - Abstracted schema dependencies in storage_service to support pending schema visibility. - Applied tablet metadata updates atomically within schema commit phase. Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/24414 Closes scylladb/scylladb#25302 * github.com:scylladb/scylladb: db: schema_applier: update tablet metadata atomically db: replica: move tables_metadata locking to commit storage_service: abstract schema dependencies during token metadata update storage_service: split replicate_to_all_cores to steps db: schema_applier: unify token_metadata loading replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge service: fix dependencies during migration_manager startup db: schema_applier: move pending_token_metadata to locator db: always use _tablet_hint as condition for tablet metadata change db: refactor new_token_metadata into pending_token_metadata db: rename new_token_metadata to pending_token_metadata db: schema_applier: move types storage init to merge_types func db: schema_applier: make merge functions non-static members db: remove unused proxy from create_keyspace_metadata	2025-10-16 21:43:49 +02:00
Piotr Wieczorek	caa522a29d	alternator: Add UpdateItem and BatchWriteItem response size metrics This commit bundle introduces metrics on item sizes for Alternator operations. The new metrics are: - `operation_size_kib op=UpdateItem`: Tracks the size of an `UpdateItem` operation. This is calculated as the sum of the existing item's size plus the estimated size of the updated fields. - `operation_size_kib op=BatchWriteItem`: Tracks the total size of items within a `BatchWriteItem` request, aggregated on a per-table basis. If an item already exists, the logged size is the maximum of the old and the new item size. NOTE: Both metrics rely on read-before-write, so if the `alternator_force_read_before_write` option is disabled, these metrics may be incomplete and report inaccurate sizes.	2025-10-16 19:17:27 +02:00
Piotr Wieczorek	5ca42b3baf	alternator: Add PutItem and DeleteItem response size metrics This commit bundle introduces metrics on item sizes for Alternator operations. Specifically, this commit adds `operation_size_kb` histograms for sizes of items created or replaced by the `PutItem` operation, and sizes of items deleted by `DeleteItem` requests. The latter needs a read-before-write, so the metrics may be incomplete if `alternator_force_read_before_write` is disabled.	2025-10-16 19:17:26 +02:00
Piotr Wieczorek	5c72fd9ea3	alternator: Add BatchGetItem response size metrics This commit bundle introduces metrics on item sizes for Alternator operations. Specifically, this commit adds an `operation_size_kb` per-table histogram, which contains item sizes in BatchGetItem requests. The size of a BatchGetItem is the sum of the sizes of all items in the operation grouped by table. In other words, a single BatchGetItem, and BatchWriteItem for that matter, updates the histograms for each table that it has items in.	2025-10-16 19:16:57 +02:00
Piotr Wieczorek	1aa3819b57	alternator: Add GetItem response size metrics This commit bundle introduces metrics on item sizes for Alternator operations. Specifically, this commit adds a per-table `operation_size_kb` histogram, recording the sizes of the items contained in GetItem responses.	2025-10-16 19:04:55 +02:00
Piotr Dulikowski	44257f4961	Merge 'raft topology: disable schema pulls in the Raft-based recovery procedure' from Patryk Jędrzejczak Schema pulls should always be disabled when group 0 is used. However, `migration_manager::disable_schema_pulls()` is never called during a restart with `recovery_leader` set in the Raft-based recovery procedure, which causes schema pulls to be re-enabled on all live nodes (excluding the nodes replacing the dead nodes). Moreover, schema pulls remain enabled on each node until the node is restarted, which could be a very long time. We fix this issue and add a regression test in this PR. Fixes #26569 This is an important bug fix, so it should be backported to all branches with the Raft-based recovery procedure (2025.2 and newer branches). Closes scylladb/scylladb#26572 * github.com:scylladb/scylladb: test: test_raft_recovery_entry_loss: fix the typo in the test case name test: verify that schema pulls are disabled in the Raft-based recovery procedure raft topology: disable schema pulls in the Raft-based recovery procedure	2025-10-16 18:48:38 +02:00
Emil Maskovsky	6769c313c2	raft: small fixes for voters code Minor cleanups and improvements to voter-related code. No backport: cleanup only, no functional changes. Closes scylladb/scylladb#26559	2025-10-16 18:41:08 +02:00
Ernest Zaslavsky	55fb2223b6	s3_client: unify `make_request` implementation Refactor `make_request` to use a single core implementation that handles authentication and issues the HTTP request. All overloads now delegate to this unified method.	2025-10-16 15:51:28 +03:00
Piotr Wieczorek	1559021c4e	alternator/test: Add more context to test_metrics.py asserts This commit adds more information to the assert messages to ease in debugging. The semantics of the asserts remains the same.	2025-10-16 14:41:19 +02:00
Piotr Dulikowski	a8d92f2abd	test: mv: add a test for tablet merge The test test_mv_tablets_replace verifies that merging tablets of both a view and its base table is allowed if rf-rack-valid-keyspaces option is enabled (and it is enabled by default in the test suite).	2025-10-16 14:07:37 +02:00
Piotr Dulikowski	359ed964e3	tablet_allocator, tests: remove allow_tablet_merge_with_views injection The `allow_tablet_merge_with_views` error injection was previously used to allow merging tablets in a table which has materialized views attached to it. Now, the error injection is not needed because this is allowed under the rf-rack-valid condition, which is enabled by default in tests. Remove the error injection from the code and adjust the tests not to use it.	2025-10-16 14:07:37 +02:00
Piotr Dulikowski	189ad96728	tablet_allocator: allow merges in base tables if rf-rack-valid=true Tablet merge of base tables is only safe if there is at most one replica in each rack. For more details on why it is the case please see scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on, this condition is satisfied, so allow it in that case. Fixes: scylladb/scylladb#26273	2025-10-16 13:02:05 +02:00
Gleb Natapov	c255740989	schema: Allow configuring consistency setting for a keyspace We want to add strongly consistent tables as an option. We will have two kinds of strongly consistent tables: globally consistent and locally consistent. The former means that requests from all DCs will be globally linearisable while the latter means that only requests to the same DC will be linearisable. To allow configuring all the possibilities the patch adds a new parameter to a keyspace definition, "consistency", that can be configured to be `eventual`, `global` or `local`. A non-eventual setting is supported for tablet-enabled keyspaces only. Since we want to start with implementing local consistency, configuring global consistency will result in an error for now.	2025-10-16 13:34:49 +03:00
Avi Kivity	8f1de2a7ad	Merge 'test/boost: speed up test test_indexing_paging_and_aggregation by making internal page size configurable' from Nadav Har'El The C++ test `test_indexing_paging_and_aggregation` is one of the slowest tests in test/boost. The reason for its slowness is that it needs a table with more rows than SELECT's "DEFAULT_COUNT_PAGE_SIZE" which was hard-coded to 10,000, so the test needed to write and read tens of thousands of rows, and did it multiple times. It turns out the code actually had an ad-hoc mechanism to override DEFAULT_COUNT_PAGE_SIZE in a C++ test, but both this mechanism and the test itself were so opaque I didn't find it until I fixed it in a different way: What I ended up doing in this pull request is the following (each step in a separate patch): 1. Rewrite this test in Python, in the test/cqlpy framework. This was straightforward, as this test only used CQL and not internal interfaces. The reason why this test wasn't written in Python in the first place is that it was written in 2019, a year before cqlpy existed. I added extensive comments to the new tests, and I finally understood what it was doing :-) 2. I replaced the ad-hoc C++-test-only mechanism of overriding DEFAULT_COUNT_PAGE_SIZE by a bona-fide configuration parameter, `select_internal_page_size`. 3. Finally, the Python test can temporarily lower `select_internal_page_size` and use a table with much fewer rows. After this series, the test `test_indexing_paging_and_aggregation` (which is now in Python instead of C++) takes around half a second, 20 times faster than before. I expect the speedup to be even more dramatic for the debug build. Closes scylladb/scylladb#25368 * github.com:scylladb/scylladb: cql: make SELECT's "internal page size" configurable secondary index: translate test_indexing_paging_and_aggregation to Python	2025-10-16 11:58:13 +03:00
Marcin Maliszkiewicz	47dba4203a	db: schema_applier: update tablet metadata atomically Previously, the mutable_token_metadata_ptr containing tablet changes was replicated to all cores in the post_commit phase, which violated the atomicity guarantee of schema_applier. Now it's incorporated into the per-shard commit phase. It uses the service::schema_getter abstraction introduced in an earlier commit to inject the "pending" schema which is not yet visible to the whole system.	2025-10-16 10:56:50 +02:00
Marcin Maliszkiewicz	e5fffa158f	db: replica: move tables_metadata locking to commit This keeps the locking scope minimal, and since unlocking is done in commit(), locking fits here as well.	2025-10-16 10:56:10 +02:00
Marcin Maliszkiewicz	92cfc3c005	storage_service: abstract schema dependencies during token metadata update The functions prepare_token_metadata_change and commit_token_metadata_change depend on the current schema through calls to the database service. However, during an atomic schema change, the current schema does not yet include the pending changes. Despite that, we want to apply token metadata changes to those pending schema elements as well. Currently, this is achieved by postponing token metadata changes until after the rest of the schema is committed, but this breaks atomicity. To allow incorporating the prepare and commit phases into schema_applier, we need to abstract the schema dependency. This will make it possible to provide, in following commits, an implementation that includes visibility into pending changes, not just the currently active schema.	2025-10-16 10:56:09 +02:00
Botond Dénes	5d70450917	replica/mutation_dump: multi_range_partition_generator: disable garbage-collection Make use of the freshly introduced facility to disable garbage-collection on a per-query basis for range scans. This is needed so partitions that only contain garbage-collectible data are not missing from the partition-list. When using SELECT * FROM MUTATION_FRAGMENTS(), the user is expecting to see all data, even that which is dead and garbage-collectible. Include a test which reproduces the issue.	2025-10-16 10:40:28 +03:00
Botond Dénes	734a9934a6	replica: add tombstone_gc_enabled parameter to mutation query methods Allow disabling tombstone gc on a per-query basis for mutation queries. This is achieved by a bool flag passed to mutation query variants like `query_mutations_on_all_shards()` and `database::mutation_query()`, which is then propagated down to compaction_mutation_state. The future user (in the next patch) is the SELECT * FROM MUTATION_FRAGMENTS() statement which wants to see dead partitions (and rows) when scanning a table. Currently, due to garbage collections, said statement can miss partitions which only contain garbage-collectible tombstones.	2025-10-16 10:38:47 +03:00
Botond Dénes	03118a27b8	mutation/mutation_compactor: remove _can_gc member It is confusing. For query compaction, it is initialized to `always_gc`, for sstable compaction it is initialized to a lambda calling into `can_gc()`. This makes the purpose of this member hard to understand. The real use of this member is to bridge mutation_partition::compact_and_expire() with can_gc(). This patch ditches the member and creates the lambda near the call sites instead, just like the other params to `compact_and_expire()` already are. can_gc() now also respects _tombstone_gc.is_gc_enabled() instead of just blindly returning true when in query mode. With this patch, whether tombstones are collected or not in query mode is now consistent and controlled by the tombstone_gc_state.	2025-10-16 10:38:47 +03:00
Botond Dénes	cb27c3d6e9	tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc Currently, to disable tombstone-gc on-demand completely, one has to pass down a bool flag along with the already required tombstone_gc_state to the code which does the compacting. This is redundant and confusing, the tombstone_gc_state is supposed to encapsulate all tombstone-gc related logic in a transparent way. Add dedicated factory methods for no-gc and gc-all, to allow creating a tombstone_gc_state which transparently gcs for all or no tombstones.	2025-10-16 10:38:47 +03:00
Piotr Wieczorek	15c399ed40	test/alternator: Add more Streams tests for UpdateItem and BatchWriteItem This commit adds tests to `test_streams.py` (i.e. Alternator Streams) checking the following cases: * putting an item with BatchWriteItem shouldn't emit a log if the old item and the new item are identical, * deleting an item with BatchWriteItem shouldn't emit a log if the item doesn't exist, * UpdateItem shouldn't emit a log if the old item and the new item are identical. These cases haven't been tested until this commit. Refs https://github.com/scylladb/scylladb/issues/6918 Closes scylladb/scylladb#26396	2025-10-16 09:34:12 +03:00
Pavel Emelyanov	dbca0b8126	Update seastar submodule * seastar 270476e7...bd74b3fa (20): > memory: Decay large allocation warning threshold > iotune: fix very long warm up duration on systems with high cpu count > Add lib info to one line backtrace > io: Count and export number of AIO retries > io_queue: Destroy priority class data with scheduling group > Merge 'Expell net::packet from output_stream API stack' from Pavel Emelyanov code: Introduce new API level iostream: Remove write()-s of packet/scattered_message from new API level iostream: Convert output_stream::_zc_bufs to vector of buffers code: Add data_sink_impl::put(std::span<temporary_buffer>) method code: Prepare some data_sink_impl::do_put(temporary_buffer) methods iostream: Introduce output_stream::write(span<temporary_buffer>) overload packet: Add packet(std::span<temporary_buffer>) constructor temporary_buffer: Add detach_front() helper > cooking: update gnutls to 3.7.11 > file: Configure DMA alignment from block size > util: adapt to fmt 12.0.0 API changes > Merge 'Internalize reactor::posix_... 
API methods' from Pavel Emelyanov reactor: Deprecate and internalize posix_connect() reactor: Deprecate and internalize posix_listen() > cooking: update fmt to modern version > Merge 'Add prometheus bench, coroutinize prometheus' from Travis Downs prometheus: coroutinize metrics writing prometheus_test: add global label test introduce metrics_perf bench > operator co_await: use rvalue reference > futurize::invoke: use std::invoke > io_tester: Don't skip 0 position in sequential workflows > io_queue: Use own logger for messages > .clangd: tell the LSP about seastar's header style > docker: Update to plucky > Merge 'Convert timer test into seastar test (and a bit more)' from Pavel Emelyanov test: Remove OK macro test: Fix one failure check test: Use boost checkers instead of BUG() macro test: Fix indentation after previous patch test: Convert timer_test into seastar test(s) Closes scylladb/scylladb#26560	2025-10-16 07:55:17 +03:00
Nadav Har'El	921d07a26b	cql: make SELECT's "internal page size" configurable In some uses of SELECT, such as aggregation (sum() et al.), GROUP BY or secondary index, it needs to perform internal scans. It uses an "internal page size" which before this patch was always DEFAULT_COUNT_PAGE_SIZE = 10000. There was an ad-hoc and undocumented way to override this default in C++ tests, using functions in test/lib/select_statement_utils.hh, but it was so non-obvious that the test that most needed to override this default - the very slow test test_indexing_paging_and_aggregation which would have been much faster with a lower setting - never used it. So in this patch we replace the ad-hoc configuration functions by a bona-fide Scylla configuration option named "select_internal_page_size". The few C++ tests that used the old configuration functions were modified to use the new configuration parameters. The slow test test_indexing_paging_and_aggregation still doesn't use the new configuration to become faster - we'll do this in the next patch. Another benefit of having this "internal page size" as a configuration option is that one day a user might realize that the default choice 10,000 is bad for some reason (which I can't envision right now), so having it configurable might come in handy. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-15 18:42:09 +03:00
Patryk Jędrzejczak	71de01cd41	test: test_raft_recovery_entry_loss: fix the typo in the test case name	2025-10-15 16:58:28 +02:00
Patryk Jędrzejczak	da8748e2b1	test: verify that schema pulls are disabled in the Raft-based recovery procedure We do this at the end of `test_raft_recovery_entry_loss`. It's not worth adding a separate regression test, as tests of the recovery procedure are complicated and have a long running time. Also, we choose `test_raft_recovery_entry_loss` out of all tests of the recovery procedure because it does some schema changes.	2025-10-15 16:58:28 +02:00
Patryk Jędrzejczak	ec3a35303d	raft topology: disable schema pulls in the Raft-based recovery procedure Schema pulls should always be disabled when group 0 is used. However, `migration_manager::disable_schema_pulls()` is never called during a restart with `recovery_leader` set in the Raft-based recovery procedure, which causes schema pulls to be re-enabled on all live nodes (excluding the nodes replacing the dead nodes). Moreover, schema pulls remain enabled on each node until the node is restarted, which could be a very long time. The old gossip-based recovery procedure doesn't have this problem because we disable schema pulls after completing the upgrade-to-group0 procedure, which is a part of the old recovery procedure. Fixes #26569	2025-10-15 16:58:24 +02:00
Nadav Har'El	afc5379148	secondary index: translate test_indexing_paging_and_aggregation to Python The Boost test test_indexing_paging_and_aggregation is one of the slowest boost tests. But it's hard to understand why it needs to be so slow - the C++ test code is opaque, and uncommented. The test didn't need to be in C++ - it only uses CQL, not any internal interfaces - but it was written in 2019, a year before test/cqlpy was created. So before we can make this test faster, this patch translates it to Python and adds a significant amount of comments. The new Python test is functionally identical to the old C++ test - it is not (yet) made smaller or faster. The new test takes a whopping 9 seconds to run on my laptop (in dev build mode). We'll reduce that in the next patch. As usual, the cqlpy test can also be tested on Cassandra, and unsurprisingly, it passes. Refs #16134 (which asks to translate more MV and SI tests to Python). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-10-15 17:50:37 +03:00
Piotr Dulikowski	61662bc562	Merge 'alternator: Make CDC use preimages from LWT for Alternator' from Piotr Wieczorek This patch adds a struct `per_request_options` used to communicate between CDC and upper abstraction layers. We need this for better compatibility with DynamoDB Streams in Alternator (https://github.com/scylladb/scylladb/issues/6918) to change operation types of log rows. This patch also adds a way to conditionally forward the item read by LWT to CDC and use it as a preimage. For now, only Alternator uses this feature. The main changes are: - add a struct `cdc::per_request_options` to pass information between CDC and upper abstraction layers, - add the struct to `cas_request::apply`'s signature, - add a possibility to provide a preimage fetched by an upper abstraction layer (to propagate a row read by Alternator to CDC's preimage). This reduces the number of reads-before-write by 1 for some Alternator requests and it is always safe. It's possible to use this feature also in CQL. No backport, it's a feature. Refs https://github.com/scylladb/scylladb/issues/6918 Refs https://github.com/scylladb/scylladb/pull/26121 Closes scylladb/scylladb#26149 * github.com:scylladb/scylladb: alternator, cdc: Re-use the row read by LWT as a CDC preimage cdc: Support prefetched preimages storage: Add cdc options to cas_request::apply cdc, storage: Add a struct to pass per-mutation options to CDC cdc: Move operations enum to the top of the namespace	2025-10-15 12:30:29 +02:00
Piotr Wieczorek	28eda0203e	alternator: Small cleanup, removing unnecessary statements, etc. Tiny code cleanup to improve readability without changing behavior. Changes: - remove unused variables and imports, - remove redundant whitespaces, and a duplicated `public:` access specifier, - use `is_aws` function to check if running in AWS test/alternator/test_metrics.py, - other trivial changes. Closes scylladb/scylladb#26423	2025-10-15 12:05:20 +02:00
Pavel Emelyanov	7bd50437ff	test: Remove unused operator<<(radix_tree_test::test_data) It was used while debugging the test Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26458	2025-10-15 11:57:56 +02:00
Marcin Maliszkiewicz	106fd39c6c	storage_service: split replicate_to_all_cores to steps In later commits schema merge code will use those prepare and commit steps. Rest of the code will continue using replicate_to_all_cores.	2025-10-15 10:54:24 +02:00
Gleb Natapov	eb9112a4a2	db: experimental consistent-tablets option The option will be used to hide the consistent tablets feature until it is ready.	2025-10-15 11:27:10 +03:00
Dawid Mędrek	3aa07d7dfe	test/cluster/mv: Provide reason why test is skipped We point to the issue explaining why the test was disabled and what can be done about it. Closes scylladb/scylladb#26541	2025-10-15 09:22:39 +02:00
Jenkins Promoter	d731d68e66	Update pgo profiles - aarch64	2025-10-15 05:21:46 +03:00
Jenkins Promoter	b6237d7dd4	Update pgo profiles - x86_64	2025-10-15 04:54:54 +03:00
Piotr Dulikowski	aed166814e	test: cluster: skip flaky test_raft_recovery_entry_lose test Unfortunately, the test became flaky and is blocking promotion. The cause of the flakiness is not known yet, but it is unrelated to other items currently queued on the `next` branch. The investigation continues on GitHub issue scylladb/scylladb#26534. In the meantime, skip the test to unblock other work. Refs: scylladb/scylladb#26534 Closes scylladb/scylladb#26549	2025-10-14 19:35:44 +02:00
Botond Dénes	d0844abb5c	mutation/mutation_compactor: compaction:stats: split partitions Into total and live. Currently only live (those with live content) are counted. Report live and total separately, just like we do for rows. This allows deducing the count of dead partitions as well, which is particularly interesting for scans. Closes scylladb/scylladb#26548	2025-10-14 19:08:47 +03:00
Marcin Maliszkiewicz	b0f11b6d91	db: schema_applier: unify token_metadata loading Putting it into a single place gives more clarity on how _pending_token_metadata is made and avoids extra per shard copy when tablets change.	2025-10-14 10:56:37 +02:00
Marcin Maliszkiewicz	d67632bfe2	replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge This copy is now used during the whole duration of schema merge. If it changes due to tablet_hint then it's replicated to all shards as before.	2025-10-14 10:56:36 +02:00
Marcin Maliszkiewicz	389afcdeb6	service: fix dependencies during migration_manager startup We need to avoid reloading schema early as it goes via schema_applier which internally depends on storage_service and on distributed_loader initializing all keyspaces. Simply moving migration manager startup later in the code is not easy as some services depend on it being initialized, so we just enable those feature listeners a bit later.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	46bff28a38	db: schema_applier: move pending_token_metadata to locator It never belonged to tables and views and its placement stems from the location of the _tablet_hint handling code. In the following commits we'll reference it in storage_service.cc.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	1a539f7151	db: always use _tablet_hint as condition for tablet metadata change When all schema_applier code uses this condition it's easier to grep than when we use different, derived conditions.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	c112916215	db: refactor new_token_metadata into pending_token_metadata It prepares pending_token_metadata to handle both new metadata and a copy of existing metadata for consistent usage in a later commit. It also adds a shared_token_metadata getter so that we don't need to get it from db.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	668231d97c	db: rename new_token_metadata to pending_token_metadata Part of the refactor done in following commit. Separated for easier review.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	0c4c995c0d	db: schema_applier: move types storage init to merge_types func Merge_types function groups operation related to types, types storage fits this group.	2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz	794d68e44c	db: schema_applier: make merge functions non-static members This is mechanical change which simplifies the code. Schema_applier class is an object which holds schema merging intermediate state so it's fine that all schema merging functions have access to this state.	2025-10-14 10:56:25 +02:00
Marcin Maliszkiewicz	209563f478	db: remove unused proxy from create_keyspace_metadata	2025-10-14 10:56:25 +02:00
Ernest Zaslavsky	413739824f	s3_client: track memory starvation in background filling fiber Introduce a counter metric to monitor instances where the background filling fiber is blocked due to insufficient memory in the S3 client. Closes scylladb/scylladb#26466	2025-10-14 11:22:54 +03:00
Piotr Wieczorek	5ff2d2d6ab	alternator, cdc: Re-use the row read by LWT as a CDC preimage Propagates the row read by CAS to CDC's preimage to save one read-before-write. As of now, a preimage in Alternator Streams always contains the entire item (see previous_item_read_command in executor.cc), so the resulting preimage should stay the same. In other words, this change should be transparent to users.	2025-10-14 07:52:40 +02:00
Piotr Wieczorek	d4581cc442	cdc: Support prefetched preimages This commit adds support to pass a preimage selected by an upper layer to CDC. The responsibility for the correctness of the preimage (i.e. the selected columns, whether it's up to date, etc.) lies with the caller. It may be improved in the future by validating the preimage, e.g. by "slicing" the received preimage to the necessary columns. The motivation behind this change was to reduce the number of read-before-writes and avoid reading the row twice for Alternator Streams in an increased compatibility mode with DynamoDB. This is to be added in a following commit. Until then, this commit should be a no-op.	2025-10-14 07:29:07 +02:00
Łukasz Paszkowski	125bf391a7	utils/directories: ignore files when retrieving stats fails During Scylla startup, directories are created and verified in `directories::do_verify_owner_and_mode()`. It is possible that while retrieving file stats, a file might be removed, leading to Scylla failing to boot. This is particularly visible in `storage/test_out_of_space.py` tests, which use FUSE to mount size-limited volumes. When a file that is open by another process is removed, FUSE renames it to `.fuse_hidden*`. In `directories::do_verify_owner_and_mode()`, the code performs a `scan_dir` to list files and retrieves their stats to verify type, mode, and ownership. If a file is removed while retrieving its stats, we see errors such as: ``` Failed to get /scylladir/testlog/x86_64/dev/volumes/e0125c60-1e63-4330-bf6f-c0ea3e466919/scylla-0/hints/1/.fuse_hidden0000001800000005 ``` This change makes `do_verify_owner_and_mode()` ignore files when retrieving stats fails, avoiding spurious errors during verification. Refs: https://github.com/scylladb/scylladb/issues/26314 Closes scylladb/scylladb#26535	2025-10-13 20:41:25 +03:00
Botond Dénes	46af0127e9	test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql Comprehensive test for the new CQL input format.	2025-10-13 18:10:40 +03:00
Botond Dénes	180bf647f7	replica/mutation_dump: add support for virtual tables Not supported currently as such tables have no memtables, cache or sstables, so any select * from mutation_fragments() query will return an empty result. Detect virtual tables and return their content with a distinct 'virtual-table' mutation_source designation.	2025-10-13 18:10:40 +03:00
Botond Dénes	64c32ca501	tools/scylla-sstable: print_query_results_json(): handle empty value buffer Print null, similar to disengaged optional value.	2025-10-13 18:10:40 +03:00
Botond Dénes	e404dd7cf0	tools/scylla-sstable: add cql support to write operation Add new --input-format command line argument. Possible values are json (current) and cql (new -- added in this patch). When --input-format=cql (new default), the input-file is expected to contain CQL INSERT, UPDATE or DELETE statements, separated by semicolon. The input file can contain any number of statements, in any order. The statements will be executed and applied to a memtable, which is then flushed to create an sstable with the content generated from the statement. The memtable's size is capped at 1MiB, if it reaches this size, it is flushed and recreated. Consequently, multiple sstables can be created from a single scylla-sstable write --input-format=cql operation.	2025-10-13 18:10:40 +03:00
Dawid Mędrek	7d017748ab	db/commitlog: Extend segment truncation error messages We include more relevant information for debugging purposes: the remaining bytes and the size. It might be useful to determine where exactly an error occurred and help reason about it. Closes scylladb/scylladb#26486	2025-10-13 17:42:31 +03:00
Nadav Har'El	06108ea020	test/alternator: a small cleanup for a test in test_streams.py This patch makes three small mostly-cosmetic improvements to a test in test/alternator/test_streams.py: 1. The test is renamed "test_streams_deleteitem_old_image_no_ck" to emphasize its focus on the combination of deleteitem, old image, and no ck. The "putitem" we had in the name was not relevant, and the "old_image" was missing and important. 2. Moreover, using PutItem in this test just to set up the test scenario mixed the bug which the test tries to reproduce with a different only-recently-fixed bug (that PutItem also generated a spurious "REMOVE" event). So I changed the use of PutItem by using UpdateItem, to make this test independent of the other bug. Test independence is important because it allows us - if we want - to backport a fix for just one bug independently of the fix to the other bug. 3. Also improved the comment in front of the test to mention where we already tested the with-ck case, and also to mention issue 26382 which this test reproduces (the xfail line also mentions it, but the xfail line will be removed when the bug is fixed - but the mention in the comment will remain - and should remain). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26526	2025-10-13 17:42:31 +03:00
Piotr Dulikowski	1cf944577b	Merge 'Fix vector store client flaky test' from Karol Nowacki This series of patches improves test vector_store_client_test stability. The primary issue with flaky connections was discovered while working on PR #26308. Key Changes: - Fixes premature connection closures in the mock server: The mock HTTP server was not consuming request payloads, causing it to close connections immediately after a response. Subsequent tests attempting to reuse these closed connections would fail intermittently, leading to flakiness. The server has been updated to handle payloads correctly. - Removes a retry workaround: With the underlying connection issue resolved, the retry logic in the vector_store_client_test_ann_request test is no longer needed and has been removed. - Mocks the DNS resolver in tests: The vector_store_client_uri_update_to_invalid test has been corrected to mock DNS lookups, preventing it from making real network requests. - Corrects request timeout handling: A bug has been fixed where the request timeout was not being reset between consecutive requests. - Unifies test timeouts: Timeouts have been standardized across the test suite for consistency. Fixes: #26468 It is recommended to backport this series to the 2025.4 branch. Since these changes only affect test code and do not alter any production logic, the backport is safe. Addressing this test flakiness will improve the stability of the CI pipeline and prevent it from blocking unrelated patches. Closes scylladb/scylladb#26374 * github.com:scylladb/scylladb: vector_search: Unify test timeouts vector_search: Fix missing timeout reset vector_search: Refactor ANN request test vector_search: Fix flaky connection in tests vector_search: Fix flaky test by mocking DNS queries	2025-10-13 17:42:31 +03:00
Botond Dénes	f4f99ece7d	tools/scylla-sstable: write_operation(): fix indentation Left broken in previous patch for easier review.	2025-10-13 17:35:50 +03:00
Botond Dénes	023aed0813	tools/scylla-sstable: write_operation(): prepare for a new input-format In the next patches a new input-format will be introduced, which can produce multiple output format. To prepare for this, consolidate the code which produces an sstable into a reusable lambda function. Moves code around, reduces churn in next patches. Indentation is left broken for easier review.	2025-10-13 17:35:50 +03:00
Botond Dénes	f03cec9574	tools/scylla-sstable: generalize query_operation_validate_query() Make error messages more generic, so they are not specific to select. Make it a template on the type of cql statement for the final check. To avoid templating the whole thing, the function is split into two. Parametrize the name of the allowed statement types in said check. Prepares the method to be shared between query operation and write operation (future change). While at it, also change query param type to std::string_view to avoid some copies.	2025-10-13 17:35:50 +03:00
Botond Dénes	61e70d1b11	tools/scylla-sstable: move query_operation_validate_query() Move it in the source code above write_operation(), as said operation will soon want to use this method too.	2025-10-13 17:35:50 +03:00
Botond Dénes	dfe4cfc0e2	tools/scylla-sstable: extract schema transformation from query operation This transformation enables an existing schema to be created as a table in cql_test_env, to be used to read/write sstables belonging to said schema. Extract this into a method, to be shared by a future operation which will also want to do this.	2025-10-13 17:35:50 +03:00
Botond Dénes	970d4f0dcd	replica/table: add virtual write hook to the other apply() overload too Currently only one has it, which means virtual table can potentially miss some writes.	2025-10-13 17:35:50 +03:00
Calle Wilund	68c109c2df	docs::dev::object_storage: Add some initial info on GS storage Augments the object storage document with config options etc for using GS instead of S3. TODO: add proper gsutil command line examples for manual managing of GCP storage.	2025-10-13 08:53:28 +00:00
Calle Wilund	54a7d7bd47	docs/dev: Add mention of (nested) docker usage in testing.md As one (of more?) places to document the fact that we partially rely on resolving docker images for CI.	2025-10-13 08:53:28 +00:00
Calle Wilund	403247243b	sstables::object_storage_client: Forward memory limit semaphore to GS instance Enforces object storage limits to the GS implementation as well.	2025-10-13 08:53:28 +00:00
Calle Wilund	01f4dfed84	utils::gcp::object_storage: Add optional memory limits to up/download Adds optional memory semaphore to limit the mem buffer usage in sink/source. Note that we don't bookkeep exact, to avoid deadlock issues in higher layer. In upload, we overlease on first buffer put to ensure we can at least fill the desired 8M of buffers. We try to adjust when going over, but if we fail, we fail, but at least will initiate upload -> soon release memory. On next put, we try to grab multiples of 8M again, and so forth. Thus potentially causing waiting for resources, without ending up not uploading at least one active sink. For download (source), we try to get lease for as much as we want to read, but if we fail, we adjust this down to 256k and download anyway. Since this will typically be released immediately, we at least don't overrun for long, and again, avoid fully stopping, throttling rate instead.	2025-10-13 08:53:27 +00:00
Calle Wilund	5e4e5b1f4a	sstables::object_storage_client: Add multi-upload support for GS Uses file splitting + object merge to facilitate parallel, resumable upload of files with known size.	2025-10-13 08:53:27 +00:00
Calle Wilund	bd1304972c	utils::gcp::storage: Add merge objects operation Allows merging 1-32 smaller files into a destination.	2025-10-13 08:53:27 +00:00
Calle Wilund	e940a1362a	test_backup/test_basic: Make tests multiplex both s3 and gs backends Change fixture used + property/config access to allow running with arbitrary bucket-object based backend.	2025-10-13 08:53:27 +00:00
Calle Wilund	80c02603a8	test::cluster::conftest: Add support for multiple object storage backends Adds an `object_storage` fixture with parameterization to iterate through 's3' and 'gs' backends. For the former, will instantiate the `s3_server` backend (modified to better handle being actual temp, function level server). For the latter, will either give back a frontend if env vars indicating "real" GS buckets and endpoints are used, or launch a docker image for fake-gcs-server on a free port. Please read the comment in the code about the management of server output, as this is less than optimal atm, but I can't figure out the issue with it. All returned fixture objects will respond to `address`, `bucket` properties, as well as be able to create endpoint config objects for scylla.	2025-10-13 08:53:27 +00:00
Calle Wilund	da36a9d78e	boost::gcs_storage_test: reindent Remove redundant indentation/moosewings.	2025-10-13 08:53:27 +00:00
Calle Wilund	1356f60c69	boost::gcs_storage_test: Convert to use fixture Instead of test-local server/endpoint etc, use the gcs test fixture, with the added bonus of a suite-shared one for additional speed.	2025-10-13 08:53:27 +00:00
Calle Wilund	7c6b4bed97	tests::boost: Add GS object storage cases to mirror S3 ones I.e. run same remote storage backend unit tests for GS backend	2025-10-13 08:53:27 +00:00
Calle Wilund	af2616d750	tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS A test fixture object for either real google storage or fake-gcs-server using test local podman. Copied/transposed from gcp_object_storage_test.	2025-10-13 08:53:26 +00:00
Calle Wilund	a33fdd0b62	tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env Move some code to compilation unit + add some overloads. Add a RAII-object for temporary setting current process env as well.	2025-10-13 08:53:26 +00:00
Calle Wilund	527a6df460	sstables::object_storage_client: Add google storage implementation Allowing GS config to be honoured.	2025-10-13 08:53:26 +00:00
Calle Wilund	956d26aa34	test_services: Allow testing with GS object storage parameters	2025-10-13 08:53:26 +00:00
Calle Wilund	da7099a56e	utils::gcp::gcp_credentials: Add option to create uninitialized credentials To avoid having to async wait for creating credentials, allow lazy init (in actual token renew) of credentials. This is not super pleasant, since it means any error will be late, but it is required more or less for the code paths into which we intend to place this.	2025-10-13 08:53:26 +00:00
Calle Wilund	fd13ffd95d	utils::gcp::object_storage: Make create_download_source return seekable_data_source Since, given the nature of object storage API:s, it is no more complicated to provide a reasonable implementation of a seekable, limited, interface, give this back, which in turn means upper layers can provide easy read-only file interfaces. Hint hint.	2025-10-13 08:53:26 +00:00
Calle Wilund	cc899d4a86	utils::gcp::object_storage: Add defensive copies of string_view params	2025-10-13 08:53:26 +00:00
Calle Wilund	2093e7457d	utils::gcp::object_storage: Add missing retry backoff increase Ensure we increase wait time on subsequent backoffs	2025-10-13 08:53:26 +00:00
Calle Wilund	74578aaae2	utils::gcp::object_storage: Add timestamp to object listing	2025-10-13 08:53:26 +00:00
Calle Wilund	d0fe001518	utils::gcp::object_storage: Add paging support to list_objects Allowing listing N entries at a time, with continuation.	2025-10-13 08:53:26 +00:00
Calle Wilund	144b550e4f	object_storage_client: Add object_name wrapper type Remaining S3-centric, but abstracting the object name to possible implementations not quite formatted the same.	2025-10-13 08:53:25 +00:00
Calle Wilund	9dde8806dd	utils::gcp::object_storage: Add optional abort_source Add forwarded abort_source to lengthy ops	2025-10-13 08:53:25 +00:00
Calle Wilund	926177dfb4	utils::rest::client: Add abort_source support Add optional forwarded abort_source	2025-10-13 08:53:25 +00:00
Calle Wilund	5d4558df3b	sstables: Use object_storage_client for remote storage Replaces direct s3 interfaces with the abstraction layer, and open for having multiple implementations/backends	2025-10-13 08:53:25 +00:00
Calle Wilund	25e932230a	sstables::object_storage_client: Add abstraction layer for OS clients (s3 initial) Adds abstraction layer for creating the various ops and IO objects for storing sstable data on cloud storage	2025-10-13 08:53:25 +00:00
Calle Wilund	a2cd061a5d	s3::upload_progress: Promote to general util type	2025-10-13 08:53:25 +00:00
Calle Wilund	868d057aae	storage_options: Abstract s3 to "object_storage" and add gs as option Since both are bucket+prefix oriented, we can basically use same options for both, only distinguished by actual protocol. Abstract the types and the helper parse etc routines to handle either. Use "gs" as term for gcs (Google Cloud Storage), since this is the URL scheme used.	2025-10-13 08:53:25 +00:00
Calle Wilund	ac438c61e6	sstables::file_io_extension: Change "creator" callback to just data_source Because the concept of pushing reading range does not work for the wrapping we do (i.e. encryption), there is no point having it here. We need to do said range handling higher up. Also, must allow multi-layered wrapping.	2025-10-13 08:53:25 +00:00
Calle Wilund	2e49270da5	utils::io-wrappers: Add ranged data_source Provides a data_source wrapper to read a specific range of a source stream.	2025-10-13 08:53:25 +00:00
Calle Wilund	91c0467282	utils::io-wrappers: Add file wrapper type for seekable_source Provides a read-only file interface for a seekable_source object.	2025-10-13 08:53:25 +00:00
Calle Wilund	e62a18304e	utils::seekable_source: Add a seekable IO source type Extension of data_source, with the ability to a.) Seek in any direction, i.e. move backwards. Thus not pure stream. b.) Read a limited number of bytes. The very transparent reason for the interface is to have a base abstraction for providing a read-only file layer for networked resources.	2025-10-13 08:53:24 +00:00
Calle Wilund	14dada350a	object_storage_endpoint_param: Add gs storage as option	2025-10-13 08:53:24 +00:00
Calle Wilund	78d9dda060	config: break out object_storage_endpoint_param preparing for multi storage Moves the config wrapper to own file (to reduce recompilation for modifying) and refactors to handle extending this parameter to non-s3 endpoint configs.	2025-10-13 08:53:24 +00:00
Avi Kivity	c783f0e539	Merge 'index: Prefer const qualifiers wherever possible' from Dawid Mędrek We add missing `const`-qualifiers wherever possible in the module. A few smaller changes were included as a bonus. Backport: not needed. This is a cleanup. Closes scylladb/scylladb#26485 * github.com:scylladb/scylladb: index/secondary_index_manager: Take std::span instead of std::vector index/secondary_index_manager: Add missing const qualifier index/vector_index: Add missing const qualifiers cql3/statements/index_prop_defs.cc: Remove unused include cql3/statements/index_prop_defs.cc: Mark function as TU-local cql3/statements/index_prop_defs: Mark methods as const-qualified	2025-10-12 19:47:53 +03:00
Michał Chojnowski	93dac3d773	sstables/compressor: relax a large allocation warning in ZSTD_CDict creation ZSTD_CDict needs a big contiguous allocation and there's no way around that. The only thing to do is relax the warning appropriately. Closes scylladb/scylladb#25393	2025-10-12 18:21:11 +03:00
Botond Dénes	24c6476f73	mutation/mutation_compactor: add tombstone_gc_state to query ctor So tombstones can be purged correctly based on the tombstone gc mode. Currently if repair-mode is used, tombstones are not purged at all, which can lead to purged tombstones being re-replicated to replicas which already purged them via read-repair. This is not a correctness problem, tombstones are not included in data query result or digest, these purgeable tombstones are only a nuisance for read repair, where they can create extra differences between replicas. Note that for the read repair to trigger, some difference other than in purgeable tombstones has to exist, because as mentioned above, these are not included in digests. Fixes: scylladb/scylladb#24332 Closes scylladb/scylladb#26351	2025-10-12 17:48:15 +03:00
Botond Dénes	d9c3772e20	service/storage_proxy: send batches with CL=EACH_QUORUM Batches that fail on the initial send are retried later, until they succeed. These retries happen with CL=ALL, regardless of what the original CL of the batch was. This is unnecessarily strict. We tried to follow Cassandra here, but Cassandra has a big caveat in their use of CL=ALL for batches. They accept saving just a hint for any/all of the endpoints, so a batch which was just logged in hints is good enough for them. We do not plan on replicating this usage of hints at this time, so as a middle ground, the CL is changed to EACH_QUORUM. Fixes: scylladb/scylladb#25432 Closes scylladb/scylladb#26304	2025-10-12 17:18:41 +03:00
Michał Chojnowski	7c6e84e2ec	test/boost/sstable_compressor_factory_test: fix thread-unsafe usage of Boost.Test It turns out that Boost assertions are thread-unsafe, (and can't be used from multiple threads concurrently). This causes the test to fail with cryptic log corruptions sometimes. Fix that by switching to thread-safe checks. Fixes scylladb/scylladb#24982 Closes scylladb/scylladb#26472	2025-10-12 17:16:51 +03:00
Piotr Wieczorek	8cd9f5d271	test/alternator: Add a Streams test reproducing #26382 This commit adds a test that reproduces an issue, wherein OldImage isn't included in the REMOVE events produced by Alternator Streams. Refs https://github.com/scylladb/scylladb/issues/26382 Closes scylladb/scylladb#26383	2025-10-12 11:09:57 +03:00
Piotr Wieczorek	a55c5e9ec7	alternator: Correct RCU undercount in BatchGetItem The `describe_multi_item` function treated the last reference-captured argument as the number of used RCU half units. The caller `batch_get_item`, however, expected this parameter to hold an item size. This RCU value was then passed to `rcu_consumed_capacity_counter::get_half_units`, treating the already-calculated RCU integer as if it were a size in bytes. This caused a second conversion that undercounted the true RCU. During conversion, the number of bytes is divided by `RCU_BLOCK_SIZE_LENGTH` (=4KB), so the double conversion divided the number of bytes by 16 MB. The fix removes the second conversion in `describe_multi_item` and changes the API of `describe_multi_item`. Fixes: https://github.com/scylladb/scylladb/pull/25847 Closes scylladb/scylladb#25842	2025-10-12 10:42:32 +03:00
Karol Nowacki	62deea62a4	vector_search: Unify test timeouts The test previously used separate timeouts for requests (5s) and the overall test case (10s). This change unifies both timeouts to 10 seconds.	2025-10-10 16:49:06 +02:00
Karol Nowacki	0de1fb8706	vector_search: Fix missing timeout reset The `vector_store_client_test` could be flaky because the request timeout was not consistently reset in all code paths. This could lead to a timeout from a previous operation firing prematurely and failing the test. The fix ensures `abort_source_timeout` is reset before each request. The implementation is also simplified by changing `abort_source_timeout::reset` so that it combines the reset and arm operations into a single invocation.	2025-10-10 16:48:54 +02:00
Karol Nowacki	d99a4c3bad	vector_search: Refactor ANN request test Refactor the `vector_store_client_test_ann_request` test to use the `vs_mock_server` class, unifying the structure of the test cases. This change also removes retry logic that waited for the server to be ready. This is no longer necessary because the handler now exists for all index names and consumes the entire request payload, preventing connection closures. Previously, the server did not handle requests for unconfigured indexes, which caused the connection to close. This could lead to a race condition where the client would attempt to reuse a closed connection.	2025-10-10 16:48:20 +02:00
Karol Nowacki	2eb752e582	vector_search: Fix flaky connection in tests The vector store mock server was not reading the ANN request body, which could cause it to prematurely close the connection. This could lead to a race condition where the client attempts to reuse a closed connection from its pool, resulting in a flaky test. The fix is to always read the request body in the mock server.	2025-10-10 16:48:09 +02:00
Karol Nowacki	ac5e9c34b6	vector_search: Fix flaky test by mocking DNS queries The `vector_store_client_uri_update_to_invalid` test was flaky because it performed real DNS lookups, making it dependent on the network environment. This commit replaces the live DNS queries with a mock to make the test hermetic and prevent intermittent failures. `vector_search_metrics_test` test did not call configure{vs}, as a consequence the test did real DNS queries, which made the test flaky. The refreshes counter increment has been moved before the call to the resolver. In tests, the resolver is mocked leading to lack of increments in production code. Without this change, there is no way to test DNS counter increments. The change also simplifies the test making it more readable.	2025-10-10 16:47:03 +02:00
Patryk Jędrzejczak	5f68b9dc6b	test: test_raft_no_quorum: test_can_restart: deflake the read barrier call Expecting the group 0 read barrier to succeed with a timeout of 1s, just after restarting 3 out of 5 voters, turned out to be flaky. In some unlikely scenarios, such as multiple vote splits, the Raft leader election could finish after the read barrier times out. To deflake the test, we increase the timeout of Raft operations back to 300s for read barriers we expect to succeed. Fixes #26457 Closes scylladb/scylladb#26489	2025-10-10 15:22:39 +03:00
Asias He	13dd88b010	repair: Rename incremental mode name Using the name regular as the incremental mode could be confusing, since regular might be interpreted as the non-incremental repair. It is better to use incremental directly. Before: - regular (standard incremental repair) - full (full incremental repair) - disabled (incremental repair disabled) After: - incremental (standard incremental repair) - full (full incremental repair) - disabled (incremental repair disabled) Fixes #26503 Closes scylladb/scylladb#26504	2025-10-10 15:21:54 +03:00
Michał Chojnowski	85fd4d23fa	test_sstable_compression_dictionaries_basic: reconnect robustly after node reboots Using `driver_connect()` after a cluster restart isn't enough to ensure full CQL availability, but the test assumes that it is. Fix that by making the test wait for CQL availability via `get_ready_cql()`. Also, replace some manual usages of wait_for_cql_and_get_hosts with `get_ready_cql()` too. Fixes scylladb/scylladb#25362 Closes scylladb/scylladb#25366	2025-10-10 14:27:02 +03:00
Piotr Dulikowski	0b800aab17	Merge 'db/view/view_building_worker: move `discover_existing_staging_sstables()` to the foreground' from Michał Jadwiszczak db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground This patch moves `discover_existing_staging_sstables()` to be executed from main level, instead of running it on the background fiber. This method needs to be run only once during the startup to collect existing staging sstables, so there is no need to do it in the background. This change will increase debuggability of any further issues related to it (like https://github.com/scylladb/scylladb/issues/26403). Fixes https://github.com/scylladb/scylladb/issues/26417 The patch should be backported to 2025.4 Closes scylladb/scylladb#26446 * github.com:scylladb/scylladb: db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground db/view/view_building_worker: futurize and rename `start_background_fibers()`	2025-10-09 18:24:50 +02:00
Michał Jadwiszczak	8d0d53016c	db/view/view_building_worker: update state again if some batch was finished during the update There was a race between loop in `view_building_worker::run_view_building_state_observer()` and a moment when a batch was finishing its work (`.finally()` callback in `view_building_worker::batch::start()`). State observer waits on `_vb_state_machine.event` CV and when it's awoken, it takes group0 read apply mutex and updates its state. While updating the state, the observer looks at `batch::state` field and reacts to it accordingly. On the other hand, when a batch finishes its work, it sets `state` field to `batch_state::finished` and does a broadcast on `_vb_state_machine.event` CV. So if the batch will execute the callback in `.finally()` while the observer is updating its state, the observer may miss the event on the CV and it will never notice that the batch was finished. This patch fixes this by adding a `some_batch_finished` flag. Even if the worker won't see an event on the CV, it will notice that the flag was set and it will do next iteration. Fixes scylladb/scylladb#26204 Closes scylladb/scylladb#26289	2025-10-09 18:17:22 +02:00
Avi Kivity	55d4d39ae3	Merge 'transport: service_level_controller: create and use driver service level' from Andrzej Jackowski This is a cherry-pick of https://github.com/scylladb/scylladb/pull/25412 commits, as the changes were reverted in 364316dd2f2212bbbb446eaa2a4b0bd53d125ad5 due to https://github.com/scylladb/scylladb/issues/26163. The underlying problem (https://github.com/scylladb/scylladb/issues/26190) was fixed in seastar (https://github.com/scylladb/seastar/pull/2994), so https://github.com/scylladb/scylladb/pull/25412 commits are restored without changes (only rebase conflicts were resolved). === This patch series: - Increases the number of allowed scheduling groups to allow creation of `sl:driver` - Implements `create_driver_service_level` that creates `sl:driver` with shares=200 if it wasn't already created - Implements creation of `sl:driver` for new systems and tests in `raft_initialize_discovery_leader` - Modifies `topology_coordinator` to create `sl:driver` after upgrades. - Implements using `sl:driver` for new connections in `transport/server` - Adds to `transport/server` recognition of driver's control connections and forcing them to keep using `sl:driver`. - Adds tests to verify the new functionality - Modifies existing tests to let them pass after `sl:driver` is added - Modifies the documentation to contain new `sl:driver` The changes were evaluated by a test with the following scenario ([test_connections-sl-driver.py](https://github.com/user-attachments/files/22021273/test_connections-sl-driver.py)): - Start ScyllaDB with one node - Create 1000 keyspaces, 1 table in each keyspace - Start `cassandra-stress` (`-rate threads=50 -mode native cql3`) - Run connection storm with 1000 sessions (100 python processes, 10 sessions each) The maximum latency during connection storm dropped from 224.94ms to 41.43ms (those numbers are averages from 20 test executions, where max latency was in [140ms, 361ms] before change and [31.4ms, 61.5ms] after).
The snippet of cassandra-stress output from the moment of connection storm: Before: ``` type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb ... total, 789206, 85887, 85887, 85887, 0.6, 0.3, 2.0, 2.0, 2.5, 5.0, 9.0, 0.09679, 0, 0, 0, 0, 0, 0 total, 909322, 120116, 120116, 120116, 0.4, 0.2, 1.9, 2.0, 2.1, 3.1, 10.0, 0.09053, 0, 0, 0, 0, 0, 0 total, 964392, 55070, 55070, 55070, 0.9, 0.4, 2.0, 4.5, 7.7, 18.9, 11.0, 0.09203, 0, 0, 0, 0, 0, 0 total, 975705, 11313, 11313, 11313, 4.4, 3.5, 6.5, 24.5, 82.7, 83.0, 12.0, 0.11713, 0, 0, 0, 0, 0, 0 total, 987548, 11843, 11843, 11843, 4.2, 3.5, 6.5, 33.7, 48.6, 51.5, 13.0, 0.13366, 0, 0, 0, 0, 0, 0 total, 995422, 7874, 7874, 7874, 6.3, 4.0, 7.7, 85.6, 112.9, 113.5, 14.0, 0.14753, 0, 0, 0, 0, 0, 0 total, 1007228, 11806, 11806, 11806, 4.3, 3.5, 6.5, 29.1, 43.8, 87.1, 15.0, 0.15598, 0, 0, 0, 0, 0, 0 total, 1012840, 5612, 5612, 5612, 8.2, 5.0, 11.5, 121.8, 166.6, 170.1, 16.0, 0.16535, 0, 0, 0, 0, 0, 0 total, 1016186, 3346, 3346, 3346, 13.4, 7.4, 20.1, 204.9, 207.6, 210.4, 17.0, 0.17405, 0, 0, 0, 0, 0, 0 total, 1025462, 9276, 9276, 9276, 6.3, 3.9, 9.6, 74.6, 206.8, 210.0, 18.0, 0.17800, 0, 0, 0, 0, 0, 0 total, 1035979, 10517, 10517, 10517, 4.8, 3.5, 6.7, 38.5, 82.6, 83.0, 19.0, 0.18120, 0, 0, 0, 0, 0, 0 total, 1047488, 11509, 11509, 11509, 4.3, 3.5, 6.0, 32.6, 72.3, 74.0, 20.0, 0.18334, 0, 0, 0, 0, 0, 0 total, 1077456, 29968, 29968, 29968, 1.7, 1.6, 2.9, 3.6, 7.0, 8.2, 21.0, 0.17943, 0, 0, 0, 0, 0, 0 total, 1105490, 28034, 28034, 28034, 1.8, 1.8, 3.5, 4.6, 5.3, 13.8, 22.0, 0.17609, 0, 0, 0, 0, 0, 0 total, 1132221, 26731, 26731, 26731, 1.9, 1.8, 3.8, 5.2, 8.4, 11.1, 23.0, 0.17314, 0, 0, 0, 0, 0, 0 total, 1162149, 29928, 29928, 29928, 1.7, 1.7, 3.0, 4.5, 8.0, 9.1, 24.0, 0.16950, 0, 0, 0, 0, 0, 0 ... ``` After: ``` type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb ... 
total, 822863, 94379, 94379, 94379, 0.5, 0.3, 2.0, 2.0, 2.1, 3.7, 9.0, 0.06669, 0, 0, 0, 0, 0, 0 total, 937337, 114474, 114474, 114474, 0.4, 0.2, 2.0, 2.0, 2.1, 3.4, 10.0, 0.06301, 0, 0, 0, 0, 0, 0 total, 986630, 49293, 49293, 49293, 1.0, 1.0, 2.0, 2.1, 17.9, 19.0, 11.0, 0.07318, 0, 0, 0, 0, 0, 0 total, 1026734, 40104, 40104, 40104, 1.2, 1.0, 2.0, 2.2, 6.3, 7.1, 12.0, 0.08410, 0, 0, 0, 0, 0, 0 total, 1066124, 39390, 39390, 39390, 1.3, 1.0, 2.0, 2.2, 2.6, 3.4, 13.0, 0.09108, 0, 0, 0, 0, 0, 0 total, 1103082, 36958, 36958, 36958, 1.3, 1.1, 2.1, 2.5, 3.1, 4.2, 14.0, 0.09643, 0, 0, 0, 0, 0, 0 total, 1141987, 38905, 38905, 38905, 1.3, 1.0, 2.0, 2.4, 11.4, 12.7, 15.0, 0.09894, 0, 0, 0, 0, 0, 0 total, 1180023, 38036, 38036, 38036, 1.3, 1.0, 2.0, 3.7, 5.6, 7.1, 16.0, 0.10070, 0, 0, 0, 0, 0, 0 total, 1216481, 36458, 36458, 36458, 1.4, 1.0, 2.1, 3.6, 4.7, 5.0, 17.0, 0.10210, 0, 0, 0, 0, 0, 0 total, 1256819, 40338, 40338, 40338, 1.2, 1.0, 2.0, 2.2, 3.5, 5.4, 18.0, 0.10173, 0, 0, 0, 0, 0, 0 total, 1295122, 38303, 38303, 38303, 1.3, 1.0, 2.0, 2.4, 21.0, 21.1, 19.0, 0.10136, 0, 0, 0, 0, 0, 0 total, 1334743, 39621, 39621, 39621, 1.3, 1.0, 2.0, 2.3, 3.3, 4.0, 20.0, 0.10055, 0, 0, 0, 0, 0, 0 total, 1375579, 40836, 40836, 40836, 1.2, 1.0, 2.0, 2.1, 3.4, 5.7, 21.0, 0.09927, 0, 0, 0, 0, 0, 0 total, 1415576, 39997, 39997, 39997, 1.2, 1.0, 2.0, 2.3, 3.2, 4.1, 22.0, 0.09807, 0, 0, 0, 0, 0, 0 total, 1449268, 33692, 33692, 33692, 1.5, 1.4, 2.5, 3.2, 4.2, 5.6, 23.0, 0.09800, 0, 0, 0, 0, 0, 0 total, 1471873, 22605, 22605, 22605, 2.2, 2.0, 4.8, 5.9, 7.0, 7.9, 24.0, 0.10015, 0, 0, 0, 0, 0, 0 ... ``` Fixes: https://github.com/scylladb/scylladb/issues/24411 This is a new feature, so no backport needed. 
Closes scylladb/scylladb#26411 * github.com:scylladb/scylladb: docs: workload-prioritization: add driver service level test: add test to verify use of `sl:driver` transport: use `sl:driver` to handle driver's control connections transport: whitespace only change in update_scheduling_group transport: call update_scheduling_group for non-auth connections generic_server: transport: start using `sl:driver` for new connections test: add test_desc_* for driver service level test: service_levels: add tests for sl:driver creation and removal test: add reload_raft_topology_state() to ScyllaRESTAPIClient service_level_controller: automatically create `sl:driver` service_level_controller: methods to create driver service level service_level_controller: handle special sl:driver in DESC output topology_coordinator: add service_level_controller reference system_keyspace: add service_level_driver_created test: add MAX_USER_SERVICE_LEVELS	2025-10-09 17:28:39 +03:00
Dawid Mędrek	ecc955fbe0	index/secondary_index_manager: Take std::span instead of std::vector	2025-10-09 16:17:07 +02:00
Dawid Mędrek	074f0f2e4c	index/secondary_index_manager: Add missing const qualifier	2025-10-09 16:06:50 +02:00
Dawid Mędrek	7baf95bc4b	index/vector_index: Add missing const qualifiers	2025-10-09 16:06:24 +02:00
Dawid Mędrek	4486ac0891	cql3/statements/index_prop_defs.cc: Remove unused include	2025-10-09 16:01:56 +02:00
Dawid Mędrek	d50c2f7c74	cql3/statements/index_prop_defs.cc: Mark function as TU-local	2025-10-09 16:00:44 +02:00
Dawid Mędrek	89b3d0c582	cql3/statements/index_prop_defs: Mark methods as const-qualified	2025-10-09 15:53:29 +02:00
Avi Kivity	bb02295695	setup: add the lazytime XFS mount option In `f828fe0d59` ("setup: add the lazytime XFS version") we added the lazytime mount option to /var/lib/scylla, but it was quickly reverted (`8f5e80e61a`) as it caused a regression on CentOS 7. We reinstate it now with a kernel version check. This will avoid the lazytime mount option on CentOS 7, which is unsupported anyway. The lazytime option avoids marking the inode as dirty if it's only for the purpose of updating mtime/ctime. This won't help much while writing sstables (since the write also updates extent information), but may help a little with commitlog writes, since those are pure overwrites. It likely won't help with the RWF_NOWAIT violations seen in [1], since those are likely due to in-memory locking, not flushing dirty inodes to disk. Tested with an install to Ubuntu 24.04 LTS followed by a scylla_setup run. The lazytime option was added to the .mount file and showed up in the live mount. [1] https://github.com/scylladb/seastar/issues/2974 Closes scylladb/scylladb#26436 Fixes #26002	2025-10-09 15:55:58 +03:00
Ernest Zaslavsky	c2bab430d7	s3_client: fix `when` condition to prevent infinite locking Refine condition variable predicate in filling fiber to avoid indefinite waiting when `close` is invoked. Closes scylladb/scylladb#26449	2025-10-09 15:55:37 +03:00
Piotr Wieczorek	b54ad9e22f	storage: Add cdc options to cas_request::apply	2025-10-09 12:28:10 +02:00
Piotr Wieczorek	2c1e699864	cdc, storage: Add a struct to pass per-mutation options to CDC This will allow us to communicate with CDC from higher layers. We plan to use it to reduce the number of read-before-writes with preimages by passing the row selected in upper layers.	2025-10-09 12:28:10 +02:00
Piotr Wieczorek	66935bedac	cdc: Move operations enum to the top of the namespace	2025-10-09 12:28:10 +02:00
Michał Chojnowski	c35b82b860	test/cluster/test_bti_index.py: avoid a race with CQL tracing The test uses CQL tracing to check which files were read by a query. This is flaky if the coordinator and the replica are different shards, because the Python driver only waits for the coordinator, and not for replicas, to finish writing their traces. (So it might happen that the Python driver returns a result with only coordinator events and no replica events). Let's just dodge the issue by using --smp=1. Fixes scylladb/scylladb#26432 Closes scylladb/scylladb#26434	2025-10-09 13:22:06 +03:00
Michał Chojnowski	87e3027c81	docs: fix a parameter name in API calls in sstable-dictionary-compression.rst The correct argument name is `cf`, not `table`. Fixes scylladb/scylladb#25275 Closes scylladb/scylladb#26447	2025-10-09 13:18:47 +03:00
Robert Bindar	2c74a6981b	Make scylla_io_setup detect request size for best write IOPS We noticed during work on scylladb/seastar#2802 that on i7i family (later proved that it's valid for i4i family as well), the disks are reporting the physical sector sizes incorrectly as 512bytes, whilst we proved we can render much better write IOPS with 4096bytes. This is not the case on AWS i3en family where the reported 512bytes physical sector size is also the size we can achieve the best write IOPS. This patch works around this issue by changing `scylla_io_setup` to parse the instance type out of `/sys/devices/virtual/dmi/id/product_name` and run iotune with the correct request size based on the instance type. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#25315	2025-10-08 14:30:52 +03:00
Piotr Dulikowski	fe7ffc5e5d	Merge 'service/qos: set long timeout for auth queries on SL cache update' from Michael Litvak pass an appropriate query state for auth queries called from service level cache reload. we use the function qos_query_state to select a query_state based on caller context - for internal queries, we set a very long timeout. the service level cache reload is called from group0 reload. we want it to have a long timeout instead of the default 5 seconds for auth queries, because we don't have strict latency requirement on the one hand, and on the other hand a timeout exception is undesired in the group0 reload logic and can break group0 on the node. Fixes https://github.com/scylladb/scylladb/issues/25290 backport possible to improve stability Closes scylladb/scylladb#26180 * github.com:scylladb/scylladb: service/qos: set long timeout for auth queries on SL cache update auth: add query_state parameter to query functions auth: refactor query_all_directly_granted	2025-10-08 12:37:01 +02:00
Michał Jadwiszczak	84e4e34d81	db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground This patch moves `discover_existing_staging_sstables()` to be executed from main level, instead of running it on the background fiber. This method needs to run only once during startup to collect existing staging sstables, so there is no need to do it in the background. This change will make any further issues related to it (like scylladb/scylladb#26403) easier to debug. Fixes scylladb/scylladb#26417	2025-10-08 11:16:07 +02:00
Michał Jadwiszczak	575dce765e	db/view/view_building_worker: futurize and rename `start_background_fibers()` Next commit will move `discover_existing_staging_sstables()` to the foreground, so to prepare for this we need to futurize `start_background_fibers()` method and change its name to better reflect its purpose.	2025-10-08 10:19:41 +02:00
Andrzej Jackowski	0072b75541	docs: workload-prioritization: add driver service level Refs: scylladb/scylladb#24411	2025-10-08 08:25:38 +02:00
Andrzej Jackowski	f720ce0492	test: add test to verify use of `sl:driver` `sl:driver` is expected to be used for new and control connections, but other connections that run user load should not use it after the user is authenticated. Refs: scylladb/scylladb#24411	2025-10-08 08:25:33 +02:00
Andrzej Jackowski	f99b8c4a55	transport: use `sl:driver` to handle driver's control connections Before `sl:driver` was introduced, service levels were assigned as follows: 1. New connections were processed in `main`. 2. After user authentication was completed, the connection's SL was changed to the user's SL (or `sl:default` if the user had no SL). This commit introduces `service_level_state` to `client_state` and implements the following logic in `transport/server`: 1. If `sl:driver` is not present in the system (for example, it was removed), service levels behave as described above. 2. If `sl:driver` is present, the flow is: I. New connections use `sl:driver`. II. After user authentication is completed, the connection's SL is changed to the user's SL (or `sl:default`). III. If a REGISTER (to events) request is handled, the client is processing the control connection. We mark the client_state to permanently use `sl:driver`. The aforementioned state `2.III` is represented by `_control_connection` flag in `client_state`. Fixes: scylladb/scylladb#24411	2025-10-08 08:25:28 +02:00
Andrzej Jackowski	fd36bc418a	transport: whitespace only change in update_scheduling_group The indentation is changed because it will be required in the next commit of this patch series.	2025-10-08 08:25:22 +02:00
Andrzej Jackowski	278019c328	transport: call update_scheduling_group for non-auth connections Before this change, unauthorized connections stayed in the `main` scheduling group. This is not ideal: in such a case `sl:default` should be used instead, for consistency with the scenario where a user is authenticated but has no service level assigned. This commit adds a call to `update_scheduling_group` at the end of connection creation for an unauthenticated user, to make sure the service level is switched to `sl:default`. Fixes: scylladb/scylladb#26040	2025-10-08 08:25:17 +02:00
Andrzej Jackowski	14081d0727	generic_server: transport: start using `sl:driver` for new connections Before this change, new connections were handled in a default scheduling group (`main`), because before the user is authenticated we do not know which service level should be used. With the new `sl:driver` service level, creation of new connections can be moved to `sl:driver`. We switch the service level as early as possible, in `do_accepts`. There is a possibility, that `sl:driver` will not exist yet, for instance, in specific upgrade cases, or if it was removed. Therefore, we also switch to `sl:driver` after a connection is accepted. Refs: scylladb/scylladb#24411	2025-10-08 08:25:12 +02:00
Andrzej Jackowski	b62135f767	test: add test_desc_* for driver service level Driver service level is a special service level that is created automatically by the system. Therefore, it requires special handling in DESC SCHEMA WITH INTERNALS, and these tests verify that special behavior. Refs: scylladb/scylladb#24411	2025-10-08 08:25:07 +02:00
Andrzej Jackowski	0ddf46c7b4	test: service_levels: add tests for sl:driver creation and removal Refs: scylladb/scylladb#24411	2025-10-08 08:25:02 +02:00
Andrzej Jackowski	9e9bca9bdb	test: add reload_raft_topology_state() to ScyllaRESTAPIClient To encapsulate `/storage_service/raft_topology/reload` API call	2025-10-08 08:24:57 +02:00
Andrzej Jackowski	c59a7db1c9	service_level_controller: automatically create `sl:driver` This commit: - Increases the number of allowed scheduling groups to allow the creation of `sl:driver`. - Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating `sl:driver` until all nodes have increased the number of scheduling groups. - Starts using `get_create_driver_service_level_mutations` to unconditionally create `sl:driver` on `raft_initialize_discovery_leader`. The purpose of this code path is ensuring existence of `sl:driver` in new system and tests. - Starts using `migrate_to_driver_service_level` to create `sl:driver` if it is not already present. The creation of `sl:driver` is managed by `topology_coordinator`, similar to other system keyspace updates, such as the `view_builder` migration. The purpose of this code path is handling upgrades. - Modifies related tests to pass after `sl:driver` is added. Later in this patch series, `sl:driver` will be used by `transport/server` to handle selected traffic, such as the driver's schema and topology fetches. Refs: scylladb/scylladb#24411	2025-10-08 08:24:43 +02:00
Andrzej Jackowski	923559f46a	service_level_controller: methods to create driver service level This commit implements `get_create_driver_service_level_mutations` and `migrate_to_driver_service_level` in service_level_controller. Both methods create `sl:driver` with shares=200 and store this fact in `system.scylla_local`. Both methods will be used later in this patch series for automatic creation of sl:driver. Refs: scylladb/scylladb#24411	2025-10-08 08:24:38 +02:00
Andrzej Jackowski	2d296a2f9b	service_level_controller: handle special sl:driver in DESC output Later in this patch series, `sl:driver` will be added as a special service level created automatically by the system. It needs special handling in `DESC SCHEMA ...` to ensure that during backup restore: 1. CREATE SERVICE LEVEL does not fail if `sl:driver` already exists 2. If `sl:driver` exists, its configuration is fully restored (emit ALTER SERVICE LEVEL). 3. If `sl:driver` was removed, the information is retained (emit DROP SERVICE LEVEL instead of CREATE/ALTER). Refs: scylladb/scylladb#24411	2025-10-08 08:24:33 +02:00
Andrzej Jackowski	1ff605005e	topology_coordinator: add service_level_controller reference This adds a reference to sl_controller so that, later in this patch series, topology_coordinator can manage creating `sl:driver` once group0 is fully operational. Refs: scylladb/scylladb#24411	2025-10-08 08:24:28 +02:00
Andrzej Jackowski	8953f96609	system_keyspace: add service_level_driver_created This commit extends the system.scylla_local table with an additional key/value pair that can be used later in this patch series to record that `sl:driver` was already created. The purpose of storing this information is to ensure that `sl:driver` is not recreated after being intentionally removed. A new mutation is included in `register_raft_pull_snapshot` to keep `service_level_driver_created` in the state machine snapshot, which is required for proper propagation of the value when a new node is added to the cluster. Refs: scylladb/scylladb#24411	2025-10-08 08:24:23 +02:00
Andrzej Jackowski	7d2db37831	test: add MAX_USER_SERVICE_LEVELS Previously, tests used the hardcoded value 7 for the maximum number of user service levels. This commit introduces a named variable that can be shared across tests to avoid cases where this magic number goes out of sync.	2025-10-08 08:24:17 +02:00
Patryk Jędrzejczak	d391b9d7d9	test: assert that majority is lost in some tests of the recovery procedure The voter handler caused `test_raft_recovery_user_data` to stop losing group 0 majority when expected. We make sure this won't happen again in this commit. We don't change `test_raft_recovery_entry_lose` because it has some checks that would fail with group 0 majority (schema versions would match). Note that it's possible to timeout the read barrier quickly without the `timeout` parameter. See e.g. `test_cannot_add_new_node` in `test_raft_no_quorum.py`. We don't take this approach here because we don't want to change the default Raft parameters in the recovery procedure tests.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	d623844c1c	test: rest_client: add timeout support for read_barrier Scylla already handles the `timeout` parameter, so the change is simple. We use the `timeout` parameter in the following commit.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	91c8466f47	test: test_raft_recovery_user_data: lose majority when killing one dc After introducing the voter handler, the test stopped losing group 0 majority when expected because the killed dc contained 2 out of 5 voters. We fix it in this commit. The fix relies on the voter handler not doing unnecessary work. The first dc should keep its voters and majority. The test was functional even though majority wasn't lost when expected. Stopping the recovery leader before restarting it with `recovery_leader` caused majority loss in the old group 0. Hence, there is no need to backport this commit.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	c8a5e7a74e	test: test_raft_recovery_user_data: shutdown driver sessions Shutting down `ccluster_all_nodes` in the previous commit is necessary to avoid flakiness. It turns out that leaked driver sessions can impact another run of the test case (with different parameterization). Here, without shutting down `ccluster_all_nodes`, we could observe the DDL requests from `start_writes` fail in the second test case run (where `remove_dead_nodes_with == "replace"`) like this: ``` > await cql.run_async(f"USE {ks_name}") E cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.46.35.70:9042 dc1>: ConnectionException('Host has been marked down or removed'), <Host: 127.46.35.71:9042 dc1>: ConnectionException('Host has been marked down or removed'), <Host: 127.46.35.3:9042 dc1>: ConnectionException('Host has been marked down or removed'), <Host: 127.46.35.25:9042>: ConnectionException('Host has been marked down or removed')}) ``` We could also see errors like this on the driver: ``` cassandra.InvalidRequest: Error from server: code=2200 [Invalid query] message="Keyspace 'test_1759763911381_oktks' does not exist" ``` It turned out that `test_1759763911381_oktks` was created in the first test case run (where `remove_dead_nodes_with == "remove"`), and somehow the driver session created in the second test case run was still using this keyspace in some way. The DDL requests were failing on the Scylla side with the error above, and after some retries, the driver marked nodes as down. I didn't try to investigate what exactly the driver was doing. In this commit, we shut down other driver sessions used in this test. They didn't cause problems so far, but we'd better use the Python driver correctly and be safe.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	a35740cbe8	test: test_raft_recovery_user_data: use a separate driver connection for the write workload It's simpler than pausing the workload for the `cql` reconnection. Moreover, the removed `start_writes` call required group 0 majority for (redundant) CREATE KEYSPACE IF NOT EXISTS and CREATE TABLE IF NOT EXISTS statements. The test shouldn't have group 0 majority at that point, which is fixed in one of the following commits. Using a separate driver connection also allows us to call `finish_writes()` a bit later, after the `cql` reconnection.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	d1a944251e	test: test_raft_recovery_user_data: send ALTER KEYSPACE to any node We have the global request queue now, so we can't hit "Another global topology request is ongoing, please retry." anymore.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	9a98febac5	test: test_raft_recovery_user_data: bring failure_detector_timeout_in_ms back to 20 s It looks like decreasing `failure_detector_timeout_in_ms` doesn't make the shutdown faster anymore. We had some changes related to requests during shutdown like #24499 and #24714. They are probably the reason.	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	2b1d7f0e83	test: test_raft_recovery_user_data: speed up replace operations	2025-10-07 17:48:55 +02:00
Patryk Jędrzejczak	dbd998bc15	test: stop/start servers concurrently in the recovery procedure tests This change makes these tests a bit faster.	2025-10-07 17:48:51 +02:00
Dawid Mędrek	a9577e4d52	replica/database: Fix description of `validate_tablet_views_indexes` The current description is not accurate: the function doesn't throw an exception if there's an invalid materialized view. Instead, it simply logs the keyspaces that violate the requirement. Furthermore, the experimental feature `views-with-tablets` is no longer necessary for considering a materialized view as valid. It was dropped in scylladb/scylladb@b409e85c20. The replacement for it is the cluster feature `VIEWS_WITH_TABLETS`. Fixes scylladb/scylladb#26420 Closes scylladb/scylladb#26421	2025-10-07 17:39:43 +02:00
Artsiom Mishuta	99455833bd	test.py: reintroducing sudo in resource_gather.py conditionally reintroducing sudo for resource gathering when running under docker related: https://github.com/scylladb/scylladb/pull/26294#issuecomment-3346968097 fixes: https://github.com/scylladb/scylladb/issues/26312 Closes scylladb/scylladb#26401	2025-10-07 14:42:15 +02:00
Piotr Dulikowski	264cf12b66	Merge 'view building coordinator - add missing tests' from Michał Jadwiszczak This patch adds tests for: - tablet migration during view building - tablet merge during view building. Those tests were missing from the original testing plan. We want to backport it to 2025.4 to ensure the release is bug-free. Closes scylladb/scylladb#26414 * github.com:scylladb/scylladb: test/cluster/test_view_building_coordinator: add test for tablet merge test/cluster/test_view_building_coordinator: add test for tablet migration	2025-10-07 14:25:04 +02:00
Botond Dénes	8b0bfb817e	Merge 'Switch REST API server to use content-streaming' from Pavel Emelyanov Seastar httpd has recommended that users stop using the contiguous request.content string and instead read the body they need from the request's input_stream. However, the "official" deprecation of request content was made only recently. This PR patches the REST API server to turn this feature on and patches a few handlers that work with request bodies to read them from the request stream. Using newer seastar API, no need to backport Closes scylladb/scylladb#26418 * github.com:scylladb/scylladb: api: Switch to request content streaming api: Fix indentation after previous patch api: Coroutinize set_relabel_config handler api: Coroutinize set_error_injection handler	2025-10-07 14:13:47 +03:00
Michał Chojnowski	3cf51cb9e8	sstables: fix some typos in comments I added those typos recently, and spellcheckers complain. Closes scylladb/scylladb#26376	2025-10-07 13:20:06 +03:00
Botond Dénes	8beea931be	Merge 'Remove system_keyspace from column_family API' from Pavel Emelyanov This dependency reference is carried into column_family handlers block to make get_built_views handler work. However, the handler in question should live in view_builder block, because it works with v.b. data. This PR moves the handler there, while at it, coroutinizes it, and removes the no longer needed sys.ks. reference from column_family. API dependencies cleanup work, no need to backport Closes scylladb/scylladb#26381 * github.com:scylladb/scylladb: api: Fix indentation after previous patch api: Coroutinize get_built_indexes handler code api: Remove system_keyspace ref from column_family API block api: Move get_built_indexes from column_family to view_builder	2025-10-07 13:07:46 +03:00
Pavel Emelyanov	ed1c049c3b	scripts: Add usage to pull_github_pr script If mis-used, the script says error: unrecognized option: ..., see ./scripts/pull_github_pr.sh -h for usage but if using the suggested -h option it prints just the same. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26378	2025-10-07 10:28:35 +03:00
Lakshmi Narayanan Sreethar	b8042b66e3	cmake: replace -fvisibility=hidden compiler flag with -fvisibility-inlines-hidden The PR #26154 dropped the `-fvisibility=hidden` compiler flag and replaced it with `-fvisibility-inlines-hidden` as the former caused issues in how the `noncopyable_function::operator bool` method executed leading to incorrect return values. Apply the same fix to cmake. Fixes #26391 Closes scylladb/scylladb#26431	2025-10-07 10:10:47 +03:00
Pavel Emelyanov	127afd4da1	api: Switch to request content streaming There are three handlers that need to be patched all at once, with the server itself being marked with set_content_streaming. Two simple handlers just get the content string with the read_entire_stream_contiguous helper. This is what the httpd server did anyway. The "start_restore" handler used the contiguous contents to parse json with the rjson utility. This handler is patched to use read_entire_stream(), which returns a vector of temporary buffers. The rjson parser has a helper to parse from that vector, so the change is also an optimization. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-06 16:43:26 +03:00
Pavel Emelyanov	2cfccdac5c	api: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-06 16:43:12 +03:00
Pavel Emelyanov	5668058cb0	api: Coroutinize set_relabel_config handler Without the invoke_on_all lambda, for simplicity Also keep indentation "broken" for the ease of review Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-06 16:42:30 +03:00
Pavel Emelyanov	5017a25c00	api: Coroutinize set_error_injection handler Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-06 16:42:14 +03:00
Patryk Jędrzejczak	67d48a459f	raft topology: make the voter handler consider only group 0 members In the Raft-based recovery procedure, we create a new group 0 and add live nodes to it one by one. This means that for some time there are nodes which belong to the topology, but not to the new group 0. The voter handler running on the recovery leader incorrectly considers these nodes while choosing voters. The consequences: - misleading logs, for example, "making servers {<ID of a non-member>} voters", where the non-member won't become a voter anyway, - increased chance of majority loss during the recovery procedure, for example, all 3 nodes that first joined the new group 0 are in the same dc and rack, but only one of them becomes a voter because the voter handler tries to make non-members in other dcs/racks voters. Fixes #26321 Closes scylladb/scylladb#26327	2025-10-06 16:27:47 +03:00
Pavel Emelyanov	8002ddf946	code: Use tls_options::bye_timeout instead of deprecated switch Some code wants its TLS sockets to close immediately without sending BYE message and waiting for the response. Recent seastar update changed the way this functionality is requested (scylladb/seastar#2986) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26253	2025-10-06 16:25:35 +03:00
Michał Jadwiszczak	279a8cbba3	test/cluster/test_view_building_coordinator: add test for tablet merge The test pauses processing of the view building task and triggers tablet merge.	2025-10-06 15:06:11 +02:00
Michał Jadwiszczak	fc7e5370a1	test/cluster/test_view_building_coordinator: add test for tablet migration The test pauses processing of the view building task and migrates it to another node.	2025-10-06 15:02:42 +02:00
Michał Chojnowski	3b338e36c2	utils/config_file: fix a missing `allowed_values` propagation in one of `named_value` constructors In one of the constructors of `named_value`, the `allowed_values` argument isn't used. (This means that if some config entry uses this constructor, the values aren't validated on the config layer, and might give some lower layer a bad surprise). Fix that. Fixes scylladb/scylladb#26371 Closes scylladb/scylladb#26196	2025-10-06 15:33:11 +03:00
Michał Chojnowski	dbddba0794	sstables/trie: actually apply BYPASS CACHE to index reads BYPASS CACHE is implemented for `bti_index_reader` by giving it its own private `cached_file` wrappers over Partitions.db and Rows.db, instead of passing it the shared `cached_file` owned by the sstable. But due to an oversight, the private `cached_file`s aren't constructed on top of the raw Partitions.db and Rows.db files, but on top of `cached_file_impl` wrappers around those files. Which means that BYPASS CACHE doesn't actually do its job. Tests based on `scylla_index_page_cache_*` metrics and on CQL tracing still see the reads from the private files as "cache misses", but those misses are served from the shared cached files anyway, so the tests don't see the problem. In this commit we extend `test_bti_index.py` with a check that looks at reactor's `io_queue` metrics instead, and catches the problem. Fixes scylladb/scylladb#26372 Closes scylladb/scylladb#26373	2025-10-06 15:32:05 +03:00
Andrzej Jackowski	c3dd383e9e	test: add reproduction of name reuse bug to service level tests This commit adds a reproduction test for scylladb/scylladb#26190 to the service levels test suite. Although the bug was fixed internally in Seastar, the corner-case service level name reuse scenario should be covered by tests to prevent regressions. Refs: https://github.com/scylladb/scylladb/issues/26190 Closes scylladb/scylladb#26379	2025-10-06 14:19:22 +02:00
Piotr Dulikowski	380f243986	Merge 'Support replication factor rack list for tablet-based keyspaces' from Tomasz Grabiec This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names. For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] } Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs. Specifying the rack list also allows adding replicas on the specified racks (increasing the replication factor), or removing replicas from certain racks (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks. Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported. New feature, no backport required. 
Co-authored with @bhalevy Fixes https://github.com/scylladb/scylladb/issues/25269 Fixes https://github.com/scylladb/scylladb/issues/23525 Closes scylladb/scylladb#26358 * github.com:scylladb/scylladb: tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count locator: Make hasher for endpoint_dc_rack globally accessible test: tablets: Add test for replica allocation on rack list changes test: lib: topology_builder: generate unique rack names test: Add tests for rack list RF doc: Document rack-list replication factor topology_coordinator: Restore formatting topology_coordinator: Cancel keyspace alter on broader set of errors topology_coordinator: Make keyspace alter process options through as_ks_metadata_update() cql3: ks_prop_defs: Preserve old options cql3: ks_prop_defs: Introduce flattened() locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace() tablet_allocator: Respect binding replicas to racks locator: network_topology_strategy: Respect rack list when reallocating tablets cql3: ks_prop_defs: Fail with more information when options are not in expected format locator, cql3: Support rack lists in replication options cql3: Fail early on vnode/tablet flavor alter cql3: Extract convert_property_map() out of Cql.g schema: Use definition from the header instead of open-coding it locator: Abstract obtaining the number of replicas from replication_strategy_config_option cql3, locator: Use type aliases for option maps locator: Add debug logging locator: Pass topology to replication strategy constructor abstract_replication_strategy, network_topology_strategy: add replication_factor_data class	2025-10-06 14:14:09 +02:00
Piotr Dulikowski	e7907b173a	Merge 'db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Dawid Mędrek Materialized views are currently in the experimental phase and using them in tablet-based keyspaces requires starting Scylla with an experimental feature, `views-with-tablets`. Any attempts to create a materialized view or secondary index when it's not enabled will fail with an appropriate error. After considerable effort, we're drawing close to bringing views out of the experimental phase, and the experimental feature will no longer be needed. However, materialized views in tablet-based keyspaces will still be restricted, and creating them will only be possible after enabling the configuration option `rf_rack_valid_keyspaces`. That's what we do in this PR. In this patch, we adjust existing tests in the tree to work with the new restriction. That shouldn't have been necessary because we've already seemingly adjusted all of them to work with the configuration option, but some tests hid well. We fix that mistake now. After that, we introduce the new restriction. What's more, when starting Scylla, we verify that there is no materialized view that would violate the contract. If there are some that do, we list them, notify the user, and refuse to start. High-level implementation strategy: 1. Name the restrictions in form of a function. 2. Adjust existing tests. 3. Restrict materialized views by both the experimental feature and the configuration option. Add validation test. 4. Drop the requirement for the experimental feature. Adjust the added test and add a new one. 5. Update the user documentation. Fixes scylladb/scylladb#23030 Backport: 2025.4, as we are aiming to support materialized views for tablets from that version. 
Closes scylladb/scylladb#25802 * github.com:scylladb/scylladb: view: Stop requiring experimental feature db/view: Verify valid configuration for tablet-based views db/view: Require rf_rack_valid_keyspaces when creating view test/cluster/random_failures: Skip creating secondary indexes test/cluster/mv: Mark test_mv_rf_change as skipped test/cluster: Adjust MV tests to RF-rack-validity test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces db/view: Name requirement for views with tablets	2025-10-06 12:46:46 +02:00
Pavel Emelyanov	6ad8dc4a44	Merge 'root,replica: mv querier to replica/' from Botond Dénes The querier object is a confusing one. Based on its name it should be in the query/ module and it is already in the query namespace. The query namespace is used for symbols which span the coordinator and replica, or that are mostly coordinator side. The querier is mainly in this namespace due to its similar name and because at the time it was introduced, namespace replica didn't exist yet. But this is a mistake which confuses people. The querier is actually a completely replica-side logic, implementing the caching of the readers on the replica. Move it to the replica module and namespace to make this more clear. Code cleanup, no backport. Closes scylladb/scylladb#26280 * github.com:scylladb/scylladb: replica: move querier code to replica namespace root,replica: mv querier to replica/	2025-10-06 08:26:05 +03:00
Pavel Emelyanov	5cf9043d74	Merge 'sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db' from Michał Chojnowski TemporaryHashes.db is a temporary sstable component used during ms sstable writes. It's different from other sstable components in that it's not included in the TOC. Because of this, it has a special case in the logic that deletes unfinished sstables on boot. (After Scylla dies in the middle of a sstable write). But there's a bug in that special case, which causes Scylla to forget to delete other components from the same unfinished sstable. The code intends only to delete the TemporaryHashes.db file from the `_state->generations_found` multimap, but it accidentally also deletes the file's sibling components from the multimap. Fix that. Also, extend a related test so that it would catch the problem before the fix. Fixes scylladb/scylladb#26393 Bugfix, needs backport to 2025.4. Closes scylladb/scylladb#26394 * github.com:scylladb/scylladb: sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db test/boost/database_test: fix two no-op distributed loader tests	2025-10-06 08:23:03 +03:00
Andrzej Jackowski	3411089f5d	treewide: seastar module update The reason for this seastar update is fixing #26190 - a service level bug caused by a problem in scheduling group in seastar implementation (seastar#2992). * ./seastar 9c07020a...270476e7 (10): > core: restore seastar_logger namespace in try_systemwide_memory_barrier > Merge 'coroutines: support coroutines that copy their captures into the coroutine frame' from Avi Kivity coroutines: advertise lambda-capture-by-value and test it future: invoke continuation functions as temporaries future: handle lvalue references in future continuations early > resource: Tune up some allocate_io_queues() arguments > Merge 'Add perf test hooks' from Travis Downs perf_tests:add tests to verify pre-run hooks per_tests: add pre-run hooks perf-tests.md: update on measurement overhead perf_tests_perf: a few more test variations remove vestigial register_test method > Add `touch` command to `rl` file processing > Merge 'execution_stage: update stage name on scheduling_group rename' from Andrzej Jackowski test: add sg_rename_recreate_with_the_same_name test: add test_renaming_execution_stage in metric_test test: add test_execution_stage_rename execution_stage: update stage name on scheduling_group rename execution_stage: reorganize per_group_stage_type execution_stage: add concrete_execution_stage_base execution_stage: move metrics setup to a separate method > iotune: Fix warmup calculation bug and botched rebase > Add missing `#pragma once` to ascii.rl > iotune: Ignore measurements during warmup period Fixes: https://github.com/scylladb/scylladb/issues/26190 Closes scylladb/scylladb#26388	2025-10-06 08:13:37 +03:00
Michał Chojnowski	6efb807c1a	sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db TemporaryHashes.db is a temporary sstable component used during ms sstable writes. It's different from other sstable components in that it's not included in the TOC. Because of this, it has a special case in the logic that deletes unfinished sstables on boot. (After Scylla dies in the middle of a sstable write). But there's a bug in that special case, which causes Scylla to forget to delete other components from the same unfinished sstable. The code intends only to delete the TemporaryHashes.db file from the `_state->generations_found` multimap, but it accidentally also deletes the file's sibling components from the multimap. Fix that. Fixes scylladb/scylladb#26393	2025-10-04 00:45:55 +02:00
Michał Chojnowski	16cb223d7f	test/boost/database_test: fix two no-op distributed loader tests There are two tests which effectively check nothing. They intend to check that distributed loader removes "leftover" sstable files. So they create some incomplete sstables, run the test env on the directory, and check that the files have disappeared. But the test env completely clears the test directory before the distributed loader looks at the files, so the tests succeed trivially. Fix that by adding a config knob to the test env which instructs it not to clear the directory before the test.	2025-10-04 00:44:49 +02:00
Michał Hudobski	3db2e67478	docs: adjust docs for VS auth changes We adjust the documentation to include the new VECTOR_SEARCH_INDEXING permission and its usage and also to reflect the changes in the maximal amount of service levels.	2025-10-03 16:55:57 +02:00
Michał Hudobski	e8fb745965	test: add tests for VECTOR_SEARCH_INDEXING permission This commit adds tests to verify the expected behavior of the VECTOR_SEARCH_INDEXING permission, that is, allowing GRANTing this permission only on ALL KEYSPACES and allowing SELECT queries only on tables with vector indexes when the user has this permission	2025-10-03 16:55:57 +02:00
Michał Hudobski	6a69bd770a	cql: allow VECTOR_SEARCH_INDEXING users to select This patch allows users with the VECTOR_SEARCH_INDEXING permission to perform SELECT queries on tables that have a vector index. This is needed for the Vector Store service, which reads the vector-indexed tables, but does not require the full SELECT permission.	2025-10-03 16:55:57 +02:00
Michał Hudobski	3025a35aa6	auth: add possibility to check for any permission in set This commit adds a new version of the command_desc struct that contains a set of permissions instead of a single permission. When this struct is passed to ensure/check_has_permission, we check if the user has any of the included permissions on the resource.	2025-10-03 16:55:57 +02:00
Michał Hudobski	ae86bfadac	auth: add a new permission VECTOR_SEARCH_INDEXING This patch adds a new permission: VECTOR_SEARCH_INDEXING, that is grantable only for ALL KEYSPACES. It will allow selecting from tables with vector search indexes. It is meant to be used by the Vector Store service to allow it to build indexes without having full SELECT permissions on the tables.	2025-10-03 16:36:54 +02:00
Ferenc Szili	20aeed1607	load balancing: extend locator::load_stats to collect tablet sizes This commit extends the TABLE_LOAD_STATS RPC with data about the tablet replica sizes and effective disk capacity. Effective disk capacity of a node is computed as the sum of the sizes of all tablet replicas on the node and the available disk space. This is the first change in the size-based load balancing series. Closes scylladb/scylladb#26035	2025-10-03 13:37:22 +02:00
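The effective disk capacity formula described in the commit above can be illustrated with a minimal sketch. The function name and the sample numbers are hypothetical and are not taken from the ScyllaDB sources; only the formula (sum of tablet replica sizes plus available disk space) comes from the commit message.

```python
def effective_disk_capacity(tablet_replica_sizes, available_disk_space):
    """Return the sum of all tablet replica sizes on a node plus its free disk space.

    Hypothetical helper: it mirrors the formula in the commit message,
    not the actual locator::load_stats code. All values are in bytes.
    """
    return sum(tablet_replica_sizes) + available_disk_space


GiB = 1024 ** 3
# Hypothetical node holding three tablet replicas with 500 GiB still free.
capacity = effective_disk_capacity([10 * GiB, 20 * GiB, 5 * GiB], 500 * GiB)
print(capacity // GiB)  # 535
```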
Pavel Emelyanov	37f59cef04	Merge 'tools: fix documentation links after change to source-available' from Botond Dénes Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change, they still point to the old opensource docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these. Fixes: scylladb/scylladb#26320 Broken documentation link fix for the tool help output, needs backport to all live source-available versions. Closes scylladb/scylladb#26322 * github.com:scylladb/scylladb: tools/scylla-sstable: fix doc links release: adjust doc_link() for the post source-available world tools/scylla-nodetool: remove trailing " from doc urls	2025-10-03 13:53:19 +03:00
Pavel Emelyanov	7116e7dac6	api: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-03 13:51:25 +03:00
Pavel Emelyanov	42657105a3	api: Coroutinize get_built_indexes handler code "While at it". It looks much simpler this way. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-03 13:50:58 +03:00
Pavel Emelyanov	f77f9db96c	api: Remove system_keyspace ref from column_family API block This reference was only needed to facilitate get_built_indexes handler to work. Now it's gone and the sys.ks. reference is no longer needed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-03 13:50:22 +03:00
Pavel Emelyanov	95b616d0e5	api: Move get_built_indexes from column_family to view_builder The handler effectively works with the view_builder and should be registered in the block that has this service captured. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-10-03 13:49:33 +03:00
Tomasz Grabiec	9ebdeb261f	tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count The old logic assumes that replicas are spread across whole DC when determining how many tablets we need to have at least 10 tablets per shard. If replicas are actually confined to a subset of racks, that will come up with a too high count and overshoot actual per-shard count in this rack. Similar problem happens for scaling-down of tablet count, when we try to keep per-shard tablet count below the goal. It should be tracked per-rack rather than per-DC, since racks can differ in how loaded they are by RF if it's a rack-list.	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	6962464be7	locator: Make hasher for endpoint_dc_rack globally accessible	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	85ddb832b4	test: tablets: Add test for replica allocation on rack list changes	2025-10-02 19:45:00 +02:00
Benny Halevy	4955ca3ddd	test: lib: topology_builder: generate unique rack names Encode the dc identifier into each rack name so each dc will have its own unique racks. Just for easier distinction in logs. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	5fc617ecf5	test: Add tests for rack list RF	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	1d34614421	doc: Document rack-list replication factor Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	655f4ffa3c	topology_coordinator: Restore formatting	2025-10-02 19:45:00 +02:00
Tomasz Grabiec	a21bbd4773	topology_coordinator: Cancel keyspace alter on broader set of errors We now include keyspace metadata construction, which can throw if validation fails. We want to fail the ALTER rather than keep retrying.	2025-10-02 19:44:59 +02:00
Tomasz Grabiec	d02f93e77e	topology_coordinator: Make keyspace alter process options through as_ks_metadata_update() There are several problems with how ALTER execution works with tablets. 1) Currently, option processing bypasses ks_prop_defs::prepare_options(), and passes the options directly to keyspace_metadata. This deviates from the vnode path, causing discrepancy in logic. But also there will be some non-trivial options post-processing added there - numeric RF will be replaced with a rack list. We should preserve it in the tablet path which alters the keyspace, otherwise it will fail when trying to construct network_topology_strategy. 2) Option merging happens on the flat version of the map, which won't work correctly with extended map which contains lists. We want the new list to replace the old list or numeric RF, not get its items merged. For example: We want: {'dc1': 3} + {'dc1': ['rack1', 'rack2']} = {'dc1': ['rack1', 'rack2']} If we merge flattened options, we would get incorrect flattened options: {'dc1': 3, 'dc1:0': 'rack1', 'dc1:1': 'rack2'} 3) We lose atomicity of update. Validation and merging which happens on the CQL coordinator is done in a different group0 transaction context than mutation generation inside topology coordinator later. Fixes https://github.com/scylladb/scylladb/issues/25269	2025-10-02 19:44:29 +02:00
Tomasz Grabiec	849ab5545f	cql3: ks_prop_defs: Preserve old options In `2d9b8f2`, semantics of ALTER was changed for tablet-based keyspaces which makes "replication" assignment act like +=, where replication options are merged with the old options. This merging is currently performed at the CQL statement level on the options map, before passing to topology coordinator. This will change in a later commit, so move merging here. Merging options at the flattened level will not be correct because it doesn't recognize nested collections, like rack lists. We want: {'dc1': 3} + {'dc1': ['rack1', 'rack2']} = {'dc1': ['rack1', 'rack2']} If we merge flattened options, we would get incorrect flattened options: {'dc1': 3, 'dc1:0': 'rack1', 'dc1:1': 'rack2'} Which cannot be parsed back into ks_prop_defs on the topology coordinator. Refs https://github.com/scylladb/scylladb/pull/20208#issuecomment-3174728061 Refs #25549	2025-10-02 19:42:39 +02:00
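The difference between merging at the extended level and merging at the flattened level, as described in the commit above, can be reproduced with a small sketch. The `flatten` helper is a hypothetical stand-in for the real `ks_prop_defs` logic, and plain dict merging is only an assumed model of the "+=" semantics.

```python
def flatten(options):
    """Flatten an extended options map: a rack list becomes 'key:index' entries.

    Hypothetical helper modelled on the format described in the commit
    messages; it is not the ScyllaDB implementation.
    """
    flat = {}
    for key, value in options.items():
        if isinstance(value, list):
            for index, rack in enumerate(value):
                flat[f"{key}:{index}"] = rack
        else:
            flat[key] = value
    return flat


old = {"dc1": "3"}
new = {"dc1": ["rack1", "rack2"]}

# Merging extended maps: the new rack list replaces the old numeric RF.
merged_extended = {**old, **new}
print(merged_extended)  # {'dc1': ['rack1', 'rack2']}

# Merging flattened maps: the stale numeric RF survives next to the rack entries.
merged_flat = {**flatten(old), **flatten(new)}
print(merged_flat)  # {'dc1': '3', 'dc1:0': 'rack1', 'dc1:1': 'rack2'}
```

The second result carries both a numeric RF and rack entries for the same DC, which is the inconsistent state the commit avoids by merging before flattening.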
Tomasz Grabiec	0d0c06da06	cql3: ks_prop_defs: Introduce flattened()	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	6b7b0cb628	locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace()	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	e5b7452af2	tablet_allocator: Respect binding replicas to racks	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	6de342ed3e	locator: network_topology_strategy: Respect rack list when reallocating tablets	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	8e9a58b89f	cql3: ks_prop_defs: Fail with more information when options are not in expected format Before, we would throw a vague sstring_out_of_range from substr() when the name doesn't have a nested key separated by ":", e.g. "durable_writes" instead of "durable_writes:durable_writes".	2025-10-02 19:42:39 +02:00
Tomasz Grabiec	66755db062	locator, cql3: Support rack lists in replication options Allows per-DC replication factor to be either a string, holding a numerical value, or a list of strings, holding a list of rack names. The rack list is not respected yet by the tablet allocator, this is achieved in subsequent commit. This changes the format of options stored in the flattened map in system_schema.keyspaces#replication. Values which are rack lists, are converted into multiple entries, with the list index appended to the key with ':' as the separator: For example, this extended map: { 'dc1': '3', 'dc2': ['rack1', 'rack2'] } is stored as a flattened map: { 'dc1': '3', 'dc2:0': 'rack1', 'dc2:1': 'rack2' } Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>	2025-10-02 19:42:39 +02:00
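The flattened storage format described in the commit above can be sketched as a round trip between the extended map and the flat map. Both function names are hypothetical, and the parsing rule used here (split on the last ':' and treat a numeric suffix as a list index) is an assumption made for illustration; the real parsing in `ks_prop_defs` may differ.

```python
def flatten(options):
    """Turn rack lists into 'key:index' entries; keep plain string values as-is."""
    flat = {}
    for key, value in options.items():
        if isinstance(value, list):
            for index, rack in enumerate(value):
                flat[f"{key}:{index}"] = rack
        else:
            flat[key] = value
    return flat


def unflatten(flat):
    """Rebuild the extended map (assumed rule: 'name:<digits>' is a list entry)."""
    options = {}
    indexed = {}
    for key, value in flat.items():
        name, sep, index = key.rpartition(":")
        if sep and index.isdigit():
            indexed.setdefault(name, {})[int(index)] = value
        else:
            options[key] = value
    for name, items in indexed.items():
        # Restore the list in index order.
        options[name] = [items[i] for i in sorted(items)]
    return options


extended = {"dc1": "3", "dc2": ["rack1", "rack2"]}
flat = flatten(extended)
print(flat)  # {'dc1': '3', 'dc2:0': 'rack1', 'dc2:1': 'rack2'}
print(unflatten(flat) == extended)  # True
```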
Botond Dénes	9310d3eff3	scylla-gdb.py: small-objects: don't cast free_object to void* When walking the free-list of a pool or a span, the small-object code casts the dereferenced `free_object` to `void*`. This is unnecessary; just use the `next` field of the `free_object` to look up the next free object. I think this monkey business with `void*` was done to speed up walking the free-list, but recently we've seen small-object --summarize fail in CI, and it could be related. Fixes: #25733 Closes scylladb/scylladb#26339	2025-10-02 13:36:49 +03:00
Botond Dénes	9d08a380db	Merge 'Fix getendpoints command for compound keys containing ':'' from Taras Veretilnyk Before, the `nodetool getendpoints` expected the key as one string separated by : (for example 1:val:ue). This caused errors if any part of the key had a colon because it was unclear whether a colon was a separator or part of the key. This change adds a new API endpoint, `/storage_service/natural_endpoints/v2/{keyspace}`, which accepts composite partition keys as multiple key_component query parameters (e.g., ?key_component=1&key_component=val:ue). The `nodetool getendpoints` command was updated to support a new `--key-components` option, allowing users to pass key components as an array. The client and test infrastructure were extended to support multiple values for a query parameter, and tests were added to verify correct behavior with composite keys. The previous method of passing partition keys as colon-separated strings is preserved for backward compatibility. Backport is not required, since this change relies on recent Seastar updates Fixes #16596 Closes scylladb/scylladb#26169 * github.com:scylladb/scylladb: docs: document --key-components option for getendpoints test/nodetool/test_getendpoints: add coverage for --key-components param in getendpoints nodetool: Introduce new option --key-components to specify compound partition keys as array rest_api/test_storage_service: add v2 natural_endpoints test for composite key with multiple components api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':' rest_api_mock: support duplicate query parameters test/rest_api: support multiple query values per key in RestApiSession.send() nodetool: add support of new seastar query_parameters_type to scylla_rest_client	2025-10-02 09:04:40 +03:00
Aleksandra Martyniuk	0e73ce202e	test: wait for cql in test_two_tablets_concurrent_repair_and_migration_repair_writer_level In test_two_tablets_concurrent_repair_and_migration_repair_writer_level safe_rolling_restart returns ready cql. However, get_all_tablet_replicas uses the cql reference from manager that isn't ready. Wait for cql. Fixes: #26328 Closes scylladb/scylladb#26349	2025-10-02 06:41:36 +03:00
Avi Kivity	7230a04799	dht, sstables: replace vector with chunked_vector when computing sstable shards sstable::compute_shards_for_this_sstable() has a temporary of type std::vector<dht::token_range> (aka dht::partition_range_vector), which allocates a contiguous 300k when loading an sstable from disk. This causes large allocation warnings (it doesn't really stress the allocator since this typically happens during startup, but best to clear the warning anyway). Fix this by changing the container to chunked_vector. It is passed to dht::ring_position_range_vector_sharder, but since we're the only user, we can change that class to accept the new type. Fixes #24198. Closes scylladb/scylladb#26353	2025-10-02 00:47:42 +02:00
Michał Jadwiszczak	d92628e3bd	test/cluster/test_view_building_coordinator: skip reproducer instead of xfail The reproducer for issue scylladb/scylladb#26244 takes some time and since the test is failing, there is no point in wasting resources on it. We can change the xfail mark to skip. Refs scylladb/scylladb#26244 Closes scylladb/scylladb#26350	2025-10-01 18:33:05 +02:00
Tomasz Grabiec	c5731221c0	cql3: Fail early on vnode/tablet flavor alter Some tests expect this error. Later, prepare_options() will be changed in a way which would fail to accept new options in such case before vnode/tablet flavor change is detected, tripping the tests.	2025-10-01 16:06:52 +02:00
Tomasz Grabiec	11b4a1ab58	cql3: Extract convert_property_map() out of Cql.g So that complex code is in a .cc file for better IDE assistance.	2025-10-01 16:06:52 +02:00
Tomasz Grabiec	b6df186e54	schema: Use definition from the header instead of open-coding it	2025-10-01 16:06:52 +02:00
Tomasz Grabiec	726548b835	locator: Abstract obtaining the number of replicas from replication_strategy_config_option It will become more complex when options will contain rack lists. It's a good change regardless, as it reduces duplication and makes parsing uniform. We already diverged to use stoi / stol / stoul. The change in create_keyspace_statement.cc to add a catch clause is needed because get_replication_factor() now throws configuration_exception on parsing errors instead of std::invalid_argument, so the existing catch clause in the outer scope is not effective. That loop is trying to interpret all options as RF to run some validations. Not all options are RF, and those are supposed to be ignored.	2025-10-01 16:06:52 +02:00
Tomasz Grabiec	91e51a5dd1	cql3, locator: Use type aliases for option maps In preparation for changing their structure. 1) std::map<sstring, sstring> -> replication_strategy_config_options Parsed options. Values will become std::variant<sstring, rack_list> 2) std::map<sstring, sstring> -> property_definitions::map_type Flattened map of options, as stored in system tables.	2025-10-01 16:06:51 +02:00
Tomasz Grabiec	3c31e148c5	locator: Add debug logging	2025-10-01 16:06:28 +02:00
Benny Halevy	da6e2fdb1b	locator: Pass topology to replication strategy constructor	2025-10-01 16:06:28 +02:00
Benny Halevy	3965e29075	abstract_replication_strategy, network_topology_strategy: add replication_factor_data class Prepare for supporting also list of rack names. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-10-01 16:06:27 +02:00
Taras Veretilnyk	6d8224b726	docs: document --key-components option for getendpoints	2025-10-01 15:53:25 +02:00
Taras Veretilnyk	6381c63d65	test/nodetool/test_getendpoints: add coverage for --key-components param in getendpoints Adds a parameterized test to verify that multiple --key-components arguments are handled correctly by nodetool's getendpoints command. Ensures the constructed REST request includes all key_component values in the expected format.	2025-10-01 15:53:25 +02:00
Taras Veretilnyk	78888dd76c	nodetool: Introduce new option --key-components to specify compound partition keys as array Allows getendpoints to accept components of partition key using the --key-components option. Key components are passed as an array and sent to the new /natural_endpoints/v2/{keyspace} endpoint.	2025-10-01 15:53:25 +02:00
Taras Veretilnyk	2456ebd7c2	rest_api/test_storage_service: add v2 natural_endpoints test for composite key with multiple components Adds a test case for the `/storage_service/natural_endpoints/v2/{keyspace}` endpoint, verifying that it correctly resolves natural endpoints for a composite partition key passed as multiple `key_component` query parameters.	2025-10-01 15:53:25 +02:00
Taras Veretilnyk	89d474ba59	api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':' The original `/storage_service/natural_endpoints` endpoint uses colon-separated strings for composite keys, which causes ambiguity when key components contain colons. This commit adds a new `/storage_service/natural_endpoints/v2/{keyspace}` endpoint that accepts partition key components via repeated `key_component` query parameters to avoid this issue.	2025-10-01 15:53:25 +02:00
Taras Veretilnyk	65ade28a9c	rest_api_mock: support duplicate query parameters Previously, only the last value of a repeated query parameter was captured, which could cause inaccurate request matching in tests. This update ensures that all values are preserved by storing duplicates as lists in the `params` dict.	2025-10-01 15:53:25 +02:00
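The behaviour described in the commit above, keeping every value of a repeated query parameter, can be approximated with the standard library. The `collect_params` helper, the `cf` parameter, and the sample URL are hypothetical and do not reproduce the actual `rest_api_mock` code.

```python
from urllib.parse import parse_qs, urlsplit


def collect_params(url):
    """Return query parameters, storing repeated keys as lists of all values.

    Hypothetical helper: single-valued keys stay plain strings, while
    duplicated keys keep every value in request order.
    """
    params = {}
    for key, values in parse_qs(urlsplit(url).query).items():
        params[key] = values[0] if len(values) == 1 else values
    return params


# Illustrative URL only; the keyspace and table names are made up.
url = ("http://localhost:10000/storage_service/natural_endpoints/v2/ks"
       "?cf=t&key_component=1&key_component=val%3Aue")
print(collect_params(url))  # {'cf': 't', 'key_component': ['1', 'val:ue']}
```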
Taras Veretilnyk	b60afeaa46	test/rest_api: support multiple query values per key in RestApiSession.send() Previously, the send() method in RestApiSession only supported one value per query parameter key. This patch updates it to support passing lists of values, allowing the same key to appear multiple times in the query string (e.g. ?key=value1&key=value2).	2025-10-01 15:53:25 +02:00
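Repeating a key in the query string, as the commit above describes, can be shown with `urllib.parse.urlencode` and its `doseq` flag. This is a sketch of the general technique and not the code of `RestApiSession.send()`; the parameter values are made up.

```python
from urllib.parse import urlencode

# A list value means "emit this key once per element".
params = {"key_component": ["1", "val:ue"]}

# doseq=True expands the list into repeated key=value pairs and
# percent-encodes the colon inside the second component.
query = urlencode(params, doseq=True)
print(query)  # key_component=1&key_component=val%3Aue
```

Because each component travels as its own parameter, a colon inside a component is no longer confused with a separator.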
Taras Veretilnyk	53883958d6	nodetool: add support of new seastar query_parameters_type to scylla_rest_client	2025-10-01 15:52:18 +02:00
Avi Kivity	15fa1c1c7e	Merge 'sstables/trie: translate all key cells in one go, not lazily' from Michał Chojnowski Applying lazy evaluation to the BTI encoding of clustering keys was probably a bad default. The possible benefits are dubious (because it's quite likely that the laziness won't allow us to avoid that much work), but the overhead needed to implement the laziness is large and immediate. In this patch we get rid of the laziness. We rewrite lazy_comparable_bytes_from_clustering_position and lazy_comparable_bytes_from_ring_position so that they perform the key translation eagerly, all components to a single bytes_ostream in one synchronous call. perf_bti_key_translation (microbenchmark added in this series, 1 iteration is 100 translations of a clustering key with 8 cells of int32_type): ``` Before: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6 After: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9 ``` Enhancement, backport not required. Closes scylladb/scylladb#26302 * github.com:scylladb/scylladb: sstables/trie: BTI-translate the entire partition key at once sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset() sstables/trie: perform the BTI-encoding of position_in_partition eagerly types/comparable_bytes: add comparable_bytes_from_compound test/perf: add perf_bti_key_translation	2025-10-01 14:59:06 +03:00
Dawid Mędrek	b409e85c20	view: Stop requiring experimental feature We modify the requirements for using materialized views in tablet-based keyspaces. Before, it was necessary to enable the configuration option `rf_rack_valid_keyspaces`, have the cluster feature `VIEWS_WITH_TABLETS` enabled, and use the experimental feature `views-with-tablets`. We drop the last requirement. We adjust code to that change and provide a new validation test. We also update the user documentation to reflect the changes. Fixes scylladb/scylladb#23030	2025-10-01 09:01:53 +02:00
Dawid Mędrek	288be6c82d	db/view: Verify valid configuration for tablet-based views Creating a materialized view or a secondary index in a tablet-based keyspace requires the user to enable two options: * experimental feature `views-with-tablets`, * configuration option `rf_rack_valid_keyspaces`. Because the latter has only become a necessity recently (in this series), it's possible that there are already existing materialized views that violate it. We add a new check at start-up that iterates over existing views and makes sure that that is not the case. Otherwise, Scylla notifies the user of the problem.	2025-10-01 09:01:53 +02:00
Dawid Mędrek	00222070cd	db/view: Require rf_rack_valid_keyspaces when creating view We extend the requirements for being able to create materialized views and secondary indexes in tablet-based keyspaces. It's now necessary to enable the configuration option `rf_rack_valid_keyspaces`. This is a stepping stone towards bringing materialized views and secondary indexes with tablets out of the experimental phase. We add a validation test to verify the changes. Refs scylladb/scylladb#23030	2025-10-01 09:01:50 +02:00
Dawid Mędrek	71606ffdda	test/cluster/random_failures: Skip creating secondary indexes Materialized views are going to require the configuration option `rf_rack_valid_keyspaces` when being created in tablet-based keyspaces. Since random-failure tests still haven't been adjusted to work with it, and because it's not trivial, we skip the cases when we end up creating or dropping an index.	2025-10-01 09:01:38 +02:00
Dawid Mędrek	6322b5996d	test/cluster/mv: Mark test_mv_rf_change as skipped The test will not work with `rf_rack_valid_keyspaces`. Since the option is going to become a requirement for using views with tablets, the test will need to be rewritten to take that into consideration. Since that adjustment doesn't seem trivial, we mark the test as skipped for the time being.	2025-10-01 09:01:29 +02:00
Botond Dénes	bdca5600ef	Merge 'Prevent stalls due to large tablet mutations' from Benny Halevy Currently, replica::tablet_map_to_mutation generates a mutation having a row per tablet. With enough tablets (10s of thousands) in the table we observe reactor stalls when freezing / unfreezing such large mutations, as seen in https://github.com/scylladb/scylladb/pull/18095#issuecomment-2029246954, and I assume we would see similar stalls also when converting those mutations into canonical_mutation and back, as they are similar to frozen_mutation, and a bit more expensive since they also save the column mappings. This series takes a different approach than allowing freeze to yield. `tablet_map_to_mutation` is changed to `tablet_map_to_mutations`, able to generate multiple split mutations, that when squashed together are equivalent to the previously large mutation. Those mutations are fed into a `process_mutation` callback function, provided by the caller, which may add those mutations to a vector for further processing, and/or process them inline by freezing or making a canonical mutation. In addition, splitting the large mutations also prevents hitting the commitlog maximum mutation size. Closes scylladb/scylladb#18162 * github.com:scylladb/scylladb: schema_tables: convert_schema_to_mutations: simplify check for system keyspace tablets: read_tablet_mutations: use unfreeze_and_split_gently storage_service: merge_topology_snapshot: freeze snp.mutations gently mutation: async_utils: add unfreeze_and_split_gently mutation: add for_each_split_mutation tablets: tablet_map_to_mutations: maybe split tablets mutation tablets: tablet_map_to_mutations: accept process_func perf-tablets: change default tables and tablets-per-table perf-tablets: abort on unhandled exception	2025-10-01 07:04:09 +03:00
Ernest Zaslavsky	043d2dfb30	treewide: seastar module update Seastar module update ``` 9c07020a Merge 'http: Introduce retry strategy machinery for http client (take two)' from Ernest Zaslavsky 58404b81 http: check for abort at start of `make_request` 35a9e086 http: support per-call `retry_strategy` in `make_request` 96538b92 http: integrate `retry_strategy` into HTTP client 77c3ba14 http: initial implementation of `retry_strategy` b9b9e7bf memory: Call finish_allocation() at the end of allocate_aligned() 2052c200 Merge 'file: coroutinize some functions' from Avi Kivity 7b65e50c file: reindent after coroutinization 837f64b5 file: coroutinize dma_read_impl() 9220607b file: coroutinize dma_read_exactly_impl() d1414541 file: coroutinize set_lifetime_hint_impl() 94d8fd08 file: coroutinize get_lifetime_hint_impl() 392efff4 file: coroutinize maybe_read_eof() e68a3173 file: do_dma_read_bulk: remove "rstate" local 14ac42cd file: coroutinize do_dma_read_bulk() 5446cbab net: Use future::then_unpack() helper to unpack tuples 9e88c4d8 posix-stack: Initialize unique_ptr-s with new result directly 51fb302e rpc: connection::process() use structured binding be2c2b54 http: Explicitly deprecate request::content ``` Closes scylladb/scylladb#26342	2025-10-01 06:44:31 +03:00
Jenkins Promoter	f8c02a420d	Update pgo profiles - aarch64	2025-10-01 05:32:35 +03:00
Jenkins Promoter	b45a57f65e	Update pgo profiles - x86_64	2025-10-01 04:54:14 +03:00
Dawid Mędrek	994f09530f	test/cluster: Adjust MV tests to RF-rack-validity Some of the new tests covering materialized views explicitly disabled the configuration option `rf_rack_valid_keyspaces`. It's going to become a new requirement for views with tablets, so we adjust those tests and enable the option. There is one exception, the test: `cluster/mv/test_mv_topology_change.py::test_mv_rf_change` We handle it separately in the following commit.	2025-09-30 20:01:25 +02:00
Luis Freitas	884c584faf	Update ScyllaDB version to: 2026.1.0-dev	2025-09-30 18:54:09 +03:00
Benny Halevy	1ceb49f6c1	schema_tables: convert_schema_to_mutations: simplify check for system keyspace Currently, the function unfreezes each schema mutation partition and then checks if it's for a system keyspace. This isn't really needed since we can check the partition key using the frozen_mutation, skip it if the partition is for a system keyspace. Note that the constructed partition_key just copies the frozen partition_key_view, without copying or deserializing the actual key contents. Also, reserve `results` capacity using the queried partitions' size to prevent reallocations of the results vector. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	b17a36c071	tablets: read_tablet_mutations: use unfreeze_and_split_gently Split the tablets mutations by number of rows, based on `min_tablets_in_mutation` (currently calibrated to 1024), similar to the splitting done in `storage_service::merge_topology_snapshot`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	25b68fd211	storage_service: merge_topology_snapshot: freeze snp.mutations gently We don't need to store all snp.mutations in a vector and then freeze the whole vector. They can be frozen one at a time and collected into a vector, while maybe yielding between each mutation to prevent stalls. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	fd38cfaf69	mutation: async_utils: add unfreeze_and_split_gently Unfreeze the frozen_mutation, possibly splitting it based on max_rows. The process_mutation function is called for each split mutation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	faa0ee9844	mutation: add for_each_split_mutation Allows processing of the split mutations one at a time. This can reduce memory footprint as the caller won't have to store a vector of the split mutations and then convert it (e.g. freeze the mutations or convert them to canonical mutations). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	d21984d0cc	tablets: tablet_map_to_mutations: maybe split tablets mutation Split the generated tablets mutation if we run out of task quota to prevent stalls, both when preparing the mutations and later on when freezing/unfreezing them or converting them to canonical_mutation and back. Note that this will convert large mutation to long vectors of mutations. A followup change is considered to convert std::vector:s of mutations to chunked_vector to prevent large allocations. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:41 +03:00
Benny Halevy	aaddff5211	tablets: tablet_map_to_mutations: accept process_func Prepare for generating several mutations for the tablet_map by calling process_func for each generated mutation. This allows the caller to directly freeze those mutations one at a time into a vector of frozen mutations or similarly convert them into canonical mutations. Next patch will split large tablet mutations to prevent stalls. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:15:38 +03:00
Avi Kivity	5d1846d783	dist: scylla_raid_setup: don't override XFS block size on modern kernels In `6977064693` ("dist: scylla_raid_setup: reduce xfs block size to 1k"), we reduced the XFS block size to 1k when possible. This is because commitlog wants to write the smallest amount of padding it can, and older Linux could only write a multiple of the block size. Modern Linux [1] can O_DIRECT overwrite a range smaller than a filesystem block. However, this doesn't play well with some SSDs that have 512 byte logical sector size and 4096 byte physical sector size - it causes them to issue read-modify-writes. To improve the situation, if we detect that the kernel is recent enough, format the filesystem with its default block size, which should be optimal. Note that commitlog will still issue sub-4k writes, which can translate to RMW. There, we believe that the amplification is reduced since sequential sub-physical-sector writes can be merged, and that the overhead from commitlog space amplification is worse than the RMW overhead. Tested on AWS i4i.large. 
fsqual report: ``` memory DMA alignment: 512 disk DMA alignment: 512 filesystem block size: 4096 context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0.0003 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.7961 (BAD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0.0001 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 (GOOD) context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8006 (BAD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0.0001 (GOOD) context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD) ``` The sub-block overwrite cases are GOOD. 
In comparison, the fsqual report for 1k (similar): ``` memory DMA alignment: 512 disk DMA alignment: 512 filesystem block size: 1024 context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0.0005 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.7948 (BAD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0015 (GOOD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0022 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.4999 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0 (GOOD) context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.798 (BAD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0012 (GOOD) context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0019 (GOOD) context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.5 (BAD) context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD) context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD) context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD) ``` Fixes #25441. [1] `ed1128c2d0` Closes scylladb/scylladb#25445	2025-09-30 17:14:36 +03:00
Benny Halevy	3c07e0e877	perf-tablets: change default tables and tablets-per-table tablets-per-table must be a power of 2, so round up 10000 to 16K. also, reduce number of tables to have a total of about 100K tablets, otherwise we hit the maximum commitlog mutation size limit in save_tablet_metadata. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:07:06 +03:00
Benny Halevy	2c3fb341e9	perf-tablets: abort on unhandled exception Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-09-30 17:07:06 +03:00
Nadav Har'El	926089746b	message: move RPC compression from utils/ to message/ The directory utils/ is supposed to contain general-purpose utility classes and functions, which are either already used across the project, or are designed to be used across the project. This patch moves 8 files out of utils/: utils/advanced_rpc_compressor.hh utils/advanced_rpc_compressor.cc utils/advanced_rpc_compressor_protocol.hh utils/stream_compressor.hh utils/stream_compressor.cc utils/dict_trainer.cc utils/dict_trainer.hh utils/shared_dict.hh These 8 files together implement the compression feature of RPC. None of them are used by any other Scylla component (e.g., sstables have a different compression), or are ready to be used by another component, so this patch moves all of them into message/, where RPC is implemented. Theoretically, we may want in the future to use this cluster of classes for some other component, but even then, we shouldn't just have these files individually in utils/ - these are not useful stand-alone utilities. One cannot use "shared_dict.hh" assuming it is some sort of general-purpose shared hash table or something - it is completely specific to compression and zstd, and specifically to its use in those other classes. Beyond moving these 8 files, this patch also contains changes to: 1. Fix includes to the 5 moved header files (.hh). 2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt for the three moved source files (.cc). 3. In the moved files, change from the "utils::" namespace, to the "netw::" namespace used by RPC. Also needed to change a bunch of callers for the new namespace. Also, had to add "utils::" explicitly in several places which previously assumed the current namespace is "utils::". Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25149	2025-09-30 17:03:09 +03:00
Pavel Emelyanov	269aaee1b4	Merge 'test: dtest: test_limits.py: migrate from dtest' from Dario Mirovic This PR migrates limits tests from dtest to this repository. One reason is that there is an ongoing effort to migrate tests from dtest to here. Debug logs are enabled on `test_max_cells` for `lsa-timing` logger, to have more information about memory reclaim operation times and memory chunk sizes. This will allow analysis of their value distributions, which can be helpful with debugging if the issue reoccurs. Also, scylladb keeps sql files with metrics which, with some modifications, can be used to track metrics over time for some tests. This would show if there are pauses and spikes or the test performance is more or less consistent over time. scylla-dtest PR that removes migrated tests: [limits_test.py: remove tests already ported to scylladb repo #6232](https://github.com/scylladb/scylla-dtest/pull/6232) Fixes #25097 This is a migration of existing tests to this repository. No need for backport. Closes scylladb/scylladb#26077 * github.com:scylladb/scylladb: test: dtest: limits_test.py: test_max_cells log level test: dtest: limits_test.py: make the tests work test: dtest: test_limits.py: remove test that are not being migrated test: dtest: copy unmodified limits_test.py	2025-09-30 16:57:32 +03:00
Piotr Dulikowski	5e5a3c7ec5	view_building_worker.cc: fix spelling (commiting -> committing) The typo is reported by GitHub action on each PR, so let's fix it to reduce the noise for everybody. Closes scylladb/scylladb#26329	2025-09-30 16:47:03 +03:00
Emil Maskovsky	b0de054439	docs: fix typos and spelling errors Corrected spelling mistakes, typos, and minor wording issues to improve the developer documentation. No backport: There is no functional change, and the doc is mostly relevant to master, so it doesn't need to be backported. Closes scylladb/scylladb#26332	2025-09-30 13:16:49 +02:00
Avi Kivity	72609b5f69	Merge 'mv: generate view updates on pending replica' from Michael Litvak Generate view updates from a pending base replica if it's a reading replica, i.e. it's in the last stage of transition write_both_read_new before becoming the new base replica. Previously we didn't generate view updates on a pending replica. The problem with that is that when a base token is migrated from one replica B1 to another B2, at one stage we generate view updates only from B1, then at the next stage we generate view updates only from B2. During this transition, it can happen that for some write neither B1 nor B2 generates a view update, because each one sees the other as the base replica. We fix this by generating view updates from both base replicas in the phase before the transition. We can generate view updates on the pending replica in this case, even if it requires read-before-write, because it's in a stage where it contains all data and serves reads. Fixes https://github.com/scylladb/scylladb/issues/24292 backport not needed - the issue mostly affects MV with tablets which is still experimental Closes scylladb/scylladb#25904 * github.com:scylladb/scylladb: test: mv: test view update during topology operations mv: generate view updates on both shards in intranode migration mv: generate view updates on pending replica	2025-09-30 13:17:16 +03:00
Piotr Wieczorek	4be0bdbc07	alternator: Don't emit a redundant REMOVE event in Alternator Streams for PutItem calls Until now, every PutItem operation appeared in the Alternator Streams as two events - a REMOVE and a MODIFY. DynamoDB Streams emits only INSERT or MODIFY, depending on whether a row was replaced, or created anew. A related issue scylladb#6918 concerns distinguishing the mutation type properly. This was because each call to PutItem emitted the two CDC rows, returned by GetRecords. Since this patch, we use a collection tombstone for the `:attrs` column, and a separate tombstone for each regular column in the table's schema. We don't expect that new tables would have any other regular column, except for the `:attrs` and keys, but we may encounter them in upgraded tables which had old GSIs or LSIs. Fixes: scylladb#6930. Closes scylladb/scylladb#24991	2025-09-30 13:12:16 +03:00
Michał Hudobski	ae4d4908ba	configure: increase SCHEDULING_GROUPS_COUNT to 20 We would like to have an additional service level available for users of the Vector Store service, which would allow us to de/prioritize vector operations as needed. To allow that, we increase the number of scheduling groups from 19 to 20 and adjust the related test accordingly. Closes scylladb/scylladb#26316	2025-09-30 12:41:28 +03:00
Nadav Har'El	38002718a9	cqlpy: improve testing for "duration" column type We had very rudimentary tests for the "duration" CQL type in the cqlpy framework - just for reproducing issue #8001. But we left two alternative formats, and a lot of corner cases, untested. So this patch aims to add the missing tests - to exhaustively cover the "duration" literal formats and their capabilities. Some of the examples tested in the new test are inspired by Cassandra's unit test test/unit/org/apache/cassandra/cql3/DurationTest.java and the corner cases that this file covers. However, the new tests are not a direct translation of that file because DurationTest.java was not a CQL test - it was a unit test of Cassandra's internal "Duration" type, so could not be directly translated into a CQL-based test. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25092	2025-09-30 12:25:02 +03:00
Botond Dénes	efd99bb0af	Merge 'Return tablet ranges from range_to_endpoint_map API' from Pavel Emelyanov The handler in question when called for tablets-enabled keyspace, returns ranges that are inconsistent with those from system.tablets. Like this: system.tablets: ``` TabletReplicas(last_token=-4611686018427387905, replicas=[('e43ce450-2834-4137-92b7-379bb37684d1', 0), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 0)]) TabletReplicas(last_token=-1, replicas=[('22c84cba-d8d0-4d20-8d46-eb90865bb612', 0), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 1)]) TabletReplicas(last_token=4611686018427387903, replicas=[('22c84cba-d8d0-4d20-8d46-eb90865bb612', 1), ('67c82fc2-8ef9-4dd9-8cf6-c7f9372ce207', 1)]) TabletReplicas(last_token=9223372036854775807, replicas=[('e43ce450-2834-4137-92b7-379bb37684d1', 1), ('22c84cba-d8d0-4d20-8d46-eb90865bb612', 0)]) ``` range_to_endpoint_map: ``` {'key': ['-9069053676502949657', '-8925522303269734226'], 'value': ['127.110.40.2', '127.110.40.3']} {'key': ['-8925522303269734226', '-8868737574445419305'], 'value': ['127.110.40.2', '127.110.40.3']} ... {'key': ['-337928553869203886', '-288500562444694340'], 'value': ['127.110.40.1', '127.110.40.3']} {'key': ['-288500562444694340', '105026475358661740'], 'value': ['127.110.40.1', '127.110.40.3']} {'key': ['105026475358661740', '611365860935890281'], 'value': ['127.110.40.1', '127.110.40.3']} ... {'key': ['8307064440200319556', '9117218379311179096'], 'value': ['127.110.40.2', '127.110.40.1']} {'key': ['9117218379311179096', '9125431458286674075'], 'value': ['127.110.40.2', '127.110.40.1']} ``` Not only the number of ranges differs, but also separating tokens do not match (e.g. tokens -2 and 0 belong to different tablets according to system.tablets, but fall into the same "range" in the API result). The source of confusion is that despite storage_service::get_range_to_address_map() is given correct e.r.m. pointer from the table, it still uses token_metadata::sorted_token() to work with. 
The fix is -- when the e.r.m. is per-table, the tokens should be taken from token_metadata's tablet_map (e.g. compare this to storage_service::effective_ownership() -- it grabs tokens differently for vnodes/tables cases). This PR fixes the mentioned problem and adds a validation test. The test also checks the /storage_service/describe_ring endpoint, which happens to return the correct set of values. The API is very old, so the bug is present in all versions with tablets. Fixes #26331 Closes scylladb/scylladb#26231 * github.com:scylladb/scylladb: test: Add validation of data returned by /storage_service endpoints test,lib: Add range_to_endpoint_map() method to rest client api: Indentation fix after previous patches storage_service: Get tablet tokens if e.r.m. is per-table storage_service,api: Get e.r.m. inside get_range_to_address_map() storage_service: Calculate tokens on stack	2025-09-30 11:20:35 +03:00
Michał Jadwiszczak	3bbbbf419b	test/cluster/test_view_building_coordinator: add reproducer for staging sstables with tablet merge The test verifies if staging sstables are processed correctly after tablet merge. Refs scylladb/scylladb#26244 Closes scylladb/scylladb#26286	2025-09-30 09:05:31 +02:00
Avi Kivity	4d9271df98	Merge 'sstables: introduce sstable version `ms`' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation. This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version. (Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)). The high-level structure of the PR is: 1. Introduce new component types — `Partitions` and `Rows`. 2. Teach `class sstable` to open them when they exist. 3. Teach the sstable writer how to write index data to them. 4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead). 5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`. 6. Prepare unit tests for the appearance of `ms`. 7. Enable `ms` in unit tests. 8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled). 9. Prepare integration tests for the appearance of `ms`. 10. Enable both `ms` and `me` in tests where we want both versions to be tested. This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. 
It can be enabled by setting `sstable_format: ms` in the config. Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`. This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row. `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 70 18.0 18 128 500001 1 1 647 19.0 19 132 0 1000000 1 748 15.0 15 116 0 500000 2 372 29.0 29 284 0 250000 4 227 56.0 56 504 0 125000 8 116 106.0 106 928 0 62500 16 67 195.0 195 1732 ``` `build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`: ``` offset stride rows iterations avg aio aio (KiB) 500000 1 1 51 5.1 5 20 500001 1 1 64 5.3 5 20 0 1000000 1 679 4.0 4 16 0 500000 2 492 8.0 8 88 0 250000 4 804 16.0 16 232 0 125000 8 409 31.0 31 516 0 62500 16 97 54.0 54 1056 ``` Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`: Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`. For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`. 
External tests: I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed. New functionality, no backport needed. Closes scylladb/scylladb#26215 * github.com:scylladb/scylladb: test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes test/cluster: add test_bti_index.py test: prepare bypass_cache_test.py for `ms` sstables sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write` db/config: expose "ms" format to the users via database config test: in Python tests, prepare some sstable filename regexes for `ms` sstables: add `ms` to `all_sstable_versions` test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests test/lib/index_reader_assertions: skip some row index checks for BTI indexes test/boost/sstable_inexact_index_test: explicitly use a `me` sstable test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables test/resource: add `ms` sample sstable files for relevant tests test/boost/sstable_compaction_test: prepare for `ms` sstables. test/boost/index_reader_test: prepare for `ms` sstables test/boost/bloom_filter_tests: prepare for `ms` sstables test/boost/sstable_datafile_test: prepare for `ms` sstables test/boost/sstable_test: prepare for `ms` sstables. 
sstables: introduce `ms` sstable format version tools/scylla-sstable: default to "preferred" sstable version, not "highest" sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller sstables/mx: make Index and Summary components optional sstables: open Partitions.db early when it's needed to populate key range for sharding metadata sstables: adapt sstable::set_first_and_last_keys to sstables without Summary sstables: implement an alternative way to rebuild bloom filters for sstables without Index utils/bloom_filter: add `add(const hashed_key&)` sstables: adapt estimated_keys_for_range to sstables without Summary sstables: make `sstable::estimated_keys_for_range` asynchronous sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary replica/database: add table::estimated_partitions_in_range() sstables/mx: implement sstable::has_partition_key using a regular read sstables: use BTI index for queries, when present and enabled sstables/mx/writer: populate BTI index files sstables: create and open BTI index files, when enabled sstables: introduce Partition and Rows component types sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`	2025-09-30 09:40:02 +03:00
Piotr Dulikowski	4581c72430	Merge 'lwt: prohibit for tablet-based views and cdc logs' from Petr Gusev `SELECT` commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees and in general don't make much sense. In this PR we prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based views for compatibility. Similar logic is applied to CDC log tables. We also add a general check that disallows colocating a table with another colocated table, since this is not needed for now. Fixes https://github.com/scylladb/scylladb/issues/26258 backports: not needed (a new feature) Closes scylladb/scylladb#26284 * github.com:scylladb/scylladb: cql_test_env.cc: log exception when callback throws lwt: prohibit for tablet-based views and cdc logs tablets: disallow chains of colocated tables database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda	2025-09-30 07:15:16 +02:00
Michał Chojnowski	771a82969e	test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes Adds a test for the bloom filter rebuild mechanism in `ms` sstables.	2025-09-29 22:15:26 +02:00
Michał Chojnowski	c1e6cd58fa	test/cluster: add test_bti_index.py Add a test which checks that `ms` sstables can be enabled and disabled.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	7a2dc9cbfa	test: prepare bypass_cache_test.py for `ms` sstables The test looks at metrics to confirm whether queries hit the row cache, the index cache or the disk, depending on various settings. BIG index readers use a two level, read-through index cache, where the higher layer stores parsed "index pages" of Index.db, while the lower layer is a cache of raw 4 kiB file pages of Index.db. Therefore, if we want to count index cache hits, the appropriate metric to check in this case is `scylla_sstables_index_page_hits`, which counts hits in the higher layer. This is done by the test. However, BTI index readers don't have an equivalent of the higher cache layer. Their cache only stores the raw 4 kiB pages, and the hits are counted in `scylla_sstables_index_page_cache_hits`. (The same metric is incremented by the lower layer of the BIG index cache). Before this commit, the test would fail with `ms` sstables, because their reads don't increment `scylla_sstables_index_page_hits`. In this commit we adapt the test so that it instead checks `scylla_sstables_index_page_cache_hits` for `ms` sstables.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	621bfbe6d9	sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present test_column_family.py::test_sstables_by_key_reader_closed injects a failure into `index_reader::advance_lower_and_check_if_present`. To preserve this test when BTI indexes are made the default, we have to add a corresponding error injection to `bti_index_reader::advance_lower_and_check_if_present`.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	182c8ce87b	test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables BIG sstables and BTI sstables use different code paths for validating the Data file against the index. So we want to test both types of indexes, not just the default one. This patch changes the test so that it explicitly tests both `me` and `ms` instead of only testing the default format. Note that we disable some tests for BTI indexes: the tests which check that validation detects mismatches between the row index ("promoted index") and the Data file. This is because iteration over the row index in BTI isn't implemented at the moment, so for BTI the validation behaves as if there were no row indexes.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	aed1cb6f65	tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write` A useful option in general, and I'll need it to test multiple versions in `test_sstable_validation.py`.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	ef11dc57c1	db/config: expose "ms" format to the users via database config Extend the `sstable_format` config enum with a "ms" value, and, if it's enabled (in the config and in cluster features), use it for new sstables on the node. (Before this commit, writing `ms` sstables should only be possible in unit tests, via internal APIs. After this commit, the format can be enabled in the config and the database will write it during normal operation). As of this commit, the new format is not the default yet. (But it will become the default in a later commit in the same series).	2025-09-29 22:15:25 +02:00
Michał Chojnowski	2ed2033224	test: in Python tests, prepare some sstable filename regexes for `ms`	2025-09-29 22:15:25 +02:00
Michał Chojnowski	fe9f5f4da2	sstables: add `ms` to `all_sstable_versions` Add `ms` to the lists of sstable formats. This will cause it to be included in various unit tests.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	9155eeed10	test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests Add `ms` to tests which already test many format versions. The tests check that sstable files in newer versions are the same as in `mc`. Arbitrarily, for `ms`, we only check the files common between `mc` and `ms`. If we want to extend this test more, so that it checks that `Partitions.db` and `Rows.db` don't change over time, we have to add `ms` versions of all the sstables under `test/resources` which are used in this test. We won't do that in this patch series. And I'm not sure if we want to do that at all.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	70a6c9481b	test/lib/index_reader_assertions: skip some row index checks for BTI indexes Block monotonicity checks can't be implemented for BTI row indexes because they don't store full clustering positions, only some encoded prefixes. The emptiness check could be implemented with some effort, but we currently don't bother. The two tests which use this `is_empty()` method aren't very useful anyway. (They check that the promoted index is empty when there are no clustering keys. That doesn't really need a dedicated test).	2025-09-29 22:15:25 +02:00
Michał Chojnowski	d53f362328	test/boost/sstable_inexact_index_test: explicitly use a `me` sstable The test currently implicitly uses the default sstable format. But it assumes that the index reader type is `sstables::index_reader`, and it wants some methods specific to that type (and absent from the base `abstract_index_reader`). If we switch the default format from `me` to `ms`, without doing something about this, this test will start failing on the downcast to `sstables::index_reader`. We deal with this by explicitly specifying `me`. `me` and `ms` data readers are identical. And this is a test of the data reader, not the index reader. So it's perfectly fine to just use `me`.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	fca56cb458	test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables This is an old test for some workaround for incorrectly-generated promoted indexes. It doesn't make sense to port this test to newer sstable formats. So just skip it for the new sstable versions.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	1f7882526b	test/resource: add `ms` sample sstable files for relevant tests There are some tests which want sstables of all format versions in `test/resource`. This patch adds `ms` files for those tests. I didn't think much about this change, I just mechanically generated the `ms` files from the existing `me` sstables in the same directories (using `scylla sstable upgrade`) for the tests which were complaining about the lack of `ms` files.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	6143dce3db	test/boost/sstable_compaction_test: prepare for `ms` sstables. Fix incompatibilities between the test's assumptions and the upcoming addition of `ms` sstables. Refer to individual tests for comments.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	622149a183	test/boost/index_reader_test: prepare for `ms` sstables Adjust the incompatibilities between the test and the upcoming `ms` sstables. Refer to individual test for comments.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	a67d10d15d	test/boost/bloom_filter_tests: prepare for `ms` sstables The test for the bloom filter rebuild mechanism has to be adjusted, because `ms` sstables won't use this mechanism.	2025-09-29 22:15:25 +02:00
Michał Chojnowski	312423fe53	test/boost/sstable_datafile_test: prepare for `ms` sstables The tests touched in this commit are concerned specifically with Summary. They are not applicable to sstables with BTI indexes.	2025-09-29 22:15:24 +02:00
Michał Chojnowski	924b8eec11	test/boost/sstable_test: prepare for `ms` sstables. Skip `ms` sstables in an uninteresting test which relies on `sstables::index_reader`.	2025-09-29 22:15:24 +02:00
Michał Chojnowski	db4283b542	sstables: introduce `ms` sstable format version Introduce `ms` -- a new sstable format version which is a hybrid of Cassandra's `me` and `da`. It is based on `me`, but with the index components (Summary.db and Index.db) replaced with the index components of `da` (Partitions.db and Rows.db). As of this patch, the version is never chosen anywhere for writing sstables yet. It is only introduced. We will add it to unit tests in a later commit, and expose it to users in yet later commit.	2025-09-29 22:15:24 +02:00
Michał Chojnowski	17085dc1e4	tools/scylla-sstable: default to "preferred" sstable version, not "highest" Later in this patch series we will introduce `ms` as the new highest format, but we won't be able to make it the default within the same series due to some dtest incompatibilities. Until `ms` is the default, we don't want `scylla sstable` to default to it, even though it's the highest. Let's choose the default version in `scylla sstable` using the same method which is used by Scylla in general: by letting the `sstable_manager` choose.	2025-09-29 22:13:59 +02:00
Nadav Har'El	3a5475afb7	Merge 'metrics, vector search: add metrics to the vector store client' from Michał Hudobski This PR adds metrics to the vector store client, as described in https://scylladb.atlassian.net/wiki/spaces/RND/pages/86245395/Vector+Store+Core+APIs#Metrics: - number of DNS refreshes We would like to track the DNS refreshes to see if the network client is working properly. Here is the added metric: \# HELP scylla_vector_store_dns_refreshes Number of DNS refreshes \# TYPE scylla_vector_store_dns_refreshes gauge scylla_vector_store_dns_refreshes{shard="0"} 1.000000 Fixes: VECTOR-68 Closes scylladb/scylladb#25288 * github.com:scylladb/scylladb: metrics, test: added a test case for vs metrics metrics, vector_search: add a dns refresh metric vector_search: move the ann implementation to impl	2025-09-29 22:31:03 +03:00
Petr Gusev	29f9c355ab	cql_test_env.cc: log exception when callback throws When a test fails inside a do_with_cql_env callback, the logs don’t make it clear where the failure happened. This is because cql_env immediately begins shutting down services, which obscures the original failure.	2025-09-29 17:53:36 +02:00
Lakshmi Narayanan Sreethar	7b97928152	cmake: link `vector_search` to `test-lib` instead of `cql3` PR #26237 fixed linker errors by linking `cql3` to `vector_search` but this introduced a circular dependency between these two static libraries, sometimes causing failures during compilation : ``` ninja: error: dependency cycle: /home/user/Development/scylladb/build/debug/cql3/CqlParser.hpp -> data_dictionary/libdata_dictionary.a -> data_dictionary/CMakeFiles/data_dictionary.dir/data_dictionary.cc.o -> /home/user/Development/scylladb/build/debug/cql3/CqlParser.hpp ``` So, instead of linking the `vector_search` library to the `cql3` library, link it directly to the executable where the `cql3` library is also to be linked. For the test cases, this means linking `vector_search` to the `test-lib` library. Since both `vector_search` and `cql3` are static libraries, the linker will resolve them correctly regardless of the order in which they are linked. Refs #26235 Refs #26237 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26318	2025-09-29 17:46:58 +03:00
Botond Dénes	fe73c90df9	tools/scylla-sstable: fix doc links The doc links in scylla-sstable help output are static, so they always point to the documentation of the latest stable release, not to the documentation of the release the tool binary is from. On top of that, the links point to old open-source documentation, which is now EOL. Fix both problems: point link at the new source-available documentation pages and make them version aware.	2025-09-29 17:34:37 +03:00
Botond Dénes	15a4a9936b	release: adjust doc_link() for the post source-available world There is no more separate enterprise product and the doc urls are slightly different.	2025-09-29 17:02:55 +03:00
Botond Dénes	5a69838d06	tools/scylla-nodetool: remove trailing " from doc urls They are accidental leftover from a previous way of storing command descriptions.	2025-09-29 17:02:40 +03:00
Benny Halevy	b81c6a339b	test_tablets_merge: test_tablet_split_merge_with_many_tables: reduce number of tables in debug mode As the test hits timeouts in debug mode on aarch64. Fixes #26252 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#26303	2025-09-29 15:30:13 +03:00
Michael Litvak	d94c1f6674	test: mv: test view update during topology operations add new test cases checking view consistency when writing to a table with MV and generating view updates while data is migrated. one case has tablet migrations while writing to the table. The other case does the equivalent for vnode keyspaces - it adds a new node. The tests reproduce issue scylladb/scylladb#24292	2025-09-29 13:44:04 +02:00
Michael Litvak	c9237bf5f6	mv: generate view updates on both shards in intranode migration Similarly to the issue of tokens migrating from one host to another, where we need to generate view updates on both replicas before transitioning in order to not lose view updates, we need to do the same in case of intranode migration. In intranode migration we migrate tokens from one shard to another. Previously we checked shard_for_reads in order to generate view updates only on the single shard that is selected for reads, and not on a pending shard that is not ready yet. The problem is that shard_for_reads switches from the source shard to the destination shard in a single transition, and during that switch we can lose view updates because neither shard sees itself as the shard for reads. We fix this by having a phase before the transition when both shards are ready for reads and both will generate view updates.	2025-09-29 13:44:04 +02:00
Michael Litvak	d842ea2dc9	mv: generate view updates on pending replica Generate view updates from a pending base replica if it's a reading replica, i.e. it's in the last stage of transition write_both_read_new before becoming the new base replica. Previously we didn't generate view updates on a pending replica. The problem with that is that when a base token is migrated from one replica B1 to another B2, at one stage we generate view updates only from B1, then at the next stage we generate view updates only from B2. During this transition, it can happen that for some write neither B1 nor B2 generate view update, because each one sees the other as the base replica. We fix this by generating view updates from both base replicas in the phase before the transition. We can generate view updates on the pending replica in this case, even if it requires read-before-write, because it's in a stage where it contains all data and serves reads. Fixes scylladb/scylladb#24292	2025-09-29 13:44:04 +02:00
Botond Dénes	7a773da425	Merge 'Speed up test cluster/test_alternator::test_localnodes_joining_nodes' from Nadav Har'El Before this patch, the test `cluster/test_alternator::test_localnodes_joining_nodes` was one of the slowest tests in the test/cluster framework, taking over two minutes to run. As comments in the test already acknowledged, there was no good reason why this test had to be so slow. The test needed to, intentionally, boot a server which took a long time (2 minutes) to fail its boot. But it didn't really need to wait for this failure - the right thing to do was to just kill the server at the end of the test. But we just didn't have the test-framework API to do it. So in this series, the first patch introduces the missing API, and the second patch uses it to fix test_localnodes_joining_nodes to kill the (unsuccessfully) booting server. After this patch, the test takes just 7 seconds to run. This is a test speedup only, so no real need to backport it - old release anyway get fewer test runs and the latency of these runs is less important. Closes scylladb/scylladb#25312 * github.com:scylladb/scylladb: test/cluster: greatly speed up test_localnodes_joining_nodes test/pylib: add the ability to stop currently-starting servers	2025-09-29 14:34:34 +03:00
Dawid Mędrek	d6fcd18540	test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces The test cases in the file aren't run via an existing interface like `do_with_cql_env`, but they rely on a more direct approach -- calling one of the schema loader tools. Because of that, they manage the `db::config` object on their own and don't enable the configuration option `rf_rack_valid_keyspaces`. That hasn't been a problem so far since the test doesn't attempt to create RF-rack-invalid keyspaces anyway. However, in an upcoming commit, we're going to further restrict views with tablets and require that the option is enabled. To prepare for that, we enable the option in all test cases. It's only necessary in a small subset of them, but it won't hurt to enforce it everywhere, so let's do that. Refs scylladb/scylladb#23958	2025-09-29 13:07:08 +02:00
Dawid Mędrek	a1254fb6f3	db/view: Name requirement for views with tablets We add a named requirement, a function, for materialized views with tablets. It decides whether we can create views and secondary indexes in a given keyspace. It's a stepping stone towards modifying the requirements for it. This way, we keep the code in one place, so it's not possible to forget to modify it somewhere. It also makes it more organized and concise.	2025-09-29 13:07:08 +02:00
Michał Chojnowski	4ca215abbc	sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader Partitions.db uses a piece of the murmur hash of the partition key internally. The same hash is used to query the bloom filter. So to avoid computing the hash twice (which involves converting the key into a hashable linearized form) it would make sense to use the same `hashed_key` for both purposes. This is what we do in this patch. We extract the computation of the `hashed_key` from `make_pk_filter` up to its parent `sstable_set_impl::create_single_key_sstable_reader`, and we pass this hash down both to `make_pk_filter` and to the sstable reader. (And we add a pointer to the `hashed_key` as a parameter to all functions along the way, to propagate it). The number of parameters to `mx::make_reader` is getting uncomfortable. Maybe they should be packed into some structs.	2025-09-29 13:01:22 +02:00
Michał Chojnowski	420e215873	sstables/trie/bti_index_reader: allow the caller to pass a precalculated murmur hash Partitions.db internally uses a piece of the partition key murmur hash (the same hash which is used to compute the token and the relevant bits in the bloom filter). Before this patch, the Partitions.db reader computes the hash internally from the `sstables::partition_key`. That's a waste, because this hash is usually also computed for bloom filter purposes just before that. So in this patch we let the caller pass that hash instead. The old index interface, without the hash, is kept for convenience. In this patch we only add a new interface, we don't switch the callers to it yet. That will happen in the next commit.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	cee4011e7a	sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller Partitions.db internally uses a piece of the partition key murmur hash (the same hash which is used to compute the token and the relevant bits in the bloom filter). Before this patch, the Partitions.db writer computes the hash internally from the `sstables::partition_key`. That's a waste, because this hash is also computed for bloom filter purposes just before that, in the owning sstable writer. So in this patch we let the caller pass that hash here instead.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	f8e3d5e7c2	sstables/mx: make Index and Summary components optional In previous patches we (hopefully) modified all users of Index and Summary components so that they no longer need those components to exist. (And can use Partitions and Rows components instead).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	f003cbce6d	sstables: open Partitions.db early when it's needed to populate key range for sharding metadata If there's no metadata file with sharding metadata, the owning shards of an sstable are computed based on the partition key range within the sstable. This range is set in `set_first_and_last_keys()`, which (since another commit in this commit series) reads it either from the Summary component or from the footer of the Partitions component, whichever is available. But in some code paths `set_first_and_last_keys()` is called before the footer of Partitions is loaded. If the sstable doesn't have Summary, only Partitions, then the `set_first_and_last_keys()` will fail. To prevent that, in those cases we have to open the file and read its footer early, before the `set_first_and_last_keys()` calls. Note: the changes in this commit shouldn't matter during normal operation, in which a Scylla component with sharding metadata is available. But it might be used when old and/or incomplete sstables are read.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	4bdf5ca0cf	sstables: adapt sstable::set_first_and_last_keys to sstables without Summary `sstable::set_first_and_last_keys` currently takes the first and last key from the Summary component. But if only BTI indexes are used, this component will be nonexistent. In this case, we can use the first and last keys written in the footer of Partitions.db.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	b1984d6798	sstables: implement an alternative way to rebuild bloom filters for sstables without Index For efficiency, the cardinality of the bloom filter (i.e. the number of partition keys which will be written into the sstable) has to be known before elements are inserted into the filter. In some cases (e.g. memtable flush) this number is known exactly. But in others (e.g. repair) it can only be estimated, and the estimation might be very wrong, leading to an oversized filter. Because of that, some time ago we added a piece of logic (run after the sstable is written, but before it's sealed) which looks at the actual number of written partitions, compares it to the initial estimate (on which the size of the bloom filter was based), and if the difference is unacceptably large, it rewrites the bloom filter from partition keys contained in Index.db. But the idea to rebuild the bloom filters from index files isn't going to work with BTI indexes, because they don't store whole partition keys. If we want sstables which don't have Index.db files, we need some other way to deal with oversized filters. Partition keys can be recovered from Data.db, but that would often be way too expensive. This patch adds another way. We introduce a new component file, TemporaryHashes. This component, if written at all, contains the 16-byte murmur hash for every partition key, in order, and can be used in place of Index to reconstruct the bloom filter. (Our bloom filters are actually built from the set of murmur hashes of partition keys. The first step of inserting a partition key into a filter is hashing the key. Remembering the hashes is sufficient to build the filter later, without looking at partition keys again.) As of this patch, if the Index component is not being written, we don't allocate and populate a bloom filter during the Data.db write. 
Instead, we write the murmur hashes to TemporaryHashes, and only later, after the Data write finishes, we allocate the optimal-size bloom filter, read the hashes back from TemporaryHashes, and populate the filter with them. That is suboptimal. Writing the hashes to disk (or worse, to S3) and reading them back is more expensive than building the bloom filter during the main Data pass. So ideally it should be avoided in cases where we know in advance that the partition key count estimate is good enough. (Which should be the case in flushes and compactions). But we defer that to a future patch. (Such a change would involve passing some flag to the sstable writer if the cardinality estimate is trustworthy, and not creating TemporaryHashes if the estimate is trustworthy).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	c549afa1a9	utils/bloom_filter: add `add(const hashed_key&)` In one of the next patches, we will want to use (in BTI partition index writer) the same hash as used by the bloom filter, and we'll also want to allow rebuilding the filter in a second pass (after the whole sstable is written) from hashes (as opposed to rebuilding from partition keys saved in Index.db, which is something we sometimes do today) saved to a temporary file. For those, we need an interface that allows us to compute the hash externally, and only pass the hash to `add()`.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	3c83914814	sstables: adapt estimated_keys_for_range to sstables without Summary Before this patch, `estimated_keys_for_range` assumes the presence of the Summary component. But we want to make this component optional in this series. This patch adds a second branch to this function, for sstables which don't have a BIG index (in particular, Summary component), but have a BTI index (Partitions component). In this case, instead of calculating the estimate as "fraction of summary overlapping with given range, multiplied by the total key estimate", we calculate it as "fraction of Data file overlapping with given range, multiplied by the total key estimate". (With an extra conditional for the special case when the given range doesn't overlap with the sstable's range at all. In this case, if the ranges are adjacent, the main path could easily return "1 partition" instead of "0 partitions", due to the inexactness of BTI indexes for range queries. Returning something non-zero in this case would be unfortunate, so the extra conditional makes sure that we return 0).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	55c4b89b88	sstables: make `sstable::estimated_keys_for_range` asynchronous Currently, `sstable::estimated_keys_for_range` works by checking what fraction of Summary is covered by the given range, and multiplying this fraction by the number of all keys. Since computing things on Summary doesn't involve I/O (because Summary is always kept in RAM), this is synchronous. In a later patch, we will modify `sstable::estimated_keys_for_range` so that it can deal with sstables that don't have a Summary (because they use BTI indexes instead of BIG indexes). In that case, the function is going to compute the relevant fraction by using the index instead of Summary. This will require making the function asynchronous. This is what we do in this patch. (The actual change to the logic of `sstable::estimated_keys_for_range` will come in the next patch. In this one, we only make it asynchronous).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	70994170e2	sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary `sstable::get_estimated_key_count()` estimates the partition count from the size of Summary, and the interval between Summary entries. But we want to allow writing sstables without a Summary (i.e. sstables that use BTI indexes instead of BIG indexes), so we want a way to get the key count without involving Summary. For that, we can use the `estimated_partition_size` histogram in Statistics. By counting the histogram entries, we get the exact number of partitions in the sstable.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	68c33c0173	replica/database: add table::estimated_partitions_in_range() Add a function which computes an estimated number of partitions in the given token range. We will use this helper in a later patch to replace a few places in the code which de facto do the same thing "manually".	2025-09-29 13:01:21 +02:00
Michał Chojnowski	5f4b9a03d1	sstables/mx: implement sstable::has_partition_key using a regular read A BTI index isn't able to determine if a given key is present in the sstable, because it doesn't store full keys. (It only stores prefixes of decorated keys, so it might give false positives). If the sstable only has a BTI index, and no BIG index, then `sstable::has_partition_key()` will have to be implemented with something other than just the index reader. We might as well ignore the index in any case and just check that a regular data read for the given partition returns a non-empty result. `sstable::has_partition_key` is only used in the `column_family/sstables/by_key` REST API call that nobody uses anyway, so there is no point in trying to make special optimizations for it.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	893eb4ca1f	sstables: use BTI index for queries, when present and enabled This patch teaches `sstable::make_index_reader` how to create a BTI index reader from the `Partitions.db` and `Rows.db` components, if they exist (in which case they are opened by this point).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	e0fda9ae6f	sstables/mx/writer: populate BTI index files In the previous patch we added code responsible for creating and opening Partitions.db and Rows.db, but we left those files empty. In this patch, we populate the files using `trie::bti_row_index_writer` and `trie::bti_partition_index_writer`. Note: for the row index, we insert the same clustering blocks to both indexes. The logic for choosing the size of the blocks hasn't been changed in any way. Much of this patch has to do with propagating the current range tombstone down to all places which can start a new clustering block. The reason we need that is that, for each clustering block, BIG indexes store the range tombstone succeeding the block (i.e. the range tombstone in between the given block and its successor), whereas BTI indexes store the range tombstone preceding the block (i.e. the range tombstone in between the given block and its predecessor). So before the patch there's no code which looks at the current tombstone when starting the block, only when ending the block. This patch adds an extra copy for each `decorated_key`. This is mostly unavoidable -- the BTI partition writer just has to remember the key until its successor appears, to find the common prefix. (We could avoid the key copy if the BTI isn't used, though. We don't do that in this patch, we just let the copy happen).	2025-09-29 13:01:21 +02:00
Michał Chojnowski	cdcf34b3a0	sstables: create and open BTI index files, when enabled This patch adds code responsible for creation and opening of BTI index components (Rows.db, Partitions.db) when BTI index writing is enabled. (It is enabled if the cluster feature is enabled and the relevant config entry permits it). The files are empty for now, and are never read. We will populate and use them in following patches.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	18875621e8	sstables: introduce Partitions and Rows component types BTI indexes are made up of Partitions.db and Rows.db files. In this patch we introduce the corresponding component types. In Cassandra, BTI is a separate "sstable format", with a new set of versions. (I.e. `bti-da`, as opposed to `big-me`). In this patch series, we are doing something different: we are introducing version `ms`, which is like `me`, except with `Index.db` and `Summary.db` replaced with `Partitions.db` and `Rows.db`. With a setup like that, Scylla won't yet be able to read Cassandra's BTI (`da`) files, because this patch doesn't teach Scylla about `da`. (But the way to that is open. It would just require first implementing several other things which changed between `me` and `da`). (And, naturally, Cassandra will reject `ms` sstables. But this isn't the first time we are breaking file compatibility with Cassandra to some degree. Other examples include encryption and dictionary compression). Note: Partitions.db and Rows.db contain prefixes of keys, which is sensitive information, so they have to be encrypted.	2025-09-29 13:01:21 +02:00
Michał Chojnowski	e04ee6d5f6	sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time` There's a test (boost/sstable_compaction_test.cc::tombstone_purge_test) which tests the value of `_stats.capped_tombstone_deletion_time`. Before this patch, for "ms" sstables, `to_deletion_time` would have been called twice for each written partition tombstone, which would fail the test. Since `_pi_write_m.partition_tombstone` always ends up being converted from `tombstone` to `sstables::deletion_time` anyway, let's just make it a `sstables::deletion_time` to begin with. This ensures that `to_deletion_time` is called only once per partition tombstone.	2025-09-29 13:01:20 +02:00
Piotr Dulikowski	3abe6eadce	Merge 'Add CQL documentation for vector queries using SELECT ANN' from Szymon Wasik This PR adds the missing documentation for the SELECT ... ANN statement that allows performing vector queries. This is just the basic explanation of the grammar and how to use it. More comprehensive documentation about vector search will be added separately in Scylla Cloud documentation and features description. Links to this additional documentation will be added as part of VECTOR-244. Fixes: VECTOR-247. No backport is needed as this is the new feature. Closes scylladb/scylladb#26282 * github.com:scylladb/scylladb: cql3: Update error messages to be in line with documentation. docs: Add CQL documentation for vector queries using SELECT ANN	2025-09-29 12:46:55 +02:00
Dario Mirovic	b3347bcf84	test: dtest: limits_test.py: test_max_cells log level Set `lsa-timing` logger log level to `debug`. This will help with the analysis of the whole spectrum of memory reclaim operation times and memory sizes. Refs #25097	2025-09-29 12:39:53 +02:00
Dario Mirovic	554fd5e801	test: dtest: limits_test.py: make the tests work Remove unused imports and markers. Remove Apache license header. Enable the test in suite.yaml for `dev` and `debug` modes. Refs #25097	2025-09-29 12:39:53 +02:00
Dario Mirovic	70128fd5c7	test: dtest: test_limits.py: remove tests that are not being migrated Refs #25097	2025-09-29 12:39:52 +02:00
Dario Mirovic	82e9623911	test: dtest: copy unmodified limits_test.py Copy limits_test.py from scylla-dtest to test/cluster/dtest/limits_test.py. Add license header. Disable it for `debug`, `dev`, and `release` mode. Refs #25097	2025-09-29 12:39:52 +02:00
Michał Hudobski	eb8d60f5d4	metrics, test: added a test case for vs metrics This commit adds a test case that checks that vector search metrics are correctly added and have the correct value.	2025-09-29 12:29:21 +02:00
Michał Hudobski	fe4bfffca5	metrics, vector_search: add a dns refresh metric This commit adds a dns refresh counting metric to the vector_store service. We would like to track it to make sure that the networking is working correctly.	2025-09-29 12:28:52 +02:00
Michał Hudobski	74becdd04b	vector_search: move the ann implementation to impl The implementation of the ann function should have been placed in the impl struct, not in the client itself. This commit fixes that.	2025-09-29 12:26:42 +02:00
Piotr Dulikowski	3a05df742e	Merge 'Fix for auth version change during node startup' from Marcin Maliszkiewicz Before this patch we may trigger `SCYLLA_ASSERT(legacy_mode(_qp))`. That's because some auth startup is done in the background and assumes that auth version doesn't change in the middle of the startup. But the topology coordinator may decide to do the migration at any time, regardless of whether the auth service is fully started on all nodes. This change makes sure that in the legacy startup flow we'll always use the old auth-v1 keyspace, and therefore an auth version change in the middle won't negatively affect the flow. Fixes https://github.com/scylladb/scylladb/issues/25505 Closes scylladb/scylladb#25949 * github.com:scylladb/scylladb: auth: mark some auth-v1 functions as legacy auth: use old keyspace during auth-v1 consistently auth: document setting _superuser_created_promise flow in auth-v1	2025-09-29 11:41:27 +02:00
Ernest Zaslavsky	debc756794	treewide: Move transport related files to a `transport` directory As requested in #22112 , moved the files and fixed other includes and build system. Moved files: - generic_server.hh - generic_server.cc - protocol_server.hh Fixes: #22112 This is a cleanup, no need to backport Closes scylladb/scylladb#25090	2025-09-29 11:46:06 +03:00
Nadav Har'El	69672a5863	alternator: fix deprecation warning Until recently, Seastar's HTTP server's reply::write_body() only supported a few "well-known" content types. But Alternator uses a lesser known one - "application/x-amz-json-1.0" - so it was forced to use a wrong (but legal) content type, and later override it with the correct one. This was really ugly and we had a comment that once this feature was fixed in Seastar, we should remove the ugly workaround. Well, the time has finally come. We can now finally pass the correct content type to write_body(), and don't need to call the deprecated type-changing function later. The new implementation is less awkward, but actually longer - whereas previously we only set the content type in one place - just before the done(), after this patch we actually need to do it in three places where we write the body (string response, streaming response and error response). But I think this is actually better - there is no inherent reason why, for example, error messages and success messages needed to use the same content type. We use a new constant REPLY_CONSTANT_TYPE so that we don't need to repeat it three times. We already have a regression test for the content-type returned by Alternator, test_manual_requests.py::test_content_type, and this test continues to pass after the patch. But this test only checked the short response path, so we add additional tests for the streaming response path and for the error response path. As usual, the new tests pass on DynamoDB as well. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26268	2025-09-29 11:40:46 +03:00
Pavel Emelyanov	daea284072	Merge 'Make compaction module more self contained' from Botond Dénes There is still some compaction related code left in `sstables/`, move this to `compaction/` to make the compaction module more self-contained. Code cleanup, no backport. Closes scylladb/scylladb#26277 * github.com:scylladb/scylladb: sstables,compaction: move make_sstable_set() implementations to compactions/ sstables,compaction: move compaction exceptions to compaction/	2025-09-29 11:38:30 +03:00
Nadav Har'El	1aef733d48	Merge 'Alternator/cache expressions' from Szymon Malewski Before this patch, every expression in Alternator's requests was parsed from string to adequate structure. This patch enables caching, where input expression strings are mapped to parsed template structures. Every new valid (parsable) expression is added to the cache. The cache has limited (configurable) size - when it is reached, the least recently used entry is removed. When requested expression is in the cache, the copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values). Caching is implemented for all expression types. The cache is per shard - shared for all operations, expression types, tables, users. Default cache size is 2000 entries per shard and it has configuration option `alternator_max_expression_cache_entries_per_shard` (0 means cache disabled). Basic metrics (total count of hits and misses for each expression type and number of evicted entries) are implemented. Cache features are tested in boost unit tests and overall expression caching is tested with Python tests - both mostly rely on metrics. refs #5023 `perf-alternator` test shows improvement (median): \| test \| throughput \| instructions_per_op \| cpu_cycles_per_op \| allocs_per_op \| \| ------ \| ---------------- \| ----------------------------- \| --------------------------- \| ------------------- \| \| read \| +6.0% \| -8.5% \| -7.0% \| -4.9% \| \| write \| +13.4% \| -17.6% \| -14.7% \| -7.4% \| \| write(lwt) \| +12.7% \| -7.9% \| -6.9% \| -2.8% \| \| write_rmw \| +5.4% \| -10.5% \| -7.3% \| -4.1% \| "read" had a ProjectionExpression with 10 column names, "write" had a UpdateExpression with 10 column names and "write_rmw" had both ConditionExpression and UpdateExpression. 
This patch also includes minor refactoring of other expressions related tests (https://github.com/scylladb/scylladb/issues/22494) - use `test_table_ss` instead of `test_table`. Fixes #25855. This is new feature - no backporting. Closes scylladb/scylladb#25176 * github.com:scylladb/scylladb: alternator: use expression caching alternator: adds expression cache implementation utils: extend lru_string_map utils: add lru_string_map alternator/expressions: error on parsing empty update expression alternator/expressions: fix single value condition expression parsing test/alternator: use `test_table_ss` instead of `test_table` in expressions related tests.	2025-09-29 11:36:31 +03:00
Nadav Har'El	8c99f807d6	test/alternator: regression test for DescribeTable's index schema In issue #5320, we reported a bug where DescribeTable returns the wrong schema for a GSI - it returned as a sort key an attribute which the user didn't actually ask to be a sort key, and was only added because of a requirement of Scylla's materialized-views implementation. We already had a test, test_gsi_2_describe_table_schema, that reproduces that issue. But that test only exercised the specific case that we knew had a bug. There is a risk that the fix to #5320 (which was recently merged) will actually break other cases - different combinations of base and GSI keys, or even LSI keys - and we won't have tests reproducing it. So this patch adds comprehensive regression tests for how DescribeTable shows GSIs and LSIs for all possible combinations of base keys and GSI/LSI keys. As we prove in test comments (and in code) we need to test 15 GSIs and 2 LSIs to test every possible combination. These tests aren't very slow, because we only need to create three base tables to test all these combinations. As usual, the new tests pass on DynamoDB. The new GSI test failed on Alternator before #5320 was fixed, but now passes. The fact all of its cases pass shows that the fix to #5320 didn't cause regressions in other types of GSIs or LSIs. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26047	2025-09-29 08:50:30 +03:00
Piotr Dulikowski	7b91dd9297	Merge 'vector_store_client: Add support for multiple URIs' from Karol Nowacki vector_store_client: Add support for multiple URIs The vector store client now supports a comma-separated list of URIs in the `vector_store_primary_uri` configuration option. It uses the vector store nodes from these URIs for load balancing and high availability, querying the next node if the current one fails. References: VECTOR-187 No backport is needed as this is a new feature. Closes scylladb/scylladb#26212 * github.com:scylladb/scylladb: vector_store_client: Rename host_port struct to uri vector_store_client: Add support for multiple URIs vector_store_client: Remove methods used only in tests	2025-09-29 07:40:45 +02:00
Avi Kivity	5fc3ef56c4	build: switch to Seastar API_LEVEL 8 (noncopyable_function in json) Seastar API level 8 changes a function type from std::function to noncopyable_function. Apply those changes in tree and update the build configuration. Closes scylladb/scylladb#26006	2025-09-29 08:33:49 +03:00
Pavel Emelyanov	c029afc6d8	Merge 'test.py: dtest: port cfid_test.py' from Evgeniy Naydanov As a part of the porting process remove unused markers. Explicitly enable auto snapshots for the test, as they are required for it. Enable the test in suite.yaml (run in dev mode only) Closes scylladb/scylladb#26248 * github.com:scylladb/scylladb: test.py: dtest: make cfid_test.py run using test.py As a part of the porting process remove unused markers. test.py: dtest: copy unmodified cfid_test.py	2025-09-29 08:26:21 +03:00
Botond Dénes	6ba1d686e6	sstables,compaction: move make_sstable_set() implementations to compactions/ Various compaction strategies still have their respective make_sstable_set() implementation in sstables/sstable_set.cc. Move them to the appropriate .cc files in compaction/, making the compaction module more self contained.	2025-09-29 06:49:14 +03:00
Botond Dénes	9c85046f93	sstables,compaction: move compaction exceptions to compaction/ sstables/exceptions.hh still hosts some compaction specific exception types. Move them over to the new compaction/exceptions.hh, to make the compaction module more self-contained.	2025-09-29 06:49:14 +03:00
Botond Dénes	2b4a140610	replica: move querier code to replica namespace The query namespace is used for symbols which span the coordinator and replica, or that are mostly coordinator side. The querier is mainly in this namespace due to its similar name, but this is a mistake which confuses people. Now that the code was moved to replica/, also fix the namespace to be namespace replica.	2025-09-29 06:44:52 +03:00
Botond Dénes	ee3d2f5b43	root,replica: mv querier to replica/ The querier object is a confusing one. Based on its name it should be in the query/ module and it is already in the query namespace. But this is actually a completely replica-side logic, implementing the caching of the readers on the replica. Move it to the replica module to make this more clear.	2025-09-29 06:33:53 +03:00
Michał Chojnowski	ff5add4287	sstables/trie: BTI-translate the entire partition key at once Delaying the BTI encoding of partition keys is a good idea, because most of the time they don't have to be encoded. Usually the token alone is enough for indexing purposes. But for the translation of the `partition_key` part itself, there's no good reason to make it lazy, especially after we made the translation of clustering keys eager in a previous commit. Let's get rid of the `std::generator` and convert all cells of the partition key in one go.	2025-09-29 04:10:40 +02:00
Michał Chojnowski	4e35220734	sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset() Using `std::generator` could incur some unnecessary allocation or confuse the optimizer. Let's replace it with something simpler.	2025-09-29 04:10:40 +02:00
Michał Chojnowski	88c9af3c80	sstables/trie: perform the BTI-encoding of position_in_partition eagerly Applying lazy evaluation to the BTI encoding of clustering keys was probably a bad default. The benefits are dubious (because it's quite likely that the laziness won't allow us to avoid that much work), but the overhead needed to implement the laziness is large and immediate. In this patch we get rid of the laziness. We rewrite lazy_comparable_bytes_from_clustering_position so that it performs the translation eagerly, all components to a single bytes_ostream. Note: the name lazy_comparable_bytes_from_clustering_position stays, because the interface is still lazy. perf_bti_key_translation: Before: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6 After: test iterations median mad min max allocs tasks inst cycles lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9	2025-09-29 04:10:40 +02:00
Michał Chojnowski	7d57643361	types/comparable_bytes: add comparable_bytes_from_compound Add a function which converts compound types (keys and key prefixes) to BTI encoding. It's almost the same as the existing `lazy_comparable_bytes_from_compound` (in bti_key_translation.cc), except it eagerly serializes key components to a bytes_ostream instead of lazily yielding them from a generator. We will remove `lazy_comparable_bytes_from_compound` in a later commit.	2025-09-29 04:10:38 +02:00
Michał Chojnowski	3703197c4c	test/perf: add perf_bti_key_translation Add a microbenchmark for translating keys to BTI encoding.	2025-09-29 04:08:00 +02:00
Michael Litvak	6bc41926e2	view_builder: reduce log level for expected aborts during view creation When draining the view builder, we abort ongoing operations using the view builder's abort source, which may cause them to fail with abort_requested_exception or raft::request_aborted exceptions. Since these failures are expected during shutdown, reduce the log level in add_new_view from 'error' to 'debug' for these specific exceptions while keeping 'error' level for unexpected failures. Closes scylladb/scylladb#26297	2025-09-28 22:55:07 +03:00
Avi Kivity	5b40d4d52b	Merge 'root,replica: mv multishard_mutation_query -> replica/multishard_query' from Botond Dénes The code in `multishard_mutation_query.cc` implements the replica-side of range scans and as such it belongs in the replica module. Take the opportunity to also rename it to `multishard_query`, the code implements both data and mutation queries for a long time now. Code cleanup, no backport required. Closes scylladb/scylladb#26279 * github.com:scylladb/scylladb: test/boost: rename multishard_mutation_query_test to multishard_query_test replica/multishard_query: move code into namespace replica replica/multishard_query.cc: update logger name docs/paged-queries.md: update references to readers root,replica: move multishard_mutation_query to replica/	2025-09-28 20:24:46 +03:00
Avi Kivity	5b6570be52	Merge 'db/config: Add SSTable compression options for user tables' from Nikos Dragazis ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size. The same default applies to system tables as well. This series introduces a new configuration option to allow customizing the default for user tables. It also adds some tests for the new functionality. Fixes #25195. Closes scylladb/scylladb#26003 * github.com:scylladb/scylladb: test/cluster: Add tests for invalid SSTable compression options test/boost: Add tests for SSTable compression config options main: Validate SSTable compression options from config db/config: Add SSTable compression options for user tables db/config: Prepare compression_parameters for config system compressor: Validate presence of sstable_compression in parameters compressor: Add missing space in exception message	2025-09-28 20:23:23 +03:00
Artsiom Mishuta	eedd61f43f	test.py: remove 'sudo' from resource_gather.py The container now runs as root (`4c1f4c419c`), so sudo is not needed anymore. Closes scylladb/scylladb#26294	2025-09-28 16:51:19 +03:00
Avi Kivity	1c1e8802d5	Merge 'Fix lifetime problems between group0 and sstable dictionary trainings' from Michał Chojnowski Apparently the group0 server object dies (and is freed) during drain/shutdown, and I didn't take that into account in my https://github.com/scylladb/scylladb/pull/23025, which still attempts to use it afterwards. The patch fixes two problems. The problem with `is_raft_leader` has been observed in tests. The problem with `publish_new_sstable_dict` has not been observed, but AFAIU (based on code inspection) it exists. I didn't attempt to prove its existence with a test. Should be backported to 2025.3. Closes scylladb/scylladb#25115 * github.com:scylladb/scylladb: storage_service: in publish_new_sstable_dict, use _group0_as instead of the main abort source storage_service: hold group0 gate in `publish_new_sstable_dict`	2025-09-28 14:27:37 +03:00
Szymon Malewski	6ce7843774	alternator: use expression caching Before this patch, every expression in Alternator's requests was parsed from string to adequate structure. This patch enables caching - all calls to parse an expression (all types) are proxied through the cache. New expression is added to the cache, the least recently used entry (above cache size) is removed. For existing entries the copy of the template is returned - individual instances still need to be resolved (placeholders substituted with names and values). The cache is per shard - shared for all operations, expression types, tables, users. Default cache size is 2000 entries per shard and it has configuration option `alternator_max_expression_cache_entries_per_shard` (0 means cache disabled). Added Python tests are based on metrics.	2025-09-28 04:27:44 +02:00
Szymon Malewski	b75c6c9ef7	alternator: adds expression cache implementation Every expression in Alternator's requests is parsed from string to adequate structure. This patch implements a caching structure (input expression strings mapping to parsed 'template' structures), which will be used for handling requests in following commits. If the requested expression is valid (parsable) the cache will always return a value - if it is not already in the cache it will be created and stored. The cache has limited (live configurable) size - when it is reached, the least recently used entry is removed. The copy of the template in cache is returned - individual instances still need to be resolved (placeholders substituted with names and values). Invalid requests will have no effect on the cache - the parser throws an exception. Caching is implemented for all expression types. Internally it is based on helper structure `lru_string_map`. Basic metrics (total count of hits and misses for each expression type and number of evictions) are implemented. Metrics are used in boost unit tests.	2025-09-28 04:27:44 +02:00
Szymon Malewski	bb8004e52d	utils: extend lru_string_map This patch extends `lru_string_map` with `sized_string_map` - a class that helps to control cache size. It implements cache resizing in a background thread.	2025-09-28 04:27:33 +02:00
Szymon Malewski	5332ceb24e	utils: add lru_string_map Adds a lru_string_map definition. This structure maps string keys to templated arguments, allowing efficient lookup and adding keys. Each lookup (and adding) puts the keys on an internal LRU list and the entries may be efficiently removed in LRU order. It will be a base for the expression cache in Alternator.	2025-09-28 04:06:00 +02:00
Szymon Malewski	65c95d3a93	alternator/expressions: error on parsing empty update expression With this patch empty update expression is no longer accepted by the parser. So far it was rejected only after resolving, however it could pollute the expression cache.	2025-09-28 04:06:00 +02:00
Szymon Malewski	be159acc03	alternator/expressions: fix single value condition expression parsing Primitive conditions usually use an operator with two or more values. The only case of a "single value" condition is a function call - DynamoDB does not accept other general values (i.e., attribute or value references). In Alternator a single general value was parsed as correct and only failed later when the calculated value ended up not being a boolean. This works, but not when the attribute or value actually is boolean. What is more, when a parsed (but not resolved) expression is cached, this invalid expression could pollute the cache. This would also be the only case where the same string can be parsed both as a condition and a projection expression. The issue is fixed by explicitly checking this case at primitive condition parsing. Updated test confirms consistency between Alternator and DynamoDB. Fixes #25855.	2025-09-28 04:06:00 +02:00
Szymon Malewski	7ed38155a3	test/alternator: use `test_table_ss` instead of `test_table` in expressions related tests. This patch includes minor refactoring of expressions related tests (#22494) - use `test_table_ss` instead of `test_table`.	2025-09-28 04:06:00 +02:00
Botond Dénes	34cc7aafae	tools/scylla-sstable: introduce the upgrade command An offline, scylla-sstable variant of nodetool upgradesstables command. Applies latest (or selected) sstable version and latest schema. Closes scylladb/scylladb#26109	2025-09-27 16:53:14 +03:00
Avi Kivity	24b5d08731	Merge 'Remove table::for_all_partitions_slow()' from Pavel Emelyanov This method was once implemented by calling table::for_all_partitions(), which was supposed to be the non-slow version. Then callers of the "non-slow" method were updated and the method itself was renamed into the "_slow()" one. Nowadays only one test still uses it. At the same time the method itself mostly consists of boilerplate code that moves bits around to call a lambda on the partitions read from the reader. Open-coding the method into the calling test results in much shorter and simpler to follow code. Code cleanup, no backport needed Closes scylladb/scylladb#26283 * github.com:scylladb/scylladb: test: Fix indentation after previous patch test: Opencode for_all_partitions_slow() test: Coroutinize test_multiple_memtables_multiple_partitions inner lambda table: Move for_all_partitions_slow() to test	2025-09-27 16:26:18 +03:00
Karol Nowacki	f8b1addfaf	vector_store_client: Rename host_port struct to uri The `host_port` struct represents the parsed components of the vector store URI. Renaming it to `uri` more accurately reflects its purpose.	2025-09-27 09:04:46 +02:00
Karol Nowacki	27f6459766	vector_store_client: Add support for multiple URIs The vector store client now supports a comma-separated list of URIs in the `vector_store_primary_uri` configuration option. It uses the vector store nodes from these URIs for load balancing and high availability, querying the next node if the current one fails.	2025-09-27 09:04:45 +02:00
Karol Nowacki	a310cb4c64	vector_store_client: Remove methods used only in tests The `vector_store_client::port()` and `vector_store_client::host()` methods were only used in the test code. Moreover, these tests are no longer needed, as the proper parsing of the URI is already tested in other tests that perform requests to the vector store server mock.	2025-09-27 08:47:00 +02:00
Piotr Dulikowski	39145ff1d0	Merge 'vector_store_client: Add support for load balancing' from Karol Nowacki This change introduces a load balancing mechanism for the vector store client. The client can now distribute requests across multiple vector store nodes. The distribution mechanism performs random selection of nodes for each request. References: VECTOR-187 No backport is needed as this is a new feature. Closes scylladb/scylladb#26205 * github.com:scylladb/scylladb: vector_store_client: Add support for load balancing vector_store_client_test: Introduce vs_mock_server vector_store_client_test: Relocate to a dedicated directory	2025-09-26 18:55:14 +02:00
Petr Gusev	c1cc52c8c8	lwt: prohibit for tablet-based views and cdc logs SELECT commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees. We prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based view for compatibility. Similar logic is applied to CDC log tables. Fixes scylladb/scylladb#26258	2025-09-26 17:06:58 +02:00
Szymon Wasik	ccfe80ab97	cql3: Update error messages to be in line with documentation. ANN (approximate nearest neighbor) is just the name of the type of algorithm used to perform vector search. For this reason the error messages should refer to vector queries rather than ANN queries.	2025-09-26 17:01:10 +02:00
Petr Gusev	8adbb6c4dd	tablets: disallow chains of colocated tables	2025-09-26 16:52:43 +02:00
Petr Gusev	b01f56a6d3	database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda In upcoming commits we’ll add a test to ensure that a table cannot be colocated with another table that is itself already colocated. This must also hold in the case where both colocated tables are created simultaneously in a single migration_manager announcement. We use Paxos tables as an example of colocated tables in this test. To support this, get_base_table_for_tablet_colocation needs to look for the base table among the batch of tables being created.	2025-09-26 16:46:32 +02:00
Pavel Emelyanov	04a40b08f7	test: Fix indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-26 16:39:09 +03:00
Pavel Emelyanov	813619f939	test: Opencode for_all_partitions_slow() The method is a large boilerplate that moves stuff around to do a simple thing -- read mutations from the reader in a row and "check" them with a lambda, optionally breaking the loop if the lambda wants it. The whole thing is much shorter if the caller drives the reader on its own. One thing to note -- the reader is not closed if something throws in between, but that's a test anyway; if something throws, the test fails and a non-closed reader is not a big deal. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-26 16:36:58 +03:00
Pavel Emelyanov	c1ebf987a9	test: Coroutinize test_multiple_memtables_multiple_partitions inner lambda The only place where it needs futures is to call the for_all_partitions_slow() from a table Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-26 16:35:59 +03:00
Pavel Emelyanov	f3c57f7dd0	table: Move for_all_partitions_slow() to test It's now only used by a single test, so move it there and remove from public table API. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-26 16:33:25 +03:00
Szymon Wasik	0194a53659	docs: Add CQL documentation for vector queries using SELECT ANN This patch adds the missing documentation for the SELECT ... ANN statement that allows performing vector queries. This is just the basic explanation of the grammar and how to use it. More comprehensive documentation about vector search will be added separately in Scylla Cloud documentation and features description. Links to this additional documentation will be added as part of VECTOR-244. Fixes: VECTOR-247.	2025-09-26 15:07:00 +02:00
Marcin Maliszkiewicz	5f0041d068	auth: mark some auth-v1 functions as legacy	2025-09-26 14:40:53 +02:00
Marcin Maliszkiewicz	793a64a50e	auth: use old keyspace during auth-v1 consistently Before this patch we may trigger an assertion on legacy_mode(_qp). That's because some auth startup is done in the background and assumes that the auth version doesn't change in the middle of the startup. But the topology coordinator may decide to do the migration at any time, regardless of whether the auth service is fully started on all nodes. This change makes sure that in the legacy startup flow we'll always use the old auth-v1 keyspace and therefore an auth version change in the middle won't negatively affect the flow.	2025-09-26 14:40:52 +02:00
Karol Nowacki	a0e62ef8de	vector_store_client: Add support for load balancing This change introduces a load balancing mechanism for the vector store client. The client can now distribute requests across multiple vector store nodes. The distribution mechanism performs random selection of nodes for each request.	2025-09-26 13:44:28 +02:00
Karol Nowacki	ee90530c31	vector_store_client_test: Introduce vs_mock_server Introduce the `vs_mock_server` test class, which is capable of counting incoming requests. This will be used in subsequent tests to verify load balancing logic.	2025-09-26 12:27:06 +02:00
Nikos Dragazis	8410532fa0	test/cluster: Add tests for invalid SSTable compression options Complementary to the previous patch. It triggers semantic validation checks in `compression_parameters::validate()` and expects the server to exit. The tests examine both command line and YAML options. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-26 12:02:42 +03:00
Nikos Dragazis	6ba0fa20ee	test/boost: Add tests for SSTable compression config options Since patch `03461d6a54`, all boost unit tests depending on `cql_test_env` are compiled into a single executable (`combined_tests`). Add the new test in there. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-26 12:02:42 +03:00
Nikos Dragazis	8d5bd212ca	main: Validate SSTable compression options from config `compression_parameters` provides two levels of validation: * syntactic checks - implemented in the constructor * semantic checks - implemented by `compression_parameters::validate()` The former are applied implicitly when parsing the options from the command line or from scylla.yaml. The latter are currently not applied, but they should be. For lack of a better place, apply them in main, right after joining the cluster, to make sure that the cluster features have been negotiated. The feature needed here is the `SSTABLE_COMPRESSION_DICTS`. Validation will fail if the feature is disabled and a dictionary compression algorithm has been selected. Also, mark `validate()` as const so that it can be called from a config object. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-26 12:02:42 +03:00
Nikos Dragazis	e1d9c83406	db/config: Add SSTable compression options for user tables ScyllaDB offers the `compression` DDL property for configuring compression per user table (compression algorithm and chunk size). If not specified, the default compression algorithm is the LZ4Compressor with a 4KiB chunk size (refer to the default constructor for `compression_parameters`). The same default applies to system tables as well. Add a new configuration option to allow customizing the default for user tables. Use the previously hardcoded default as the new option's default value. Note that the option has no effect on ALTER TABLE statements. An altered table either inherits explicit compression options from the CQL statement, or maintains its existing options. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-26 12:02:00 +03:00
Botond Dénes	52c05d89aa	test/boost: rename multishard_mutation_query_test to multishard_query_test	2025-09-26 11:15:38 +03:00
Botond Dénes	3be4f0698f	replica/multishard_query: move code into namespace replica Complete the migration, add code to the replica namespace too.	2025-09-26 11:15:38 +03:00
Botond Dénes	ed50a307db	replica/multishard_query.cc: update logger name To reflect the new file name.	2025-09-26 11:15:38 +03:00
Botond Dénes	8f90137e87	docs/paged-queries.md: update references to readers Both links to reader code are outdated, update them.	2025-09-26 11:15:38 +03:00
Botond Dénes	fb16c0a6d4	root,replica: move multishard_mutation_query to replica/ It belongs there, it is a completely replica-side thing. Also take the opportunity to rename it to multishard_query.{hh,cc}, it is not just mutation anymore (data query is also implemented).	2025-09-26 11:15:38 +03:00
Piotr Dulikowski	68d5dcfa23	Merge 'Coroutinize gossiping property file snitch' from Pavel Emelyanov Most of its then-chains are quite hairy and look much nicer as coroutines. Last patch restores indentation. Code cleanup, no backport required. Closes scylladb/scylladb#26271 * github.com:scylladb/scylladb: snitch: Reindent after previous changes snitch: Make periodic_reader_callback() a coroutine snitch: Coroutinize pause_io() snitch: Coroutinize stop() snitch: Coroutinize reload_configuration() snitch: Coroutinize read_property_file() snitch: Coroutinize start() snitch: Coroutinize property_file_was_modified()	2025-09-26 08:32:19 +02:00
Avi Kivity	0f4363cc8d	Merge 'sstable: add more complete schema to scylla component' from Botond Dénes Sstables store a basic schema in the statistics component. The scylla-sstable tool uses this to be able to read and dump sstables in a self-contained manner, without requiring an external schema source. The problem is that the schema stored in the statistics component is incomplete: it doesn't store column names for key columns, so these have placeholder names in dump outputs where column names are visible. This is not a disaster but it is confusing and it can cause errors in scripts which want to check the content of sstables, while also knowing the schema and expecting the proper names for key columns. To make sstables truly self-contained w.r.t. the schema, add a complete schema to the scylla component. This schema contains the names and types of all columns, as well as some basic information about the schema: keyspace name, table name, id and version. When available, scylla-sstable's schema loader will use this new more complete schema and fall back to the old method of loading the (incomplete) schema from the statistics component otherwise. New feature, no backport required. Closes scylladb/scylladb#24187 * github.com:scylladb/scylladb: test/boost/schema_loader_test: add specific test with interesting types test/lib/random_schema: add random_schema(schema_ptr) constructor test/boost/schema_loader_test: test_load_schema_from_sstable: add fall-back test tools/schema_loader: add support for loading from scylla-metadata tools/schema_loader: extract code which load schema from statistics sstables: scylla_metadata: add schema member	2025-09-26 00:21:17 +03:00
Artsiom Mishuta	f23d19e248	test.py: fix dumping big logs to output 1. Remove dumping cluster logs and print only the link to the log. 2. Fail the test (to fail CI and not ignore the problem) and mark the cluster as dirty (to avoid affecting subsequent tests) in case setup/teardown fails. 3. Add 2 cqlpy tests that fail after applying step 2 to the dirties_cluster list so the cluster is discarded afterward. Closes scylladb/scylladb#26183	2025-09-25 22:36:46 +03:00
Pavel Emelyanov	78d32c52f2	test: Use map_reduce0 in sstable_directory_test.cc (and coroutinize) There's code that tries to accumulate some counter across a sharded service by hand. Using map_reduce0() looks nicer and avoids the smp-safe atomic counter. Also -- coroutinize the thing while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26259	2025-09-25 21:22:12 +03:00
Pavel Emelyanov	56547992b9	snitch: Reindent after previous changes Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	234865d13c	snitch: Make periodic_reader_callback() a coroutine It was a void method called from a timer that spawned a fiber into the background. Now make it a coroutine, but have the caller spawn it into the background. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	3bd4673b03	snitch: Coroutinize pause_io() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	f7608261b4	snitch: Coroutinize stop() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	46860b461c	snitch: Coroutinize reload_configuration() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	90751cd499	snitch: Coroutinize read_property_file() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	04d2502e8f	snitch: Coroutinize start() And merge two if branches that call set_snitch_ready() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:48 +03:00
Pavel Emelyanov	48f7614ea9	snitch: Coroutinize property_file_was_modified() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 18:59:47 +03:00
Michael Litvak	ad1a5b7e42	service/qos: set long timeout for auth queries on SL cache update pass an appropriate query state for auth queries called from service level cache reload. we use the function qos_query_state to select a query_state based on caller context - for internal queries, we set a very long timeout. the service level cache reload is called from group0 reload. we want it to have a long timeout instead of the default 5 seconds for auth queries, because we don't have strict latency requirement on the one hand, and on the other hand a timeout exception is undesired in the group0 reload logic and can break group0 on the node. Fixes scylladb/scylladb#25290	2025-09-25 16:55:29 +02:00
Michael Litvak	3c3dd4cf9d	auth: add query_state parameter to query functions add a query_state parameter to several auth functions that execute internal queries. currently the queries use the internal_distributed_query_state() query state, and we maintain this as default, but we want also to be able to pass a query state from the caller. in particular, the auth queries currently use a timeout of 5 seconds, and we will want to set a different timeout when executed in some different context.	2025-09-25 16:46:50 +02:00
Michael Litvak	a1161c156f	auth: refactor query_all_directly_granted rewrite query_all_directly_granted to use execute_internal instead of query_internal in a style that is more consistent with the rest of the module. This will also be useful for a later change because execute_internal accepts an additional parameter of query_state.	2025-09-25 16:37:04 +02:00
Szymon Wasik	7c4ef9aae7	docs: Add documentation for creating vector search indexes This patch adds CQL documentation about creating vector search indexes. It includes the syntax and description of parameters. It does not cover VECTOR type that is already supported and documented and it does not cover querying vectors which will be covered by a separate PR. Fixes: VECTOR-217 Closes scylladb/scylladb#26233	2025-09-25 14:49:50 +02:00
Pavel Emelyanov	6670090581	Merge 'compaction: move code to namespace compaction' from Botond Dénes The namespace usage in this directory is very inconsistent, with files and classes scattered in: * global namespace * namespace compaction * namespace sstables With cases where all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanical and only the following kinds of changes are made: * replace namespace sstable {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented. Code cleanup, no backport. Closes scylladb/scylladb#26214 * github.com:scylladb/scylladb: compaction: remove using namespace {compaction,sstables} compaction: move code to namespace compaction	2025-09-25 15:10:11 +03:00
Karol Nowacki	c4e13959ab	vector_store_client_test: Relocate to a dedicated directory The `vector_store_client_test` is moved from `test/boost` to a new `test/vector_search` directory. This change prepares a dedicated location for all upcoming tests related to the vector search feature.	2025-09-25 14:04:28 +02:00
Botond Dénes	1999d8e3d3	compaction: remove using namespace {compaction,sstables} Some files in compaction/ have using namespace {compaction,sstables} clauses, some even in headers. This is considered bad practice and muddies the namespace use. Remove them.	2025-09-25 15:03:57 +03:00
Botond Dénes	86ed627fc4	compaction: move code to namespace compaction The namespace usage in this directory is very inconsistent, with files and classes scattered in: * global namespace * namespace compaction * namespace sstables With cases where all three are used in the same file. This code used to live in sstables/ and some of it still retains namespace sstables as a heritage of that time. The mismatch between the dir (future module) and the namespace used is confusing, so finish the migration and move all code in compaction/ to namespace compaction too. This patch, although large, is mechanical and only the following kinds of changes are made: * replace namespace sstable {} with namespace compaction {} * add namespace compaction {} * drop/add sstables:: * drop/add compaction:: * move around forward-declarations so they are in the correct namespace context This refactoring revealed some awkward leftover coupling between sstables and compaction, in sstables/sstable_set.cc, where the make_sstable_set() methods of compaction strategies are implemented.	2025-09-25 15:03:56 +03:00
Pavel Emelyanov	b30c8a1f25	test: Add validation of data returned by /storage_service endpoints The test compares the ranges that are returned from /describe_ring and /range_to_endpoint_map with the information obtained from system.tablets and makes sure that - the number of ranges - the boundary tokens - the target replicas (nodes only) match. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-25 14:53:22 +03:00
Nadav Har'El	265660d22f	test/cluster: greatly speed up test_localnodes_joining_nodes The test cluster/test_alternator::test_localnodes_joining_nodes takes a whopping 2 minutes and 9 seconds to run before this patch. After this patch, it takes just 7 seconds. The slowness of this test was caused by booting a second node that hangs during boot for 2 minutes, deliberately. We never intended for this boot to finish (the whole point of this test is to run before it finishes), but unfortunately had to wait for it to avoid all sorts of nasty problems with unwaited futures. As comments already explained in the code, the solution to this problem is to kill the server at the end of the test - after we kill it, we can wait for it - this wait will very quickly notice that the server addition failed, and not need to wait 2 minutes. But until the previous patch, we had no API to find the server which is starting (not yet running), or to kill it. After the previous patch, we do have such an API, and can now use it, and see this test finish in 7 seconds instead of 2 minutes and 9 seconds.	2025-09-25 14:00:16 +03:00
Nadav Har'El	aa8d6e9e74	test/pylib: add the ability to stop currently-starting servers Some tests need the ability to abruptly stop a server in the test cluster before it fully booted - e.g., because the test knows (and perhaps even expects) that the boot is hung. But before this patch, manager.server_stop() could only kill servers in "running" state. This patch adds to pylib tracking of "starting" servers - servers which we are starting but haven't finished booting - their list can be returned by manager.starting_servers(). The manager.server_stop function can now kill a server which is just starting - not just "running" servers. To avoid breaking existing tests, manager.all_servers() continues to return just running and stopped servers - not "starting" servers. By the way, when a starting server is killed, it is not listed as stopped - it just behaves as a normal failure to add the server, and not as a server which successfully joined the cluster but was later stopped. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-25 14:00:16 +03:00
Lakshmi Narayanan Sreethar	6d3a8e89e7	cmake: introduce `Scylla_WITH_DEBUG_INFO` option The `configure.py` script has an `--debuginfo` option that allows overriding compiler debug information generation, regardless of the build mode. Add a similar option to CMake, and ensure it is set when CMake is invoked from `configure.py` with `--debuginfo` enabled. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26243	2025-09-25 12:49:36 +03:00
Łukasz Paszkowski	62e27e0f77	test_out_of_space_prevention.py: Fix flaky test_node_restart_while_tablet_split test The test starts a 3-node cluster and immediately creates a big file on the first node in order to trigger the out of space prevention to disable compaction, including the SPLIT compaction. In order to trigger a SPLIT compaction, a keyspace with 1 initial tablet is created followed by alter statement with `tablets = {'min_tablet_count': 2}`. This triggers a resize decision that should not finalize due to disabled compaction on the first node. The test is flaky because the keyspace is created with RF=1 and there is no guarantee that the tablet replica will be located on the first node with critical disk utilization. If that is not the case, the split is finalized and the test fails, because it expects the split to be blocked. Change to RF=3. This ensures there is exactly one tablet replica on each node, including the one with critical disk utilization. So SPLIT is blocked until the disk utilization on the first node drops below the critical level. Fixes: https://github.com/scylladb/scylladb/issues/25861 Closes scylladb/scylladb#26225	2025-09-25 11:54:48 +03:00
Nadav Har'El	f65998cd39	alternator: improve error handling of incorrect ARN Before this patch, if an ARN that is passed to Alternator requests like TagResource is well-formatted but points to non-existent table, Alternator returns the unhelpful error: (AccessDeniedException) when calling the TagResource operation: Incorrect resource identifier This patch modifies this error to be: (ResourceNotFoundException) when calling the TagResource operation: ResourceArn 'arn:scylla:alternator:alternator_alternator_Test_ 1758532308880:scylla:table/alternator_Test_1758532308880x' not found This is the same error type (ResourceNotFoundException) that DynamoDB returns in that case - and a more helpful error message. This patch also includes a regression test that checks the error type in this case. The new test fails on Alternator before this patch, and passes afterwards (and also passes on DynamoDB). Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#26179	2025-09-25 11:51:17 +03:00
Botond Dénes	80bc1af05a	test/boost/schema_loader_test: add specific test with interesting types Although the existing random schema test should cover all types, it is good to have targeted tests for more interesting types.	2025-09-25 11:28:35 +03:00
Botond Dénes	f10af4b5eb	test/lib/random_schema: add random_schema(schema_ptr) constructor Allow using the convenient random data generation facilities, for any schema.	2025-09-25 11:28:34 +03:00
Botond Dénes	4c9da11bfb	test/boost/schema_loader_test: test_load_schema_from_sstable: add fall-back test The test now tests loading the schema from the scylla component by default. Force testing the fall-back (read schema from statistics) by deleting the Scylla.db component. Also improve the test by comparing the column names and types, to check that when loaded from the scylla component, the key names are also correct.	2025-09-25 11:28:34 +03:00
Botond Dénes	b85d858f6d	tools/schema_loader: add support for loading from scylla-metadata When available, load the schema from the Scylla component, where the column names of keys are also available. Fall-back to loading the schema from the Statistics component otherwise (previous behaviour).	2025-09-25 11:28:34 +03:00
Botond Dénes	ace2ba06c3	tools/schema_loader: extract code which load schema from statistics Soon there will be an alternative method too: load from scylla-metadata.	2025-09-25 11:28:34 +03:00
Botond Dénes	234f905fa4	sstables: scylla_metadata: add schema member To store the most important schema fields, like id, version, keyspace name, table name and the list of all columns, along with their kind, name and type. This will serve as alternative schema source to the one stored in statistics component. This latter one doesn't store any of the metadata, nor the primary key names (just the types), so it leads to confusion when it is used as schema source for scylla-sstable. This new schema stored in the scylla-metadata component is not intended to be a full-schema, equivalent to the one stored in the schema tables, it is intended to be good enough for scylla-sstable being able to parse sstables in a self-sufficient manner.	2025-09-25 11:28:34 +03:00
Botond Dénes	9908eb3b75	Merge 'Coroutinize filesystem_storage::check_create_links_replay()' from Pavel Emelyanov Shorter and cleaner this way. The method is doing parallel_for_each(some_lambda) and the PR only touches the lambda, the outer invocation is probably not worth it to convert plain return into a co_await Coroutinization enhancement, no need to backport Closes scylladb/scylladb#26188 * github.com:scylladb/scylladb: sstables: Restore indentation after previous patch sstables: Coroutinize filesystem_storage::check_create_links_replay()	2025-09-25 11:05:52 +03:00
Asias He	b31e651657	repair: Always reset node ops progress to 100% upon completion Always set the node ops progress to 100% when the operation finishes, regardless of success or failure. This ensures the progress never remains below 100%, which would otherwise indicate a pending node operation in case of an error. Fixes #26193 Closes scylladb/scylladb#26194	2025-09-25 11:05:52 +03:00
Botond Dénes	50038ef2cc	Merge 'alternator: update references to alternator streams issue' from Michael Litvak update all the references about the issue of tablets support for alternator streams to issue https://github.com/scylladb/scylladb/issues/23838 instead of https://github.com/scylladb/scylladb/issues/16317. The issue https://github.com/scylladb/scylladb/issues/16317 is about support of CDC with tablets, but it is now closed and it didn't address alternator streams. the remaining issues about alternator streams should be addressed as part of https://github.com/scylladb/scylladb/issues/23838, so fix the references in order for them not to be missed. backport is not needed Closes scylladb/scylladb#26178 * github.com:scylladb/scylladb: test/cqlpy/test_permissions: unskip test for tablets alternator: update references to alternator streams issue	2025-09-25 11:05:52 +03:00
Botond Dénes	f7fd12c2f5	Merge 'test: fix test_one_big_mutation_corrupted_on_startup' from Cezar Moise The commitlog in the tests with big mutations was corrupted by overwriting 10 chunks of 1KB with random data, which might not be enough due to randomness and the big size of the commitlog (~65MB). - change `corrupt_file` to overwrite based on a percentage of the file's size instead of a fixed number of chunks - fix typos - cleanup comments for clarity Closes: #25627 Closes scylladb/scylladb#25979 * github.com:scylladb/scylladb: test: cleanup big mutation commitlog tests test: fix test_one_big_mutation_corrupted_on_startup	2025-09-25 11:05:52 +03:00
Botond Dénes	8d0913cdfe	Merge 'Fix: small grammatical changes' from Sayanta Banerjee Fixed some minor grammatical changes in the documentation. Closes scylladb/scylladb#24675 * github.com:scylladb/scylladb: Update docs/features/cdc/cdc-streams.rst Small grammatical changes	2025-09-25 11:05:51 +03:00
Dani Tweig	a10cac9c0a	Update urgent_issue_reminder.yml - run daily The action will run daily, alerting about urgent issues not touched in the last 7 days. Closes scylladb/scylladb#25598	2025-09-25 11:05:51 +03:00
Asias He	4f3d076dab	tablets: Demote set sstables_repaired_at log to debug level This log is too excessive when the tablet count is high. Demote to debug level. Fixes #25926 Closes scylladb/scylladb#26175	2025-09-25 11:05:51 +03:00
Ferenc Szili	d462dc8839	docs: add description of number of tablets computed by tablet allocator This change adds the documentation section which explains the algorithm to compute the absolute number of tablets a table has. Fixes: #25740 Closes scylladb/scylladb#25741	2025-09-25 11:05:51 +03:00
Pavel Emelyanov	8438c59ad3	scylla-gdb: Fix fair-queue entry printing Catching a live entry in IO queue is very rare event, so we haven't seen it so far, but the `_ticket` member had been removed ~2 years ago and had been replaced with `_capacity` which is plain 64bit integer. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26185	2025-09-25 11:05:51 +03:00
Botond Dénes	118561dd4f	Merge 'configure.py: fix passing of user linker flags to cmake' from Lakshmi Narayanan Sreethar Any user provided linker flags are converted to a semicolon separated string by the `configure.py` script and then passed to cmake via the `CMAKE_EXE_LINKER_FLAGS` option. But `CMAKE_EXE_LINKER_FLAGS` expects semicolon separated values only when set from within CMakeLists. When the option is set from the shell, which is the case with `configure.py` executing cmake, the values should be separated by space. So, pass the flags as they are instead of separating them with semicolons. `configure.py` also incorrectly concatenates the user linker flags and internal linker flags without a space in between causing flags like '-gz' and '-fuse-ld=lld' to merge into a single invalid argument. Fix that by using a proper space when concatenating these two flags. Fixes #26232 Fix to a dev build option. Backport not required. Closes scylladb/scylladb#26234 * github.com:scylladb/scylladb: configure.py: fix concatenation of user linker flags configure.py: fix passing of user linker flags to cmake	2025-09-25 11:05:51 +03:00
Lakshmi Narayanan Sreethar	690546fa40	cmake: link vector_search library to cql3 library The `indexed_table_select_statement::actually_do_execute()` method references `vector_search::vector_store_client::ann()`, but the `vector_search` library, which provides its definition, is not linked with the `cql3` library. This causes linker errors when other targets are built, for example linking `comparable_bytes_test`, which links the `types` library that in turn links `cql3` throws the following error : ``` ...error: undefined symbol: vector_search::vector_store_client::ann... ``` Fix by adding `vector_search` to the private link libraries of `cql3`. Fixes #26235 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26237	2025-09-25 11:05:51 +03:00
Pavel Emelyanov	d670e01bab	api: Handle stop compaction endpoint without database help The handler in question calls replica::database's invoke_on_all and then gets compaction manager from local db and finds the table object from it as well. The latter is needed to provide filter function for compaction_manager::stop_compaction() call and stop only compactions for specific table. Using replica::database can be avoided here (that's the part of dropping http_context -> database dependency eventually): - using sharded<compaction_manager> instead, it's c.m. that's needed on all shards, not really the database - don't search for table object on db, instead get table ID from parsed table_info instead to provide the correct filter function (continuation of #25846) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26082	2025-09-25 11:05:50 +03:00
Pavel Emelyanov	8f815de1e0	Merge 'treewide: move away from accessing httpd::request::query_parameters' from Botond Dénes Accessing this member directly is deprecated, migrate code to use {get,set}_query_param() and friends instead. Fixes: https://github.com/scylladb/scylladb/issues/26023 Preparation for seastar update, no backport required. Closes scylladb/scylladb#26024 * github.com:scylladb/scylladb: treewide: move away from accessing httpd::request::query_parameters test/pylib/s3_server_mock.py: better handle empty query params	2025-09-25 11:05:50 +03:00
Łukasz Paszkowski	5f6df4eb97	test/storage: Properly mount/clear volumes Due to a missing functionality in PythonTest, `unshare` is never used to mount volumes. As a consequence: + volumes are created with sudo which is undesired + they are not cleared automatically Even having the missing support in place, the approach with mounting volumes with `unshare` would not work as http server, a pool of clusters, and scylla cluster manager are started outside of the new namespace. Thus cluster would have no access to volumes created with `unshare`. The new approach that works with and without dbuild and does not require sudo, uses the following three commands to mount a volume: truncate -s 100M /tmp/mydevice.img mkfs.ext4 /tmp/mydevice.img fuse2fs /tmp/mydevice.img test/ Additionally, a proper cleanup is performed, i.e. servers are stopped gracefully and volumes are unmounted after the tests using them are completed. Fixes: https://github.com/scylladb/scylladb/issues/25906 Closes scylladb/scylladb#26065	2025-09-25 11:05:50 +03:00
Wojciech Przytuła	e5e913ab8e	Update CPP-RS Driver's link to documentation As the proper documentation of CPP-RS Driver is already there, let's update the links to point to it instead of the GitHub repo. Closes scylladb/scylladb#26089	2025-09-25 11:05:50 +03:00
Marcin Maliszkiewicz	2c6e1402af	auth: document setting _superuser_created_promise flow in auth-v1	2025-09-25 10:05:39 +02:00
Evgeniy Naydanov	eea166c809	test.py: dtest: make cfid_test.py run using test.py As a part of the porting process remove unused markers. Explicitly enable auto snapshots for the test, as they are required for it. Enable the test in suite.yaml (run in dev mode only)	2025-09-25 11:04:00 +03:00
Evgeniy Naydanov	18723b41cf	test.py: dtest: copy unmodified cfid_test.py	2025-09-25 10:33:18 +03:00
Łukasz Paszkowski	29de947851	test_out_of_space_prevention.py: Fix flaky test_user_writes_rejection test The test starts a 3-node cluster and immediately creates a big file on one of the nodes, to trigger the out of space prevention to start rejecting writes on this node. Then a write is executed and the test checks that it did not reach the node with critical disk utilization but reached the remaining nodes (it should, RF=3 is set). However, when not specified, a default LOCAL_ONE consistency level is used. This means that only one node is required to acknowledge the write. After the write, the test checks if the write + did NOT reach the node with critical disk utilization (works) + did reach the remaining nodes This can cause the test to fail sporadically as the write might not yet be on the last node. Use CL=QUORUM instead. Fixes: https://github.com/scylladb/scylladb/issues/26004 Closes scylladb/scylladb#26030	2025-09-25 08:05:45 +03:00
Lakshmi Narayanan Sreethar	5f6e1edd93	configure.py: fix concatenation of user linker flags `configure.py` incorrectly concatenates the user linker flags and internal linker flags without a space in between causing flags like '-gz' and '-fuse-ld=lld' to merge into a single invalid argument. Fix that by using a proper space when concatenating these two flags. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-24 20:56:59 +05:30
Lakshmi Narayanan Sreethar	0c23d4f22d	configure.py: fix passing of user linker flags to cmake Any user provided linker flags are converted to a semicolon separated string by the `configure.py` script and then passed to cmake via the `CMAKE_EXE_LINKER_FLAGS` option. But `CMAKE_EXE_LINKER_FLAGS` expects semicolon separated values only when set from within CMakeLists. When the option is set from the shell, which is the case with `configure.py` executing cmake, the values should be separated by space. So, pass the flags as they are instead of separating them with semicolons. Fixes #26232 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-24 20:40:13 +05:30
Piotr Dulikowski	5d5244abaf	Merge 'vector_store_client: Add support for multiple IPs in DNS responses' from Karol Nowacki vector_store_client: Add support for multiple IPs in DNS responses The DNS resolution logic now processes all IP addresses returned in a DNS response, not just the primary one. The client will iterate through the list of resolved IPs, attempting to query the next one if a request fails. This improves high availability by allowing the client to query other available nodes if one is down. References: VECTOR-187 As this is a new feature no backport is needed. Closes scylladb/scylladb#26055 * github.com:scylladb/scylladb: vector_store_client: Rename HTTP_REQUEST_RETRIES to ANN_RETRIES vector_store_client: Format with clang-format vector_store_client: Add support for multiple IPs in DNS responses vector_store_client_test: Extract `make_vs_server` helper function vector_store_client_test: Ensure cleanup on exception vector_store_client_test: Fix unreliable unavailable port tests	2025-09-24 16:24:19 +02:00
Ferenc Szili	c6c9c316a7	load_balancer: fix std::out_of_bounds when decommissioning with empty nodes Consider the following: The tablet load balancer is working on: - node1: an empty node (no tablets) with a large disk capacity - node2: an empty node (no tablets) with a lower disk capacity than node1 - node3: is being decommissioned and contains tablet replicas In load_balancer::make_internode_plan() the initial destination node/shard is selected like this: // Pick best target shard. auto dst = global_shard_id {target, _load_sketch->get_least_loaded_shard(target)}; load_sketch::get_least_loaded_shard(host_id) calls ensure_node() which adds the host to load_sketch's internal hash maps in case the node was not yet seen by load_sketch. Let's assume dst is a shard on node1. Later in load_balancer::make_internode_plan() we will call pick_candidate() to try to find a better destination node than the initial one: // May choose a different source shard than src.shard or different destination host/shard than dst. auto candidate = co_await pick_candidate(nodes, src_node_info, target_info, src, dst, nodes_by_load_dst, drain_skipped); auto source_tablets = candidate.tablets; src = candidate.src; dst = candidate.dst; If pick_candidate() selects some other empty destination (due to larger capacity: node1) node, and that node has not yet been seen by load_sketch (because it was empty), a subsequent call to load_sketch::pick() will search for the node using std::unordered_map::at(), and because the node is not found it will throw a std::out_of_bounds() exception crashing the load balancer. This problem is fixed by changing load_sketch::populate() to initialize its internal maps with all the nodes which populate()'s arguments filter for. Fixes: #26203 Closes scylladb/scylladb#26207	2025-09-24 15:27:19 +02:00
Pavel Emelyanov	b85673e9b0	test,lib: Add range_to_endpoint_map() method to rest client Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-24 15:44:57 +03:00
Pavel Emelyanov	5746e61a60	api: Indentation fix after previous patches Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-24 15:44:50 +03:00
Pavel Emelyanov	1649808429	storage_service: Get tablet tokens if e.r.m. is per-table Getting all token metadata tokens is not correct, the resulting map will be over-populated. Compare this with effective_ownership() method -- it also gets different tokens for vnodes and tablets cases. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-24 15:42:59 +03:00
Pavel Emelyanov	bac9f200b3	storage_service,api: Get e.r.m. inside get_range_to_address_map() Now it's the caller (API handler) that gets e.r.m. from keyspace or table, and this patch moves this selection into the callee. This is a preparatory change. Next patch will need to pass optional table_id to get_range_to_address_map(), and to make this table_id presence consistent with whether e.r.m. is per table or not, it's simpler to move e.r.m. evaluation into the latter method as well. (indentation in API handler is deliberately left broken) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-24 15:40:14 +03:00
Pavel Emelyanov	0c258187d9	storage_service: Calculate tokens on stack And std::move() it into the callee. No functional changes here, it's here to reduce churn in the next patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-24 15:39:23 +03:00
Avi Kivity	fb5664a1d5	interval: split interval_bound implementation for const references In `f3dccc2215` ("interval: change start()/end() not to return references to data members"), we introduced interval_bound_const_ref as a lightweight alternative to interval_bound that does not carry a T. This was needed because interval no longer contains interval_bound:s. This interval_bound_const_ref was just an interval_bound<const T&>, and converting constructors and operators were added to move between the interval_bound<T> and interval_bound<const T&>. However, these happen to be illegal in C++ and just happened to work in clang 20. Clang 21 tightened its checks and these are now flagged. The problem is that when instantiating interval_bound<const T&> the converting constructor looks like a copy constructor; and while it's behind a constraint (that evaluates to false) the rules don't care about that. Fix this by having a separate interval_bound_const_ref template. The new template is slightly better as it allows assignment (since the payload is a pointer rather than a reference). Not that it's really needed. The C++ rule was reported [1] as too restrictive, but there is no resolution yet. [1] https://cplusplus.github.io/CWG/issues/2837.html Closes scylladb/scylladb#26081	2025-09-24 13:57:21 +02:00
Alex Dathskovsky	5e89a78c8f	raft: refactor can_vote logic and type This PR refactors the can_vote function in the Raft algorithms for improved clarity and maintainability by providing safer strong boolean types to the raft algorithm. Fixes: #21937 Backport: No backport required Closes scylladb/scylladb#25787	2025-09-24 13:55:05 +02:00
Nikos Dragazis	a7e46974d4	db/config: Prepare compression_parameters for config system SSTable compression is currently configurable only per table, via the `compression` property in CREATE/ALTER TABLE statements. This is represented internally via the `compression_parameters` class. We plan to offer the same options via the configuration as well, to make the default compression method for user tables configurable. This patch prepares the ground by making the `compression_parameters` usable as a `config_file::named_value`, namely: * Define an extraction operator (required by `boost::program_options` for parsing the options from command line). * Define a formatter (required by `named_value::operator()`). * Define a template specialization for `config_type_for` (required by `named_value` constructor). * Define a yaml converter (required for parsing the options from scylla.yaml). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-24 14:51:39 +03:00
Nikos Dragazis	ea41f652c4	compressor: Validate presence of sstable_compression in parameters SSTable compression parameters should always define an algorithm via the `sstable_compression` sub-option. Add a check in the constructor to ensure this is always provided (unless no options are given, which is interpreted as "no compression"). This change has no user-visible effect, since the same check is already performed at a higher-level, while validating the CQL properties of CREATE TABLE and ALTER TABLE statements (see `cf_prop_defs::validate()`). However, it will become useful in later patches, when compression config options will be introduced. Although now redundant, keep the sanity check in `cf_prop_defs::validate()` to maintain consistency of error messages with Cassandra. Note also that Cassandra uses 'class' instead of 'sstable_compression' since version 3.11.10, but Scylla still doesn't support this, see: https://github.com/scylladb/scylladb/issues/4200 Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-24 14:50:04 +03:00
Botond Dénes	761a32927e	Merge 'scrub: Handle malformed_sstable_exception in scrub skip mode' from Taras Veretilnyk This PR improves the handling of malformed SSTables during scrub and adds tests to validate the updated behavior. When scrub is used, there is an increased chance of encountering malformed SSTables. These should not be retried as in regular compaction. Instead, they must be handled according to the selected scrub mode: in skip mode, in case of malformed_sstable_exception, invalid data or whole SSTable should be removed, in abort and segregate modes, the scrub process should abort. Previously, all modes treated malformed_sstable_exception the same way, causing scrub to abort even when skip mode was selected. This PR updates the scrub logic to properly handle malformed SSTable exceptions based on the selected mode. Unit tests are added to verify the intended behavior. Fixes scylladb/scylladb#19059 Backport is not required, it is an improvement Closes scylladb/scylladb#25828 * github.com:scylladb/scylladb: sstable_compaction_test: add scrub tests for malformed SSTables scrub: skip sstable on malformed sstable exception in skip mode	2025-09-24 14:28:43 +03:00
Avi Kivity	b67928a457	build: suppress linker pgo hash mismatch warnings Since we generate pgo profiles once a fortnight (and not every build), pgo hash mismatches are expected as the code built diverges from the code measured. The warnings about hash mismatches don't provide any value (they cannot be acted upon) and are only distracting. A possible downside is that we'll miss the pgo training process failing (it is visible in the warnings list getting longer and longer), but that's not a proper indication. Suppress them with the appropriate linker switch. Ref #26010. Closes scylladb/scylladb#26162	2025-09-24 13:07:07 +02:00
Ernest Zaslavsky	5ba5aec1f8	treewide: Move mutation related files to a `mutation` directory As requested in #22104, moved the files and fixed other includes and build system. Moved files: - combine.hh - collection_mutation.hh - collection_mutation.cc - converting_mutation_partition_applier.hh - converting_mutation_partition_applier.cc - counters.hh - counters.cc - timestamp.hh Fixes: #22104 This is a cleanup, no need to backport Closes scylladb/scylladb#25085	2025-09-24 13:23:38 +03:00
Artsiom Mishuta	1493fd1521	test.py: adjust framework verbosity default settings Currently, test.py always runs in verbose mode in non-TTY due to the limitations of the custom reporter TabularConsoleOutput (custom-implemented test reporter in test.py), because in non-verbose mode, it "Use ANSI Escape sequences for manipulating console," which is not possible in non-TTY. So, it also affects the new pure pytest runner (now only boost tests run under it), but it should not, since pytest standard output does not depend on TTY. This commit changes the logic to increase only TabularConsoleOutput verbosity for CI (non-TTY) runs instead of the whole framework. Closes scylladb/scylladb#26202	2025-09-24 12:05:14 +02:00
Avi Kivity	0cfb78ed38	build: drop -fvisibility=hidden compiler flag The -fvisibility=hidden flag removes a shared library's symbols from the dynamic symbol table, reducing the shared object size and the dynamic linking time. In return, the user promises not to rely on the uniqueness of objects declared with exactly the same name in the shared library and the main executable. However, we violate this assumption. In Seastar's noncopyable_function, we compare _vtable to _s_empty_vtable (in operator bool()). The full name of this _s_empty_vtable is seastar::noncopyable_function<Signature>::_s_empty_vtable. Since it can be instantiated in both the main executable and in libseastar.so, the comparison can fail even though we're comparing what is, in C++ view, the address of a unique object to itself. To solve the problem, we can either: - reimplement noncopyable_function::operator bool in a way that does not depend on the object name (for example, 'return _vtable->empty;' where _empty is a boolean initialized to true for the empty vtable), and be careful not to repeat the mistake elsewhere - drop -fvisibility=hidden Here, we choose the second option. The benefit of -fvisibility=hidden is important, but we only use shared libraries in debug mode, and the time spent chasing these apparent one-definition-rule violations is more important than some milliseconds during launch time and a few megabytes in libseastar.so. We do trade it in for -fvisibility-inlines-hidden. This is similar to -fvisibility=hidden, but applies only to addresses of inline functions. Since comparing functions by address is very rare, and inline functions are very common, it seems like a reasonable tradeoff to make. Fixes #25286. Closes scylladb/scylladb#26154	2025-09-24 12:32:35 +03:00
Botond Dénes	1ac7b4c35e	treewide: move away from accessing httpd::request::query_parameters Accessing this member directly is deprecated, migrate code to use {get,set}_query_param() and friends instead. Fixes: https://github.com/scylladb/scylladb/issues/26023	2025-09-24 11:52:15 +03:00
Botond Dénes	5891aeb1fb	test/pylib/s3_server_mock.py: better handle empty query params Instead of re-inventing empty param handling, use the built-in keep_blank_values=True param of the urllib.parse.parse_qs(). This correctly handles the case where the `=` is also present but no value follows, which is the syntax used by the new query_params in seastar::http::request. Also add an exception to build_POST_response(). Better than a cryptic message about encode() not callable on NoneType.	2025-09-24 11:52:15 +03:00
Karol Nowacki	706eeee1bd	vector_store_client: Rename HTTP_REQUEST_RETRIES to ANN_RETRIES Rename `HTTP_REQUEST_RETRIES` to `ANN_RETRIES` in `vector_store_client`, as it now applies to all vector store nodes, not just HTTP requests. Also, remove an unused test setter function.	2025-09-24 10:51:43 +02:00
Karol Nowacki	8411a03f22	vector_store_client: Format with clang-format	2025-09-24 10:41:37 +02:00
Karol Nowacki	57d1b601a8	vector_store_client: Add support for multiple IPs in DNS responses The DNS resolution logic now processes all IP addresses returned in a DNS response, not just the primary one. The client will iterate through the list of resolved IPs, attempting to query the next one if a request fails. This improves high availability by allowing the client to query other available nodes if one is down.	2025-09-24 10:41:37 +02:00
Karol Nowacki	cc616252a4	vector_store_client_test: Extract `make_vs_server` helper function The `make_vs_server` function is refactored into a standalone helper to allow its reuse in upcoming test cases.	2025-09-24 10:41:37 +02:00
Karol Nowacki	6da598fa4a	vector_store_client_test: Ensure cleanup on exception Move the mock/test server shutdown into a `finally()` block to guarantee cleanup even if the test case throws an exception.	2025-09-24 10:41:37 +02:00
Karol Nowacki	381586f1b8	vector_store_client_test: Fix unreliable unavailable port tests The `generate_unavailable_localhost_port` function is not robust because it can suffer from a race condition. It finds an available port but does not keep it occupied, meaning another process could bind to it before the test can use it. The `unavailable_server` helper is a more robust solution. It creates a server that listens on a port for its entire lifetime and immediately closes any incoming connections. This guarantees the port remains unavailable, making the test more reliable.	2025-09-24 10:23:24 +02:00
Piotr Dulikowski	bfb8e807be	Merge 'streaming/stream_blob: generate view updates from staging sstables' from Michał Jadwiszczak After https://github.com/scylladb/scylladb/pull/22034, staging status of sstables streamed via file streaming was ignored and view updates were never generated. This patch fixes it and now staging sstables are registered to `view_building_worker`. Then, the worker creates view building tasks for those sstables, so the view building coordinator can schedule them once the tablet migration is finished. Fixes https://github.com/scylladb/scylla-enterprise/issues/4572 This fix affects only views on tablets, which are still experimental, so no backport needed. Closes scylladb/scylladb#25776 * github.com:scylladb/scylladb: test/test_view_building_coordinator: add reproducer for file streaming streaming/stream_blob: register staging sstables to process them	2025-09-24 09:15:33 +02:00
Lakshmi Narayanan Sreethar	f2308a2ce5	compaction/twcs: use on_internal_error for invalid timestamp resolution When the `time_window_compaction_strategy::to_timestamp_type` encounters an invalid timestamp resolution it throws an `std::runtime_error`. This patch replaces it with `on_internal_error()` and also logs the invalid timestamp resolution for easier debugging. Refs #25913 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26138	2025-09-24 09:14:16 +03:00
Wojciech Mitros	eb92f50413	hinted_handoff: drain hints after the target node stops owning tokens When a node is being replaced, it enters a "left" state while still owning tokens. Before this patch, this is also the time when we start draining hints targeted to this node, so the hints may get sent before the token ownership gets migrated to another replica, and these hints may get lost. In this patch we postpone the hint draining for the "left" nodes to the time when we know that the target nodes no longer hold ownership of any tokens - so they're no longer referenced in topology. I'm calling such nodes "released". Before this change, when we were starting draining hints, we knew the IP addresses of the target nodes. We lose this information after entering "left" stage, so when draining hints after a node is "released", we can't drain the hints targeted to a specific IP instead of host_id. We may have hints targeted to IPs if the migration from IP-based to host_ID-based hints didn't happen yet. The migration happens when enabling a cluster feature since 2024.2.0, so such hints can only exist if we perform a direct upgrade from a version before 2024.2.0 to a version that has this change (2025.4.0+). To avoid losing hints completely when such an upgrade is combined with a node removal/replacement, we still drain hints when the node enters a "left" state and the migration of hints to host_id wasn't performed yet. For these drains, the problematic scenario can't occur because it only affects tablets, and when upgrading from a version before 2024.2.0, no tablets can exist yet. If we perform such a drain, we no longer need to drain hints when entering the "released" state, so we only drain when entering that state if the migration was already completed. With this setup, we'll always drain hints at least once when a node is leaving. 
However, if the migration to host_ids finishes between entering the "left" state and the "released" state, we'll attempt to drain the hints twice. This shouldn't be a problem though because each `drain_for()` is performed with the `_drain_lock` and after a `hint_endpoint_manager` is drained, it's removed, so we won't try to drain it twice. This patch also includes a test for verifying that hints are properly replayed after a node replace. Fixes https://github.com/scylladb/scylladb/issues/24980 Closes scylladb/scylladb#24981	2025-09-24 07:11:59 +02:00
Michał Chojnowski	b76716c8aa	tools/schema_loader: disable tablet-related restrictions in the placeholder keyspace Passing `0` as the `initial_tablets` argument causes `schema_loader`'s placeholder keyspace to be a tablet keyspace. This causes `scylla sstable` to reject some table schemas which are legitimate in this context. For example, `scylla sstable` refuses to work with sstables which contain `counter` columns, because tablets don't support counters. This is undesirable. Let's make `schema_loader`'s keyspace a non-tablet keyspace. Closes scylladb/scylladb#26192	2025-09-24 06:55:28 +03:00
Botond Dénes	7c6fb131f3	Merge 'compaction: ensure that all compaction executors are stopped' from Aleksandra Martyniuk Currently, while stopping the compaction_manager, we stop task_manager compaction module and concurrently run compaction_manager::really_do_stop. really_do_stop stops and waits for all task_executors that are kept in compaction_manager::_tasks, but nothing ensures that no more tasks will be added there. Due to leftover tasks, we trigger on_fatal_internal_error. Modify the order of compaction_manager::stop. After the change, we stop compaction tasks in the following order: - abort module abort source; - close module gate in the background; - stop_ongoing_compactions (kept in compaction_manager::_tasks); - wait until module gate is closed. Check module abort source before creating compaction executor and adding it to _tasks. Thanks to the above, we can be sure that: - after module::stop there will be no tasks in _tasks; - compaction_manager::stop aborts all tasks; we don't wait for any whole compaction to finish. Fixes: https://github.com/scylladb/scylladb/issues/25806. Fixes shutdown bug; needs backports to all versions Closes scylladb/scylladb#25885 * github.com:scylladb/scylladb: compaction: move _tasks check compaction: stop compaction module in really_do_stop	2025-09-24 06:49:52 +03:00
Lakshmi Narayanan Sreethar	82c95699ea	types/comparable_bytes: add compatibility test data for DateType Byte comparable encoding for DateType was introduced in `bf90018b8e`. This PR updates the compatibility test data to include the type in the test coverage. Refs #19407 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26208	2025-09-24 06:42:24 +03:00
Aleksandra Martyniuk	48bbe09c8b	test: fix test_two_tablets_concurrent_repair_and_migration_repair_writer_level test_two_tablets_concurrent_repair_and_migration_repair_writer_level waits for the first node that logs info about repair_writer using asyncio.wait. The done group is never awaited, so we never learn about the error. The test itself is incorrect and the log about repair_writer is never printed. We never learn about that and the test finishes successfully after a 10-minute timeout. Fix the test: - disable hinted handoff; - repair tablets of the whole table: - new table is added so that concurrent migration is possible; - use wait_for_first_completed that awaits done group; - do some cleanups. Remove nightly mark. Fixes: #26148. Closes scylladb/scylladb#26209	2025-09-24 06:40:45 +03:00
Avi Kivity	2239474a87	Merge 'tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true' from Tomasz Grabiec Greatly improves performance of plan making, because we don't consider candidates in other racks, most of which will fail to be selected due to replication constraints (no rack overload). Also (but minor) reduces the overhead of candidate evaluation, as we don't have to evaluate rack load. Enabled only for rf_rack_valid_keyspaces because such setups guarantee that we will not need (because we must not) move tablets across racks, and we don't need to execute the general algorithm for the whole DC. Tested with perf-load-balancing, which performs a single scale-out operation on a cluster which initially has 10 nodes 88 shards each, 2 racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new nodes (same shard count). Time to rebalance the cluster (plan making only, sum of all iterations, no streaming): Before: 16 min 25 s After: 0 min 25 s Before, plan making cost (single incremental iteration) alternated between fast (0.1 [s]) and slow (14.1 [s]): testlog - Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0) testlog - Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0) The slow run chose min and max nodes in different racks, hence the fast path failed to find any candidates and we switched to exhaustive search of candidates in other nodes. After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls. 
Fixes #26016 Closes scylladb/scylladb#26017 * github.com:scylladb/scylladb: test: perf: perf-load-balancing: Add parallel-scaleout scenario test: perf: perf-load-balancing: Convert to tool_app_template tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true	2025-09-23 22:45:35 +03:00
Tomasz Grabiec	981592bca5	tablet: scheduler: Do not emit conflicting migrations in the plan Plan-making is invoked independently for different DCs (and in the future, racks) and then plans are merged. It could be that the same tablets are selected for migration in different DCs. Only one migration will prevail and be committed to group0, so it's not a correctness problem. Next cycle will recognize that the tablet is in transition and will not be selected by plan-maker. But it makes plan-making less efficient. It may also surprise consumers of the plan, like we saw in #25912. So we should make plan-maker be aware of already scheduled transitions and not consider those tablets as candidates. Fixes #26038 Closes scylladb/scylladb#26048	2025-09-23 22:40:08 +03:00
Nikos Dragazis	1106157756	compressor: Add missing space in exception message Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-23 21:04:22 +03:00
Łukasz Paszkowski	5089ffe06f	tools: toolchain: add e2fsprogs, fuse3 to the dependencies The packages contain filesystem utilities to create volumes such that sudo/unshare are not required. Closes #26135 [avi: regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz ] Closes scylladb/scylladb#26165	2025-09-23 18:49:37 +03:00
Botond Dénes	f9172c934a	Merge 'comparable_bytes: handle counter type' from Lakshmi Narayanan Sreethar Byte comparable format is not supported for counter types. This patch adds explicit handling for them for completeness, allowing the default abstract type handler to be removed. Refs #19407 New features. No need to backport. Closes scylladb/scylladb#26206 * github.com:scylladb/scylladb: types/comparable_bytes: remove default abstract type handler types/comparable_bytes: handle counter type	2025-09-23 18:23:49 +03:00
Pavel Emelyanov	f6e8a14fb0	Update seastar submodule Includes fix for scylla-gdb -- the fair_queue is now hierarchical and priority queues are no longer accessible via _handles member. However, the fix is incomplete -- it silently assumes that the hierarchy is flat (and it _is_ flat now, scylla doesn't yet create nested groups) but it should be handled eventually * seastar b6be384e...c8a3515f (8): > Merge 'Nested scheduling groups (IO classes)' from Pavel Emelyanov test: Add test case for wake-from-idle accumulator fixups test: Add fair_queue test for queues activations test: Expand fair queue random run test with groups test: Add test for basic fair-queue nested linkage test: Cleanup scheduling groups after io_queue_test cases code: Update IO priority group shares from supergroup shares change io_queue: Register class groups in fair-queue fair_queue: Add test class friendship fair_queue: Move nr_classes into group_data fair_queue: Fix indentation after previous patch fair_queue: Implement hierarchical queue activation (wakeup) fair_queue: Remove now unused push/pop helpers fair_queue: Implement hierarchical priority_entry::pop_front() fair_queue: Implement hierarchical priority_entry::top() fair_queue: Link priority_entries into tree fair_queue: Add priority_class_group_data::reserve() fair_queue: Move more bits onto priority_entry fair_queue: Move shares on priority_entry fair_queue: Move last_accumulated on priority_class_group_data fair_queue: Introduce priority_class_group_data fair_queue: Inherit priority_class_data from priority_entry fair_queue: Rename priority_class_ptr ioqueue: Opencode get_class_info() helper ioq: Move fair_queue class registration down > tls: Rework session termination > http/request: get_query_param(): add default_value argument > http/request: add has_query_param() > sharded: Deprecate distributed alias > io_tester: Allow configuring sloppy_size_hint for files > file: Remove duplicating static_assert-ions Signed-off-by: Pavel Emelyanov 
<xemul@scylladb.com> Closes scylladb/scylladb#26143	2025-09-23 18:20:45 +03:00
Michał Jadwiszczak	73ce19939e	test/test_view_building_coordinator: add reproducer for file streaming The test reproduces issue scylladb/scylla-enterprise#4572. Before the fix, file-streaming didn't respect staging status of a sstable and view updates weren't generated, leading to base-view inconsistency. The test creates the inconsistency in the view, triggers file-streaming of staging sstables and verifies that the updates are generated.	2025-09-23 15:34:42 +02:00
Michał Jadwiszczak	0f3827d509	streaming/stream_blob: register staging sstables to process them After scylladb/scylladb#22034, staging status of sstables streamed via file streaming was ignored and view updates were never generated. This patch fixes it and now staging sstables are registered to `view_building_worker`. Then, the worker creates view building tasks for those sstables, so the view building coordinator can schedule them once the tablet migration is finished. Fixes scylladb/scylla-enterprise#4572	2025-09-23 15:34:42 +02:00
Patryk Jędrzejczak	da44d6af09	Merge 'Move some compaction manager API handlers from storage_service.cc to tasks.cc' from Pavel Emelyanov There's a bunch of /storage_service/... endpoints that start compaction manager tasks and wait for it. Most of them have async peer in /tasks/... that start the very same task, but return to the caller with the task ID. This patch moves those handlers' code from storage_service.cc to tasks.cc, next to the corresponding async peers, to keep handlers that need compaction_manager in one place. That's preparation for more future changes. Later all those endpoints will stop using database from http_context and will capture the compaction_manager they need from main, like it was done in #20962 for /compaction_manager/... endpoints. Even "more later", the former and the latter blocks of endpoints will be registered and unregistered together, e.g. like database endpoints were collected in one reg/unreg sequence by #25674. Part of http_context dependencies cleanup effort, no need to backport. Closes scylladb/scylladb#26140 * https://github.com/scylladb/scylladb: api: Move /storage_service/compact to tasks.cc api: Move /storage_service/keyspace_upgrade_sstables to tasks.cc api: Move /storage_service/keyspace_offstrategy_compaction to tasks.cc api: Move /storage_service/keyspace_cleanup to tasks.cc api: Move /storage_service/keyspace_compaction to tasks.cc	2025-09-23 15:08:48 +02:00
Taras Veretilnyk	b81caf3f54	sstable_compaction_test: add scrub tests for malformed SSTables Add unit tests for scrub behavior with malformed SSTables: - sstable_scrub_abort_mode_malformed_sstable_test(verifies scrub aborts on malformed SSTables) - sstable_scrub_skip_mode_malformed_sstable_test(verifies scrub skips malformed SSTables without aborting)	2025-09-23 14:34:09 +02:00
Taras Veretilnyk	796bdfc9f7	scrub: skip sstable on malformed sstable exception in skip mode Previously, malformed_sstable_exception during scrub was handled the same way in abort, skip and segregate modes, causing the scrub process to abort even when skip was specified. This commit updates the behavior to correctly handle malformed_sstable_exception in skip mode by removing invalid data or the whole malformed SSTable instead of aborting the entire scrub.	2025-09-23 14:34:09 +02:00
Aleksandra Martyniuk	97c77d7cd5	compaction: move _tasks check In compaction_manager::really_do_stop we check whether _tasks list is empty after the compactions are stopped. However, a new task may still sneak in, causing the assertion failure. Such a task won't be there for long - module::make_task will fail as the module is already stopped. Move the assertion, that checks if _tasks is empty, after the compaction_states' gates are closed. Fixes: #25806.	2025-09-23 14:22:19 +02:00
Aleksandra Martyniuk	17707d0e6b	compaction: stop compaction module in really_do_stop Currently, compaction::task_manager_module is stopped in compaction_manager::stop, concurrently to really_do_stop. We can't predict the order of the two. Do not set _task_manager_module to nullptr at stop, because compaction_manager::really_do_stop() may be called before the actual shutdown, while other components still try to use it. compaction::task_manager_module does not keep a pointer to compaction_manager, so we won't end up with memory leak. Stop compaction module in really_do_stop, after ongoing compactions are stopped. It's a preparation for further patches.	2025-09-23 14:21:15 +02:00
Lakshmi Narayanan Sreethar	0914978605	types/comparable_bytes: remove default abstract type handler Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-23 13:44:59 +05:30
Lakshmi Narayanan Sreethar	ee1e648a7f	types/comparable_bytes: handle counter type Byte comparable format is not supported for counter types. This patch adds explicit handling for them for completeness, allowing the default abstract type handler to be removed in the next patch. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-23 13:44:59 +05:30
Andrzej Jackowski	c8f45dbbb2	test: speed up test_long_query_timeout_erm `test_long_query_timeout_erm` is slow because it has many parameterized variants, and it verifies timeout behavior during ERM operations, which are slow by nature. This change speeds the test up by roughly 3× (319s -> 114s) by: - Removing two of the five scenarios that were near duplicates. - Shortening timeout values to reduce waiting time. - Parallelizing waiting on server_log with asyncio.TaskGroup(). The two removed scenarios (`("SELECT", True, False)`, `("SELECT_WHERE", True, False)`) were near duplicates of the `("SELECT_COUNT_WHERE", True, False)` scenario, because all three scenarios use a non-mapreduce query and trigger basically the same system behavior. It is sufficient to keep only one of them, so the test verifies three cases: - One with nodes shutdown - One with mapreduce query - One with non-mapreduce query Fixes: scylladb/scylla#24127 Closes scylladb/scylladb#25987	2025-09-23 10:28:07 +03:00
Piotr Dulikowski	482ddfb3b4	Merge 'mv: handle mismatched base/view replica count caused by RF change' from Wojciech Mitros During an ALTER KEYSPACE statement execution where a table with a view is present, we need to perform tablet migrations for both tables. These migrations are not synchronized, so at some point the base may have a different number of non-pending replicas than the view. Because of that, we can't pair them correctly. If there are more non-pending base replicas than view replicas, we don't need to do anything because the view replica that didn't finish migrating is a pending replica and will get view updates from all base replicas. But if there are more non-pending view replicas than base replicas, we may currently lose view updates to the new view replica. This patch adds a workaround for this scenario. If after one migration we have more non-pending view replicas than base replicas, we add it to the pending replica list so that it gets an update anyway. This patch will also take effect if the base and view replica counts differ due to some other bug. To track that, a new metric is added to count such occurrences. This patch also includes a test for this exact scenario, which is enforced by an injection. Fixes https://github.com/scylladb/scylladb/issues/21492 Closes scylladb/scylladb#24396 * github.com:scylladb/scylladb: mv: handle mismatched base/view replica count caused by RF change mv: save the nodes used for pairing calculations for later reuse mv: move the decision about simple rack-aware pairing later	2025-09-23 08:10:08 +02:00
Dawid Mędrek	35f7d2aec6	db/batchlog: Drop batch if table has been dropped If there are pending mutations in the batchlog for a table that has been dropped, we'll keep attempting to replay them but with no success -- `db::no_such_column_family` exceptions will be thrown, and we'll keep trying again and again. To prevent that, we drop the batch in that case just like we do in the case of a non-existing keyspace. A reproducer test has been included in the commit. It fails without the changes in `db/batchlog_manager.cc`, and it succeeds with them. Fixes scylladb/scylladb#24806 Closes scylladb/scylladb#26057	2025-09-23 07:48:59 +02:00
Tomasz Grabiec	2b03a69065	test: perf: perf-load-balancing: Add parallel-scaleout scenario Simulates rebalancing on a single scale-out involving simultaneous addition of multiple nodes per rack. Default parameters create a cluster with 2 racks, 70 tables, 256 tablets/table, 10 nodes, 88 shards/node. Adds 6 nodes in parallel (3 per rack). Current result on my laptop: testlog - Rebalance took 21.874 [s] after 82 iteration(s)	2025-09-23 00:31:31 +02:00
Tomasz Grabiec	0dcaaa061e	test: perf: perf-load-balancing: Convert to tool_app_template To support sub-commands for testing different scenarios. The current scenario is given the name "rolling-add-dec".	2025-09-23 00:30:38 +02:00
Tomasz Grabiec	c9f0a9d0eb	tablets: scheduler: Balance racks separately when rf_rack_valid_keyspaces is true Greatly improves performance of plan making, because we don't consider candidates in other racks, most of which will fail to be selected due to replication constraints (no rack overload). Also (but minor) reduces the overhead of candidate evaluation, as we don't have to evaluate rack load. Enabled only for rf_rack_valid_keyspaces because such setups guarantee that we will not need (because we must not) move tablets across racks, and we don't need to execute the general algorithm for the whole DC. Tested with perf-load-balancing, which performs a single scale-out operation on a cluster which initially has 10 nodes 88 shards each, 2 racks, RF=2, 70 tables, 256 tablets per table. Scale out adds 6 new nodes (same shard count). Time to rebalance the cluster (plan making only, sum of all iterations, no streaming): Before: 16 min 25 s After: 0 min 25 s Before, plan making cost (single incremental iteration) alternated between fast (0.1 [s]) and slow (14.1 [s]): Rebalance iteration 7 took 14.156 [s]: mig=88, bad=88, first_bad=17741, eval=93874484, skiplist=0, skip: (load=0, rack=17653, node=0) Rebalance iteration 8 took 0.143 [s]: mig=88, bad=88, first_bad=88, eval=865407, skiplist=0, skip: (load=0, rack=0, node=0) The slow run chose min and max nodes in different racks, hence the fast path failed to find any candidates and we switched to exhaustive search of candidates in other nodes. After, all iterations are fast (0.1 [s] per rack, 0.2 [s] per plan-making). The plan is twice as large because it combines the output of two subsequent (pre-patch) plan-making calls. Fixes #26016	2025-09-23 00:30:37 +02:00
Patryk Jędrzejczak	a56115f77b	test: deflake driver reconnections in the recovery procedure tests All three tests could hit https://github.com/scylladb/python-driver/issues/295. We use the standard workaround for this issue: reconnecting the driver after the rolling restart, and before sending any requests to local tables (that can fail if the driver closes a connection to the node that restarted last). All three tests perform two rolling restarts, but the latter ones already have the workaround. Fixes #26005 Closes scylladb/scylladb#26056	2025-09-22 17:21:06 +02:00
Pavel Emelyanov	bc72a637bd	sstables: Restore indentation after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-22 17:49:11 +03:00
Pavel Emelyanov	d97d595827	sstables: Coroutinize filesystem_storage::check_create_links_replay() Inner lambda only. The outer is a single parallel_for_each, probably not worth it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-22 17:48:42 +03:00
Andrzej Jackowski	15e71ee083	test: audit: stop using datetime.datetime.now() in syslog converter `line_to_row` is a test function that converts `syslog` audit log to the format of `table` audit log so tests can use the same checks for both types of audit. Because `syslog` audit doesn't have `date` information, the field was filled with the current date. This behavior broke the tests running at 23:59:59 because `line_to_row` returned different results on different days. Fixes: scylladb/scylladb#25509 Closes scylladb/scylladb#26101	2025-09-22 15:31:33 +03:00
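The failure mode described in this commit can be illustrated with a small Python sketch. The helper names `convert_line_with_clock` and `convert_line` are hypothetical and are not the real `line_to_row` implementation; the sketch only shows why stamping a row with the wall-clock date makes two conversions of the same line disagree across midnight, and how taking the date as an argument avoids that. How the actual patch fills the field is not stated in the message.

```python
import datetime


def convert_line_with_clock(line, clock):
    """Flaky variant: the date field is read from a clock at call time."""
    return {"date": clock().date(), "message": line}


def convert_line(line, date):
    """Deterministic variant: the caller supplies the date explicitly."""
    return {"date": date, "message": line}


# Simulate two conversions of the same line on either side of midnight.
before_midnight = datetime.datetime(2025, 9, 21, 23, 59, 59)
after_midnight = before_midnight + datetime.timedelta(seconds=2)

flaky_first = convert_line_with_clock("audit entry", lambda: before_midnight)
flaky_second = convert_line_with_clock("audit entry", lambda: after_midnight)

# With an explicit date, both conversions produce identical rows.
fixed_date = datetime.date(2025, 9, 21)
stable_first = convert_line("audit entry", fixed_date)
stable_second = convert_line("audit entry", fixed_date)
```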
Pavel Emelyanov	b23aab882a	Merge 'test/alternator: multiple fixes for tests so they would pass on DynamoDB' from Nadav Har'El Issue #26079 noted that multiple Alternator tests are failing when run against DynamoDB. This pull request fixes many of them, in several small patches. In one case we need to avoid a DynamoDB bug that wasn't even the point of the original test (and we create a new test specifically for that DynamoDB bug). Another test exposed a real incompatibility with Alternator (#26103) but didn't need to be exposed in this specific test so again we split the test to one that passes, and another one that xfails on Alternator (not on DynamoDB). A bigger change had to be made to the tags feature test - since August 2024, the TagResource operation became asynchronous which broke our tests, so we fix this. Each of these changes is described in more detail in the individual patches. Refs #26079. It doesn't fix it completely because there are some tests which remain flaky, and some tests which, surprisingly, pass on us-east-1 but fail on eu-north-1. We'll need to address the rest later. No backports needed, we only run tests against DynamoDB from master (when we rarely do...), not on old branches. Closes scylladb/scylladb#26114 * github.com:scylladb/scylladb: test/alternator: fix test_list_tables_paginated on DynamoDB test/alternator: fix tests in test_tag.py on DynamoDB test/alternator: fix test_health_only_works_for_root_path on DynamoDB test/alternator: reproducer tests for faux GSI range key problem test/alternator: fix test "test_17119a" to pass on DynamoDB test/alternator: fix test to pass on DynamoDB	2025-09-22 15:30:40 +03:00
Pavel Emelyanov	f6860d1de0	Merge 'mv: run view building worker fibers in streaming group' from Piotr Dulikowski The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong. Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit. No need to backport, view build coordinator is not a part of any release yet. Closes scylladb/scylladb#26122 * github.com:scylladb/scylladb: mv: fix typo in start_backgroud_fibers mv: run view building worker fibers in streaming group	2025-09-22 15:28:38 +03:00
Pavel Emelyanov	ce8dd798a2	Merge 'tools/scylla-sstable-scripts: introduce purgeable.lua and writetime-histogram.lua' from Botond Dénes `purgeable.lua` was written for a specific investigation a few years ago. `writetime-histogram.lua` is an sstable script transcription of the former scylla-sstable writetime-histogram command. This was also written for an investigation (before script command existed) and is too specific to be a native command, so was removed by `edaf67edcb`. Add both scripts to the sample script library, they can be useful, either for a future investigation, or as samples to copy+edit to write new scripts (and train AI). New sstable scripts, no backport Closes scylladb/scylladb#26137 * github.com:scylladb/scylladb: tools/scylla-sstable-scripts: introduce writetime-histogram.lua tools/scylla-sstable-scripts: introduce purgable.lua	2025-09-22 15:27:49 +03:00
Avi Kivity	29032213c8	test: avoid #include <boost/test/included/...> The boost/test/included/... directory is apparently internal and not intended for user consumption. Including it caused a One-Definition-Rule violation, due to boost/test/impl/unit_test_parameters.ipp containing code like this: ```c++ namespace runtime_config { // UTF parameters std::string btrt_auto_start_dbg = "auto_start_dbg"; std::string btrt_break_exec_path = "break_exec_path"; std::string btrt_build_info = "build_info"; std::string btrt_catch_sys_errors = "catch_system_errors"; std::string btrt_color_output = "color_output"; std::string btrt_detect_fp_except = "detect_fp_exceptions"; std::string btrt_detect_mem_leaks = "detect_memory_leaks"; std::string btrt_list_content = "list_content"; ``` This is defining variables in a header, and so can (and in fact does) create duplicate variable definitions, which later cause trouble. So far, we were protected from this trouble by -fvisibility=hidden, which caused the duplicate definitions to be in fact not duplicate. Fix this by correcting the include path away from <boost/test/included/>. Closes scylladb/scylladb#26161	2025-09-22 15:26:06 +03:00
Wojciech Mitros	d9b8278178	mv: handle mismatched base/view replica count caused by RF change During an ALTER KEYSPACE statement execution where a table with a view is present, we need to perform tablet migrations for both tables. These migrations are not synchronized, so at some point the base may have a different number of non-pending replicas than the view. Because of that, we can't pair them correctly. If there are more non-pending base replicas than view replicas, we don't need to do anything because the view replica that didn't finish migrating is a pending replica and will get view updates from all base replicas. But if there are more non-pending view replicas than base replicas, we may currently lose view updates to the new view replica. This patch adds a workaround for this scenario. If after one migration we have more non-pending view replicas than base replicas, we add the new view replica to the pending replica list so that it gets an update anyway. This patch will also take effect if the base and view replica counts differ due to some other bug. To track that, a new metric is added to count such occurrences. This patch also includes a test for this exact scenario, which is enforced by an injection. Fixes https://github.com/scylladb/scylladb/issues/21492	2025-09-22 12:50:16 +02:00
Wojciech Mitros	59c40a2edd	mv: save the nodes used for pairing calculations for later reuse In get_view_natural_endpoint() we start with the list of host_ids from the effective replication maps, which we later translate to locator::node to get the information about racks and datacenters. We check all replicas, but we only store the ones relevant for pairing, so for tablets, the ones in the same DC as the replica sending the update. In the next patch, we'll occasionally need to send cross-dc view updates, so to avoid computing the nodes again, in this patch we adjust the logic to prepare them in advance and save them so that they can be later reused.	2025-09-22 12:45:24 +02:00
Wojciech Mitros	9d4449a492	mv: move the decision about simple rack-aware pairing later We'll need to get the lists for the whole dc when fixing replica count mismatches caused by RF changes, so let's first get these lists, and only filter them later if we decide to use simple rack-aware pairing.	2025-09-22 12:45:24 +02:00
Anna Stuchlik	b18b052d26	doc: remove n1-highmem instances from Recommended Instances	2025-09-22 12:40:36 +02:00
Nadav Har'El	b205e1a3da	Merge 'vector_store_client: Extract DNS logic into a dedicated class' from Karol Nowacki Vector search related implementation moved to a new module vector_search. As the vector search functionality is going to be extended, it is better to keep it in a separate module. The DNS resolution logic and its background task are moved out of the `vector_store_client` and into a new, dedicated class `vector_search::dns`. This refactoring is the first step towards supporting DNS hostnames that resolve to multiple IP addresses. References: VECTOR-187 No backport needed as this is refactoring. Closes scylladb/scylladb#26052 * github.com:scylladb/scylladb: vector_store_client_test: Verify DNS is not refreshed when disabled vector_store_client: Extract DNS logic into a dedicated class vector_search: Apply clang-format vector_store_client: Move to vector_search module	2025-09-22 13:24:34 +03:00
Michael Litvak	beb11760e0	test/cqlpy/test_permissions: unskip test for tablets the test was skipped for tablets because CDC wasn't supported with tablets, but now it is supported and the issue is closed, so the test should be unskipped.	2025-09-22 10:03:32 +02:00
Michael Litvak	65351fda29	alternator: update references to alternator streams issue update all the references about the issue of tablets support for alternator streams to issue #23838 instead of #16317. The issue #16317 is about support of CDC with tablets, but it is now closed and it didn't address alternator streams. the remaining issues about alternator streams should be addressed as part of #23838, so fix the references in order for them not to be missed.	2025-09-22 09:56:23 +02:00
Avi Kivity	1258e7c165	Revert "Merge 'transport: service_level_controller: create and use `driver` service level' from Andrzej Jackowski" This reverts commit `fe7e63f109`, reversing changes made to `b5f3f2f4c5`. It is causing test.py failures around cqlpy. Fixes #26163 Closes scylladb/scylladb#26174	2025-09-22 09:32:46 +03:00
Piotr Dulikowski	b382531d99	Merge 'cdc: fix create table with cdc if not exists' from Michael Litvak Fix an issue where executing a CREATE TABLE IF NOT EXISTS statement with CDC enabled fails with an error if the table already exists. Instead, the query should succeed and be a no-op. This regression was introduced by commit `fed1048059`. Previously, when executing the query, we would first check if the table exists in do_prepare_new_column_families_announcement. If it did, we would throw an already_exists_exception, which was handled correctly; otherwise, we would continue and create the CDC table in the before_create_column_families notification. The order of operations was changed in `fed1048059`, causing the regression. Now, we first create the CDC schema and add it to the schema list for creation, and then check for each of them if they already exist. The problem is that when we create the CDC schema in on_pre_create_column_families, it also checks if the CDC table already exists. If it does, it throws an invalid_request_exception, which is not caught and handled as expected. This patch restores the previous order of operations: we first check if the tables exist, and only then add the CDC schema in pre_create. Fixes https://github.com/scylladb/scylladb/issues/26142 no backport - recent regression, not released yet Closes scylladb/scylladb#26155 * github.com:scylladb/scylladb: test: add test for creating table with CDC enabled if not exists cdc: fix create table with cdc if not exists	2025-09-22 08:18:26 +02:00
Piotr Dulikowski	591a67c7e7	Merge 'view_builder: register view on all shards atomically' from Michael Litvak When the view builder starts to build a new view, each shard registers itself by writing the shard id and current token to the scylla_views_builds_in_progress table. Previously, this happened independently by each shard. We change it now to register all shards "atomically" - when a shard registers itself, it also registers all other shards with an empty status, if they aren't registered yet. This ensures that we don't have a partial state in the table where only some of the shards are registered, but we always have a status for all shards. The reason we want to register all shards atomically is that if it happens that only some of the shards were registered, then we restart and load the status from table, this doesn't work well for multiple reasons. One example is that to know how many shards we had previously, we take the maximum shard id we see in the table. If it's different than the current shard count, we will execute the reshard code. But of course, if the last shard is missing from the table because it didn't register itself, this calculation will be wrong, and we can't know the previous number of shards. This is a problem because suppose we have two shards, and shard 0 finished building the view but shard 1 didn't start. When we come up, we will think that previously we had only a single shard and it completed building everything, when in fact we built only half the view approximately. The problem is that we don't have enough information in the tables to know that. There are additional problems related to reshard. In the reshard function, whether it is executed because we actually do node reshard or because we calculated the wrong number of previous shards, if the status of some shard is missing then the calculation of new ranges will be wrong. When some shard didn't make progress we should start building the view from scratch. 
However, this doesn't happen if we don't have a status for the shard, because the code looks only for shards that have a status. In effect, this shard is considered complete even though it didn't start. This could cause the view building to get stuck or complete without building all token ranges. Registering all shards atomically should solve the above problems because we will always have statuses for all shards. Fixes https://github.com/scylladb/scylladb/issues/22989 backport not needed - the issue is probably not common and there's a workaround Closes scylladb/scylladb#25790 * github.com:scylladb/scylladb: test: mv: add a test for view build interrupt during registration view_builder: register view on all shards atomically	2025-09-22 08:03:44 +02:00
Karol Nowacki	6bd1d7db49	vector_store_client_test: Verify DNS is not refreshed when disabled Extend the `vector_store_client_uri_update_to_empty` test case to verify that the DNS resolver stops refreshing when the vector store is disabled.	2025-09-22 08:02:59 +02:00
Karol Nowacki	27219b8b7c	vector_store_client: Extract DNS logic into a dedicated class The DNS resolution logic and its background task are moved out of the `vector_store_client` and into a new, dedicated class `vector_search::dns`. This refactoring is the first step towards supporting DNS hostnames that resolve to multiple IP addresses. Signed-off-by: Karol Nowacki <karol.nowacki@scylladb.com>	2025-09-22 08:01:53 +02:00
Karol Nowacki	7cc7b95681	vector_search: Apply clang-format Run clang-format on the vector_search module to fix minor formatting inconsistencies.	2025-09-22 08:01:50 +02:00
Karol Nowacki	eae71d3e91	vector_store_client: Move to vector_search module Vector search related implementation moved to a new module vector_search. As the vector search functionality is going to be extended, it is better to keep it in a separate module.	2025-09-22 08:01:47 +02:00
Ferenc Szili	d9f272dbdd	load_balancer: fix badness object creation The load balancer introduced the idea of badness, which is a measure of how a tablet migration affects table balance on the source and destination. This is an abbreviated definition of the badness struct: struct migration_badness { double src_shard_badness = 0; double src_node_badness = 0; double dst_shard_badness = 0; double dst_node_badness = 0; ... double node_badness() const { return std::max(src_node_badness, dst_node_badness); } double shard_badness() const { return std::max(src_shard_badness, dst_shard_badness); } }; A negative value for any of these 4 members signifies a good migration (improves table balance), and a positive value signifies a bad migration. In two places in the balancer, badness for source and destination is computed independently in two objects of type migration_badness (src_badness and dst_badness), and later combined into a single object similar to this: return migration_badness{ src_badness.shard_badness(), src_badness.node_badness(), dst_badness.shard_badness(), dst_badness.node_badness() }; This is a problem: when, for instance, source shard badness is good (less than 0), shard_badness() will return 0 because of std::max(). This way the actual computed badness is not set in the final object. This can lead to incorrect decisions made later by the balancer, when it searches for the best migration among a set of candidates. Closes scylladb/scylladb#26091	2025-09-21 21:37:23 +02:00
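The masking effect described in this commit can be reproduced with a small Python model of the quoted C++ struct. The class `MigrationBadness` mirrors the abbreviated definition from the message; `combine_via_members` is only an assumed alternative that copies the computed members directly, and the actual patch may differ.

```python
from dataclasses import dataclass


@dataclass
class MigrationBadness:
    """Python model of the abbreviated C++ struct quoted in the commit message."""
    src_shard_badness: float = 0.0
    src_node_badness: float = 0.0
    dst_shard_badness: float = 0.0
    dst_node_badness: float = 0.0

    def node_badness(self) -> float:
        return max(self.src_node_badness, self.dst_node_badness)

    def shard_badness(self) -> float:
        return max(self.src_shard_badness, self.dst_shard_badness)


def combine_via_accessors(src, dst):
    """Combination from the message: max() against the unset (0) side hides negative values."""
    return MigrationBadness(src.shard_badness(), src.node_badness(),
                            dst.shard_badness(), dst.node_badness())


def combine_via_members(src, dst):
    """Assumed alternative (not taken from the patch): copy the computed members as-is."""
    return MigrationBadness(src.src_shard_badness, src.src_node_badness,
                            dst.dst_shard_badness, dst.dst_node_badness)


# Source side is "good" (negative); destination node side is also "good".
src = MigrationBadness(src_shard_badness=-0.5, src_node_badness=-0.25)
dst = MigrationBadness(dst_shard_badness=0.125, dst_node_badness=-0.75)

lossy = combine_via_accessors(src, dst)   # negative values collapse to 0
exact = combine_via_members(src, dst)     # negative values are preserved
```

In this model every negative input comes out as 0 in `lossy`, so two candidates that improve balance by different amounts become indistinguishable, which is the incorrect-decision risk the message describes.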
Dawid Mędrek	0d2560c07f	test/perf/tablet_load_balancing.cc: Create nodes within one DC In `789a4a1ce7`, we adjusted the test file to work with the configuration option `rf_rack_valid_keyspaces`. Part of the commit was making the two tables used in the test replicate in separate data centers. Unfortunately, that destroyed the point of the test because the tables no longer competed for resources. We fix that by enforcing the same replication factor for both tables. We still accept different values of replication factor when provided manually by the user (by `--rf1` and `--rf2` commandline options). Scylla won't allow for creating RF-rack-invalid keyspaces, but there's no reason to take away the flexibility the user of the test already has. Fixes scylladb/scylladb#26026 Closes scylladb/scylladb#26115	2025-09-21 21:36:43 +02:00
Tomasz Grabiec	ddbcea3e2a	tablets: scheduler: Run plan-maker in maintenance scheduling group Currently, it runs in the gossiper scheduling group, because it's invoked by the topology coordinator. That scheduling group has the same amount of shares as user workload. Plan-making can take significant amount of time during rebalancing, and we don't want that to impact user workload which happens to run on the same shard. Reduce impact by running in the maintenance scheduling group. Fixes #26037 Closes scylladb/scylladb#26046	2025-09-21 18:44:57 +03:00
Tomasz Grabiec	4a83b4eef3	Merge 'topology_coordinator: abort view building a bit later in case of tablet migration' from Piotr Dulikowski In a multi-DC setup, the tablet load balancer might generate multiple migrations of the same tablet_id but only one is actually committed to the `system.tablets` table. This PR moves the abortion of view building tasks from the very start of the migration (`<no tablet transition> -> allow_write_both_read_old`) to the next step (`allow_write_both_read_old -> write_both_read_old`). This way, we'll abort only tasks for which the tablet migration was actually started. The PR also includes a reproducer test. Fixes scylladb/scylladb#25912 View building coordinator hasn't been released yet, so no backport is needed. Closes scylladb/scylladb#26144 * github.com:scylladb/scylladb: test/test_view_building_coordinator: add reproducer topology_coordinator: abort view building a bit later in case of tablet migration	2025-09-21 15:41:53 +02:00
Karol Nowacki	eedf506be5	vector_store_client: Rename vector_store_uri to vector_store_primary_uri The configuration setting vector_store_uri is renamed to vector_store_primary_uri according to the final design. In the future, the vector_store_secondary_uri setting will be introduced. This setting now also accepts a comma-separated list of URIs to prepare for future support for redundancy and load balancing. Currently, only the first URI in the list is used. This change must be included before the next release. Otherwise, users will be affected by a breaking change. References: VECTOR-187 Closes scylladb/scylladb#26033	2025-09-21 16:33:10 +03:00
Michael Litvak	3dffb8e0dc	test: mv: add a test for view build interrupt during registration Add a new test that reproduces issue #22989. The test starts view building and interrupts it by restarting the node while some shards registered their status and some didn't.	2025-09-21 10:39:30 +02:00
Michael Litvak	6043409c31	view_builder: register view on all shards atomically When the view builder starts to build a new view, each shard registers itself by writing the shard id and current token to the scylla_views_builds_in_progress table. Previously, this happened independently by each shard. We change it now to register all shards "atomically" - when a shard registers itself, it also registers all other shards with an empty status, if they aren't registered yet. This ensures that we don't have a partial state in the table where only some of the shards are registered, but we always have a status for all shards. The reason we want to register all shards atomically is that if it happens that only some of the shards were registered, then we restart and load the status from table, this doesn't work well for multiple reasons. One example is that to know how many shards we had previously, we take the maximum shard id we see in the table. If it's different than the current shard count, we will execute the reshard code. But of course, if the last shard is missing from the table because it didn't register itself, this calculation will be wrong, and we can't know the previous number of shards. This is a problem because suppose we have two shards, and shard 0 finished building the view but shard 1 didn't start. When we come up, we will think that previously we had only a single shard and it completed building everything, when in fact we built only half the view approximately. The problem is that we don't have enough information in the tables to know that. There are additional problems related to reshard. In the reshard function, whether it is executed because we actually do node reshard or because we calculated the wrong number of previous shards, if the status of some shard is missing then the calculation of new ranges will be wrong. When some shard didn't make progress we should start building the view from scratch. 
However, this doesn't happen if we don't have a status for the shard, because the code looks only for shards that have a status. In effect, this shard is considered complete even though it didn't start. This could cause the view building to get stuck or complete without building all token ranges. Registering all shards atomically should solve the above problems because we will always have statuses for all shards. Fixes scylladb/scylladb#22989	2025-09-21 10:39:05 +02:00
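The shard-count problem described in this commit can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the view builder's code: the status table is a plain dict from shard id to token, `None` stands for an empty status, and the "maximum shard id plus one" rule is inferred from the message's two-shard example.

```python
def infer_previous_shard_count(statuses):
    """Modelled rule: previous shard count = highest registered shard id + 1."""
    return max(statuses) + 1 if statuses else 0


def register_independently(statuses, shard, token):
    """Old behaviour (modelled): a shard records only its own row."""
    statuses[shard] = token


def register_atomically(statuses, shard, token, shard_count):
    """New behaviour (modelled): also add empty rows for shards not yet registered."""
    for other in range(shard_count):
        statuses.setdefault(other, None)  # None stands for an empty status
    statuses[shard] = token


SHARD_COUNT = 2

# Only shard 0 manages to register before the node restarts.
old_table = {}
register_independently(old_table, shard=0, token=100)

new_table = {}
register_atomically(new_table, shard=0, token=100, shard_count=SHARD_COUNT)

old_guess = infer_previous_shard_count(old_table)  # undercounts the shards
new_guess = infer_previous_shard_count(new_table)  # sees every shard
```

With independent registration the restart sees a single row and concludes there was one shard; with atomic registration the empty row for shard 1 keeps the count correct and marks that shard as not yet started.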
Evgeniy Naydanov	85cbe7a8d4	test: add test for creating table with CDC enabled if not exists Check if there are no errors on the second attempt of executing "create table if not exists" query if CDC is enabled.	2025-09-21 09:38:36 +02:00
Michael Litvak	5a7e6e53ff	cdc: fix create table with cdc if not exists Fix an issue where executing a CREATE TABLE IF NOT EXISTS statement with CDC enabled fails with an error if the table already exists. Instead, the query should succeed and be a no-op. This regression was introduced by commit `fed1048059`. Previously, when executing the query, we would first check if the table exists in do_prepare_new_column_families_announcement. If it did, we would throw an already_exists_exception, which was handled correctly; otherwise, we would continue and create the CDC table in the before_create_column_families notification. The order of operations was changed in `fed1048059`, causing the regression. Now, we first create the CDC schema and add it to the schema list for creation, and then check for each of them if they already exist. The problem is that when we create the CDC schema in on_pre_create_column_families, it also checks if the CDC table already exists. If it does, it throws an invalid_request_exception, which is not caught and handled as expected. This patch restores the previous order of operations: we first check if the tables exist, and only then add the CDC schema in pre_create. Fixes scylladb/scylladb#26142	2025-09-21 09:38:36 +02:00
Michał Hudobski	1690e5265a	vector search: correct column name formatting This patch corrects the column name formatting whenever an "Undefined column name" exception is thrown. Until now we used the `name()` function which returns a bytes object. This resulted in a message with a garbled ascii bytes column name instead of a proper string. We switch to the `text()` function that returns a sstring instead, making the message readable. Tests are adjusted to confirm this behavior. Fixes: VECTOR-228 Closes scylladb/scylladb#26120	2025-09-20 07:02:53 +02:00
Michał Jadwiszczak	2aabf8ee3f	test/test_view_building_coordinator: add reproducer Adds a test which reproduces the issue described in scylladb/scylladb#25912. The test creates a situation where a single tablet is replicated across multiple DCs / racks, and all those tablet replicas are eligible for migration. The tablet load balancer is unpaused at that moment which currently causes it to attempt to generate multiple migrations for different tablet replicas of the same tablet. Before the fix for #25912, this used to confuse the view build coordinator which would react to each migration attempt, pausing view building work for each tablet replica for which there was an attempt to migrate but only unpausing it for the tablet replica that was actually migrated. After the fix, the view build coordinator only reacts to the migration that has "won" so the test successfully passes.	2025-09-19 19:08:34 +02:00
Michał Jadwiszczak	50c5354d0b	topology_coordinator: abort view building a bit later in case of tablet migration In a multi-DC setup, the tablet load balancer might generate multiple migrations of the same tablet_id but only one is actually committed to the `system.tablets` table. This patch moves the abortion of view building tasks from the very start of the migration (`<no tablet transition> -> allow_write_both_read_old`) to the next step (`allow_write_both_read_old -> write_both_read_old`). This way, we'll abort only tasks for which the tablet migration was actually started. Fixes scylladb/scylladb#25912	2025-09-19 18:02:41 +02:00
Michał Chojnowski	9e70df83ab	db: get rid of sstables-format-selector Our sstable format selection logic is weird, and hard to follow. If I'm not misunderstanding, the pieces are: 1. There's the `sstable_format` config entry, which currently doesn't do anything, but in the past it used to disable cluster features for versions newer than the specified one. 2. There are deprecated and unused config entries for individual versions (`enable_sstables_mc_format`, `enable_sstables_md_format`, etc). 3. There is a cluster feature for each version: ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc. (Currently all sstable version features have been grandfathered, and aren't checked by the code anymore). 4. There's an entry in `system.scylla_local` which contains the latest enabled sstable version. (Why? Isn't this directly derived from cluster features anyway)? 5. There's `sstable_manager::_format` which contains the sstable version to be used for new writes. This field is updated by `sstables_format_selector` based on cluster features and the `system.scylla_local` entry. I don't see why those pieces are needed. Version selection has the following constraints: 1. New sstables must be written with a format that supports existing data. For example, range tombstones with an infinite bound are only supported by sstables since version "mc". So if a range tombstone with an infinite bound exists somewhere in the dataset, the format chosen for new sstables has to be at least as new as "mc". 2. A new format might only be used after a corresponding cluster feature is enabled. (Otherwise new sstables might become unreadable if they are sent to another node, or if a node is downgraded). 3. The user should have a way to inhibit format upgrades if he wishes. So far, constraint (1) has been fulfilled by never using formats older than the newest format ever enabled on the node. (With an exception for resharding and reshaping system tables). 
Constraint (2) has been fulfilled by calling `sstable_manager::set_format` only after the corresponding cluster feature is enabled. Constraint (3) has been fulfilled by the ability to inhibit cluster features by setting `sstable_format` to some fixed value. The main thing I don't like about this whole setup is that it doesn't let me downgrade the preferred sstable format. After a format is enabled, there is no way to go back to writing the old format again. That is no good -- after I make some performance-sensitive changes in a new format, it might turn out to be a pessimization for the particular workload, and I want to be able to go back. This patch aims to give a way to downgrade formats without violating the constraints. What it does is: 1. The entry in `system.scylla_local` becomes obsolete. After the patch we no longer update or read it. As far as I understand, the purpose of this entry is to prevent unwanted format downgrades (which is something cluster features are designed for) and it's updated if and only if relevant cluster features are updated. So there's no reason to have it, we can just directly use cluster features. 2. `sstable_format_selector` gets deleted. Without the `system.scylla_local` around, it's just a glorified feature listener. 3. The format selection logic is moved into `sstable_manager`. It already sees the `db::config` and the `gms::feature_service`. For the foreseeable future, the knowledge of enabled cluster features and current config should be enough information to pick the right formats. 4. The `sstable_format` entry in `db::config` is no longer intended to inhibit cluster features. Instead, it is intended to select the format for new sstables, and it becomes live-updatable. 5. 
Instead of writing new sstables with "highest supported" format, (which used to be set by `sstables_format_selector`) we write them with the "preferred" format, which is determined by `sstable_manager` based on the combination of enabled features and the current value of `sstable_format`. Closes scylladb/scylladb#26092 [avi: Pavel found the reason for the scylla_local entry - it predates stable storage for cluster features]	2025-09-19 16:17:56 +03:00
Pavel Emelyanov	d1626dfa86	api: Move /storage_service/compact to tasks.cc This one doesn't have async peer there, but it's still a pure compaction manager endpoint handler Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-19 13:23:59 +03:00
Pavel Emelyanov	6eaa2138ad	api: Move /storage_service/keyspace_upgrade_sstables to tasks.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-19 13:23:54 +03:00
Pavel Emelyanov	fe2a184713	api: Move /storage_service/keyspace_offstrategy_compaction to tasks.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-19 13:23:49 +03:00
Pavel Emelyanov	607a39acbd	api: Move /storage_service/keyspace_cleanup to tasks.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-19 13:23:44 +03:00
Pavel Emelyanov	abd23bdd6d	api: Move /storage_service/keyspace_compaction to tasks.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-19 13:23:37 +03:00
Pavel Emelyanov	a1ea553fe1	code: Replace distributed<> with sharded<> The latter is recommended in seastar, and the former was left as compatibility alias. Latest seastar explicitly marks it as deprecated so once the submodule is updated, compilation logs will explode. Most of the patch is generated with for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done and a small manual change in test/perf/perf.hh Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#26136	2025-09-19 12:22:51 +02:00
Aleksandra Martyniuk	5235e3cf67	test: limit test_streaming_deadlock_removenode concurrency test_streaming_deadlock_removenode starts 10240 inserts at once, overloading a node. As a result, the test fails with a timeout. Limit the concurrency of the inserts. Fixes: #25945. Closes scylladb/scylladb#26102	2025-09-19 12:50:20 +03:00
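One common way to cap the number of in-flight requests in an asyncio-based test is a semaphore, sketched below. This is a generic illustration and not the code from the patch: `run_limited` and `fake_insert` are hypothetical names, the limit of 16 is arbitrary, and the message does not say which mechanism the test actually uses.

```python
import asyncio


async def run_limited(jobs, limit):
    """Run coroutine factories with at most `limit` in flight.

    Returns (results in submission order, peak number of concurrent jobs).
    """
    semaphore = asyncio.Semaphore(limit)
    in_flight = 0
    peak = 0

    async def guarded(job):
        nonlocal in_flight, peak
        async with semaphore:          # blocks once `limit` jobs are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await job()
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(guarded(job) for job in jobs))
    return results, peak


async def fake_insert(i):
    """Stand-in for a single insert; yields to the event loop once."""
    await asyncio.sleep(0)
    return i


# 200 pretend inserts, never more than 16 running at the same time.
jobs = [lambda i=i: fake_insert(i) for i in range(200)]
results, peak = asyncio.run(run_limited(jobs, limit=16))
```

All jobs are still scheduled up front, but only `limit` of them hold the semaphore at any moment, so the node under test sees a bounded load instead of the whole batch at once.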
Botond Dénes	92f614dc5a	tools/scylla-sstable-scripts: introduce writetime-histogram.lua Produces a histogram with the writetime (timestamp) of the data in the sstable(s). The histogram is printed to the output, along with general stats about the processed data.	2025-09-19 11:54:01 +03:00
Botond Dénes	d298da5410	tools/scylla-sstable-scripts: introduce purgable.lua Collects and prints statistics about how much data is purgeable in an sstable. Works only with tombstone_gc = {'mode': 'timeout'}. Can help diagnose the efficiency (or lack thereof) of tombstone-gc.	2025-09-19 11:53:30 +03:00
Ernest Zaslavsky	e56081d588	treewide: seastar module update and fix broken rest client start using `write_body` in `rest/client` to properly set headers due to changes applied to seastar's http client Seastar module update ``` b6be384e Merge 'http: generalize Content-Type setting' from Nadav Har'El 74472298 http: generalize request's Content-Type setting 9fd5a1cc http: generalize reply's Content-Type setting a2665f38 memory: Remove deprecated enable_abort_on_allocation_failure() d2a5a8a9 resource.cc: Remove some dead code 7ad9f424 http: Add support of multiple key repetitions for the request a636baca task: Move task::get_backtrace() definition in its class a0101efa Fixed "doxygen" spelling in error message db969482 Merge 'http/reply: introduce set_cookie()' from Botond Dénes 5357b434 http/reply: introduce set_cookie() 1ddcf05f http/reply: make write_reply*() public 4b782d73 http/connection: start_response(): fix indentation 720feca0 http/reply: encapsulate reply writing in write_reply() 3e19917d Merge 'exceptions: log thrown and propagated exception with distinct log levels' from Botond Dénes db9aea93 Merge 'Correctly wrap up abandoned yielding directory lister' from Pavel Emelyanov dbb2bf3f test: Add test for input_stream::read_exactly() a5308ec9 file/directory_lister: Correctly wrap up fallback generator 4f0811f4 file/directory_lister: Convert on-stack queue to shared pointer 59801da7 tests: Add directory lister early drop cases 33233032 http/reply: s/write_reply_to_connection/write_reply/ 69b93620 http/reply: write_reply_{to_connection,headers}(): pass output stream 56e9bda7 test: Convert directory_test into seastar test 96782358 Merge 'Improve io_tester's seqwrite and append workloads' from Pavel Emelyanov 8b46e3d4 SEASTAR_ASSERT: assert to stderr and flush stream 3370e22a tutorial.md: use current_exception_as_future() e977453a Add fixture support for seastar::testing 3e70d7f7 io_tester: Do not set append_is_unlikely unconditionally 2a4ae7b4 io_tester: Count 
file size overflows 5e678bb5 io_tester: Tuneup size overflow check d5dad8ce io_tester: Move position management code to io_class_data 5586a056 io_tester: Rename seqwrite -> overwrite 92df2fb2 io_tester: Relax return value of create_and_fill_file() 03d9500d io_tester: Dont fill file for APPEND d6844a7b io_tester: Indentation fix after previous patch fb9e0088 io_tester: Coroutinize create_and_fill_file() 2f802f57 exceptions: log thrown and propagated exception with distinct log levels 4971fa70 util: move log-level into own header 39448fc1 Merge 'Fix and tune http::request setup by client' from Pavel Emelyanov 52d0c4fb iostream: Move output_stream::write(scattered_message) lower 7a52f734 Merge 'read_first_line: Missing pragma and licence' from Ernest Zaslavsky d0881b7e read_first_line: Add missing license boilerplate 988a0e99 read_first_line:: Add missing `#pragma once` 42675266 http: Make client::make_request accept const request& c7709fb5 http: Make request making API return exceptional future not throw b68ed89b http: Move request content length header setup 1d96dac6 http: Move request version configuration 072e86f6 http: Setup request once ``` Closes scylladb/scylladb#25915 (cherry picked from commit `44d34663bc`) Closes scylladb/scylladb#26100	2025-09-19 11:40:59 +03:00
Botond Dénes	37e46f674d	Merge 'nodetool: ignore repair request error of colocated tables' from Michael Litvak when cluster repair is run for an entire keyspace, nodetool makes a repair api request for each table. if the keyspace contains colocated tables, then the api request for the colocated tables will fail, because currently scylla doesn't allow making repair requests for specific colocated tables, but only for base tables. if the request is to repair an entire keyspace then we can ignore this, because we will make a repair request for all base tables, and this in turn will repair also all the colocated tables in the keyspace. however if specific tables are requested and some of them are colocated then we should propagate the error to let the user know the request is invalid. Refs https://github.com/scylladb/scylladb/issues/24816 no backport - no colocated tablets in previous releases Closes scylladb/scylladb#26051 * github.com:scylladb/scylladb: nodetool: ignore repair request error of colocated tables storage_service: improve error message on repair of colocated tables	2025-09-19 06:44:23 +03:00
Nadav Har'El	7be5454db1	Merge 'alternator: Store LSI keys in :attrs for newly created tables' from Piotr Wieczorek Previously, LSI keys were stored as separate, top-level columns in the base table. This patch changes this behavior for newly created tables, so that the key columns are stored inside the `:attrs` map. Then, we use top-level computed columns instead of regular ones. This makes LSI storage consistent with GSIs and allows the use of a collection tombstone on `:attrs` to delete all attributes in a row except for keys in new tables. Refs https://github.com/scylladb/scylladb/pull/24991 Refs https://github.com/scylladb/scylladb/issues/6930 Closes scylladb/scylladb#25796 * github.com:scylladb/scylladb: alternator: Store LSI keys in :attrs for newly created tables alternator/test: Add LSI tests based mostly on the existing GSI tests	2025-09-18 21:48:43 +03:00
Karol Nowacki	bc06f89a5c	vector_store_client: Fix cleanup of client_producer factory vector_store_client::stop did not properly clean up the coroutine that was waiting for a notification on the refresh_client_cv condition variable. As a result, the coroutine could try to access `this` (via current_client) after the vector_store_client was destroyed. To fix this, the `client_producer` tasks are wrapped by a gateway. The `stop` method now signals the `client_producer` condition variable and closes the gateway, which ensures that all `client_producer` tasks are finished before the `stop` function returns. The `wait_for_signal` return type was changed from `bool` to `void` as the return value was not used. Fixes: VECTOR-230 Closes scylladb/scylladb#26076	2025-09-18 21:34:34 +03:00
Avi Kivity	fe7e63f109	Merge 'transport: service_level_controller: create and use `driver` service level' from Andrzej Jackowski This patch series: - Increases the number of allowed scheduling groups to allow creation of `sl:driver` - Implements `create_driver_service_level` that creates `sl:driver` with shares=200 if it wasn't already created - Implements creation of `sl:driver` for new systems and tests in `raft_initialize_discovery_leader` - Modifies `topology_coordinator` to create `sl:driver` after upgrades. - Implements using `sl:driver` for new connections in `transport/server` - Adds to `transport/server` recognition of driver's control connections and forcing them to keep using `sl:driver`. - Adds tests to verify the new functionality - Modifies existing tests to let them pass after `sl:driver` is added - Modifies the documentation to contain new `sl:driver` The changes were evaluated by a test with the following scenario ([test_connections-sl-driver.py](https://github.com/user-attachments/files/22021273/test_connections-sl-driver.py)): - Start ScyllaDB with one node - Create 1000 keyspaces, 1 table in each keyspace - Start `cassandra-stress` (`-rate threads=50 -mode native cql3`) - Run connection storm with 1000 sessions (100 python processes, 10 sessions each) The maximum latency during connection storm dropped from 224.94ms to 41.43ms (those numbers are averages from 20 test executions, where max latency was in [140ms, 361ms] before the change and [31.4ms, 61.5ms] after). The snippet of cassandra-stress output from the moment of connection storm: Before: ``` type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb ... 
total, 789206, 85887, 85887, 85887, 0.6, 0.3, 2.0, 2.0, 2.5, 5.0, 9.0, 0.09679, 0, 0, 0, 0, 0, 0 total, 909322, 120116, 120116, 120116, 0.4, 0.2, 1.9, 2.0, 2.1, 3.1, 10.0, 0.09053, 0, 0, 0, 0, 0, 0 total, 964392, 55070, 55070, 55070, 0.9, 0.4, 2.0, 4.5, 7.7, 18.9, 11.0, 0.09203, 0, 0, 0, 0, 0, 0 total, 975705, 11313, 11313, 11313, 4.4, 3.5, 6.5, 24.5, 82.7, 83.0, 12.0, 0.11713, 0, 0, 0, 0, 0, 0 total, 987548, 11843, 11843, 11843, 4.2, 3.5, 6.5, 33.7, 48.6, 51.5, 13.0, 0.13366, 0, 0, 0, 0, 0, 0 total, 995422, 7874, 7874, 7874, 6.3, 4.0, 7.7, 85.6, 112.9, 113.5, 14.0, 0.14753, 0, 0, 0, 0, 0, 0 total, 1007228, 11806, 11806, 11806, 4.3, 3.5, 6.5, 29.1, 43.8, 87.1, 15.0, 0.15598, 0, 0, 0, 0, 0, 0 total, 1012840, 5612, 5612, 5612, 8.2, 5.0, 11.5, 121.8, 166.6, 170.1, 16.0, 0.16535, 0, 0, 0, 0, 0, 0 total, 1016186, 3346, 3346, 3346, 13.4, 7.4, 20.1, 204.9, 207.6, 210.4, 17.0, 0.17405, 0, 0, 0, 0, 0, 0 total, 1025462, 9276, 9276, 9276, 6.3, 3.9, 9.6, 74.6, 206.8, 210.0, 18.0, 0.17800, 0, 0, 0, 0, 0, 0 total, 1035979, 10517, 10517, 10517, 4.8, 3.5, 6.7, 38.5, 82.6, 83.0, 19.0, 0.18120, 0, 0, 0, 0, 0, 0 total, 1047488, 11509, 11509, 11509, 4.3, 3.5, 6.0, 32.6, 72.3, 74.0, 20.0, 0.18334, 0, 0, 0, 0, 0, 0 total, 1077456, 29968, 29968, 29968, 1.7, 1.6, 2.9, 3.6, 7.0, 8.2, 21.0, 0.17943, 0, 0, 0, 0, 0, 0 total, 1105490, 28034, 28034, 28034, 1.8, 1.8, 3.5, 4.6, 5.3, 13.8, 22.0, 0.17609, 0, 0, 0, 0, 0, 0 total, 1132221, 26731, 26731, 26731, 1.9, 1.8, 3.8, 5.2, 8.4, 11.1, 23.0, 0.17314, 0, 0, 0, 0, 0, 0 total, 1162149, 29928, 29928, 29928, 1.7, 1.7, 3.0, 4.5, 8.0, 9.1, 24.0, 0.16950, 0, 0, 0, 0, 0, 0 ... ``` After: ``` type total ops, op/s, pk/s, row/s, mean, med, .95, .99, .999, max, time, stderr, errors, gc: #, max ms, sum ms, sdv ms, mb ... 
total, 822863, 94379, 94379, 94379, 0.5, 0.3, 2.0, 2.0, 2.1, 3.7, 9.0, 0.06669, 0, 0, 0, 0, 0, 0 total, 937337, 114474, 114474, 114474, 0.4, 0.2, 2.0, 2.0, 2.1, 3.4, 10.0, 0.06301, 0, 0, 0, 0, 0, 0 total, 986630, 49293, 49293, 49293, 1.0, 1.0, 2.0, 2.1, 17.9, 19.0, 11.0, 0.07318, 0, 0, 0, 0, 0, 0 total, 1026734, 40104, 40104, 40104, 1.2, 1.0, 2.0, 2.2, 6.3, 7.1, 12.0, 0.08410, 0, 0, 0, 0, 0, 0 total, 1066124, 39390, 39390, 39390, 1.3, 1.0, 2.0, 2.2, 2.6, 3.4, 13.0, 0.09108, 0, 0, 0, 0, 0, 0 total, 1103082, 36958, 36958, 36958, 1.3, 1.1, 2.1, 2.5, 3.1, 4.2, 14.0, 0.09643, 0, 0, 0, 0, 0, 0 total, 1141987, 38905, 38905, 38905, 1.3, 1.0, 2.0, 2.4, 11.4, 12.7, 15.0, 0.09894, 0, 0, 0, 0, 0, 0 total, 1180023, 38036, 38036, 38036, 1.3, 1.0, 2.0, 3.7, 5.6, 7.1, 16.0, 0.10070, 0, 0, 0, 0, 0, 0 total, 1216481, 36458, 36458, 36458, 1.4, 1.0, 2.1, 3.6, 4.7, 5.0, 17.0, 0.10210, 0, 0, 0, 0, 0, 0 total, 1256819, 40338, 40338, 40338, 1.2, 1.0, 2.0, 2.2, 3.5, 5.4, 18.0, 0.10173, 0, 0, 0, 0, 0, 0 total, 1295122, 38303, 38303, 38303, 1.3, 1.0, 2.0, 2.4, 21.0, 21.1, 19.0, 0.10136, 0, 0, 0, 0, 0, 0 total, 1334743, 39621, 39621, 39621, 1.3, 1.0, 2.0, 2.3, 3.3, 4.0, 20.0, 0.10055, 0, 0, 0, 0, 0, 0 total, 1375579, 40836, 40836, 40836, 1.2, 1.0, 2.0, 2.1, 3.4, 5.7, 21.0, 0.09927, 0, 0, 0, 0, 0, 0 total, 1415576, 39997, 39997, 39997, 1.2, 1.0, 2.0, 2.3, 3.2, 4.1, 22.0, 0.09807, 0, 0, 0, 0, 0, 0 total, 1449268, 33692, 33692, 33692, 1.5, 1.4, 2.5, 3.2, 4.2, 5.6, 23.0, 0.09800, 0, 0, 0, 0, 0, 0 total, 1471873, 22605, 22605, 22605, 2.2, 2.0, 4.8, 5.9, 7.0, 7.9, 24.0, 0.10015, 0, 0, 0, 0, 0, 0 ... ``` Fixes: https://github.com/scylladb/scylladb/issues/24411 This is a new feature, so no backport needed. 
Closes scylladb/scylladb#25412 * github.com:scylladb/scylladb: docs: workload-prioritization: add driver service level test: add test to verify use of `sl:driver` transport: use `sl:driver` to handle driver's control connections transport: whitespace only change in update_scheduling_group transport: call update_scheduling_group for non-auth connections generic_server: transport: start using `sl:driver` for new connections test: add test_desc_* for driver service level test: service_levels: add tests for sl:driver creation and removal test: add reload_raft_topology_state() to ScyllaRESTAPIClient service_level_controller: automatically create `sl:driver` service_level_controller: methods to create driver service level service_level_controller: handle special sl:driver in DESC output topology_coordinator: add service_level_controller reference system_keyspace: add service_level_driver_created test: add MAX_USER_SERVICE_LEVELS	2025-09-18 19:45:17 +03:00
Karol Nowacki	b5f3f2f4c5	tools: Fix missing source file in CMake target The `json_mutation_stream_parser.cc` file was not included in the `scylla-tools` CMake target. This could lead to "undefined reference" linker errors when building with CMake. This commit adds the missing source file to the target's source list. Closes scylladb/scylladb#26108	2025-09-18 19:44:53 +03:00
Radosław Cybulski	c0db278c03	Don't report spurious keys in DescribeTable Alternator, when creating a GSI, artificially adds columns that the user did not ask for. This patch prevents those columns from showing up in DescribeTable's output. Fixes #5320 Closes scylladb/scylladb#25978	2025-09-18 19:34:39 +03:00
Radosław Cybulski	6240006c5a	Fix spelling errors Closes scylladb/scylladb#26112	2025-09-18 17:37:31 +02:00
Patryk Jędrzejczak	5efc46152c	Merge 'raft_topology: Modify the conditional logic in remove node operation …' from Abhinav Kumar Jha In the current scenario, the shard receiving the remove node REST api request takes a conditional lock depending on whether raft is enabled or not. Since a non-zero shard returns false for `raft_topology_change_enabled()`, the requests routed to non-zero shards are prone to this lock, which is unnecessary and hampers the ability to perform concurrent operations, which is possible for raft enabled nodes. This pr modifies the conditional lock logic and routes the remove node execution logic directly to shard0, hence `raft_topology_change_enabled()` is now checked on shard0 and execution is performed accordingly. Earlier, the `storage_service::find_raft_nodes_from_hoeps` code threw an error upon observing any non topology member present in ignore_nodes. Since we are performing concurrent remove node operations, the timing can lead to one node being fully removed before the other node remove op begins processing, which can lead to a runtime error in storage_service::find_raft_nodes_from_hoeps. This error throw was added to prevent users from putting random non-existent nodes in the ignore_nodes list. Hence made changes in that function to account for already removed nodes and ignore those nodes instead of throwing an error. A test is also added to confirm the new behaviour, where concurrent remove node operations are now being performed seamlessly. This pr doesn't fix a critical bug. No need to backport it. Fixes: scylladb/scylladb#24737 Closes scylladb/scylladb#25713 * https://github.com/scylladb/scylladb: raft_topology: Modify the conditional logic in remove node operation to enhance concurrency for raft enabled clusters. storage_service: remove assumptions and checks for ignore_nodes to be normal.	2025-09-18 17:27:59 +02:00
Nadav Har'El	27c1545340	test/alternator: fix test_list_tables_paginated on DynamoDB Our list_tables() utility function, used by the test test_table.py::test_list_tables_paginated, asserts that empty pages cannot be returned by ListTables - and in fact neither DynamoDB nor Alternator returns them. But it turns out this is only true on DynamoDB's us-east-1 region, and in the eu-north-1 region, ListTables when using Limit=1 can actually return an empty last page. So let's just drop that unnecessary assertion as being wrong. In any case, this assert was in a utility function, not a test, which probably wasn't a great idea in the first place.	2025-09-18 17:46:34 +03:00
Piotr Dulikowski	fb0e5784e4	mv: fix typo in start_backgroud_fibers Letter "n" was missing in this name.	2025-09-18 15:50:16 +02:00
Piotr Dulikowski	261f61d303	mv: run view building worker fibers in streaming group The background fibers of the view building worker are indirectly spawned by the main function, thus the fibers inherit the "main" scheduling group. The main scheduling group is not supposed to be used for regular work, only for initialization and deinitialization, so this is wrong. Wrap the call to `start_backgroud_fibers()` with `with_scheduling_group` and use the streaming scheduling group. The view building worker already handles RPCs in the streaming scheduling group (which do most of the work; background fibers only do some maintenance), so this seems like a good fit.	2025-09-18 15:42:36 +02:00
Nadav Har'El	284284bf83	test/alternator: fix tests in test_tag.py on DynamoDB Until August 2024, DynamoDB's "TagResource" operation was synchronous - when it returned the tags were available for read. This is no longer true, as the new documentation says and we see in practice with many test_tag.py tests failing on DynamoDB. Not only can we not read the new tags without waiting, we're not allowed to change other tags or even to delete the table without waiting. We don't need to fix Alternator for this new behavior - there is (surprisingly!) no new API to check if the tag change took effect, and it's perfectly fine that in Alternator the tags take effect immediately (when TagResource returns) and not a few seconds later. But we do need to fix most test_tag.py tests to work with the new asynchronous API. The fix introduces convenience functions tag_resource() and untag_resource() which perform the TagResource or UntagResource operation, but also wait until the change took effect by trying ListTagsOfResource until the desired change is visible. This will make failed tests wait until the timeout (60 seconds), but that's fine - we don't expect to have failed tests. After this fix all tests in test/alternator/test_tag.py pass on DynamoDB (and continue passing on Alternator). Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-18 16:38:09 +03:00
Łukasz Paszkowski	d42d4a05fb	disk_space_monitor_test.cc: Start a monitor after fake space source function is registered When the monitor is started, the first disk utilization value is obtained from the actual host filesystem and not from the fake space source function. Thus, register a fake space source function before the monitor is started. Fixes: https://github.com/scylladb/scylladb/issues/26036 Backport is not required. The test has been added recently. Closes scylladb/scylladb#26054	2025-09-18 15:03:34 +03:00
Piotr Dulikowski	5f55787e50	Merge 'CDC with tablets' from Michael Litvak initial implementation to support CDC in tablets-enabled keyspaces. The design is described in https://docs.google.com/document/d/1qO5f2q5QoN5z1-rYOQFu6tqVLD3Ha6pphXKEqbtSNiU/edit?usp=sharing It is followed closely for the most part except "Deciding when to change streams" - instead, streams are changed synchronously with tablet split / merge. Instead of the stream switching algorithm with the double writes, we use a scheme similar to the previous method for vnodes - we add the new streams with timestamp that is sufficiently far into the future. In this PR we: * add new group0-based internal system tables for tablet stream metadata and loading it into in-memory CDC metadata * add virtual tables for CDC consumers * the write coordinator chooses a stream by looking up the appropriate stream in the CDC metadata * enable creating tables with CDC enabled in tablets-enabled keyspaces. tablets are allocated for the CDC table, and a stream is created per each tablet. * on tablet resize (split / merge), the topology coordinator creates a new stream set with a new stream for each new tablet. 
* the cdc tablets are co-located with the base tablets Fixes https://github.com/scylladb/scylladb/issues/22576 backport not needed - new feature update dtests: https://github.com/scylladb/scylla-dtest/pull/5897 update java cdc library: https://github.com/scylladb/scylla-cdc-java/pull/102 update rust cdc library: https://github.com/scylladb/scylla-cdc-rust/pull/136 Closes scylladb/scylladb#23795 * github.com:scylladb/scylladb: docs/dev: update CDC dev docs for tablets doc: update CDC docs for tablets test: cluster_events: enable add_cdc and drop_cdc test/cql: enable cql cdc tests to run with tablets test: test_cdc_with_alter: adjust for cdc with tablets test/cqlpy: adjust cdc tests for tablets test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests cdc: enable cdc with tablets topology coordinator: change streams on tablet split/merge cdc: virtual tables for cdc with tablets cdc: generate_stream_diff helper function cdc: choose stream in tablets enabled keyspaces cdc: rename get_stream to get_vnode_stream cdc: load tablet streams metadata from tables cdc: helper functions for reading metadata from tables cdc: colocate cdc table with base cdc: remove streams when dropping CDC table cdc: create streams when allocating tablets migration_listener: add on_before_allocate_tablet_map notification cdc: notify when creating or dropping cdc table cdc: move cdc table creation to pre_create cdc: add internal tables for cdc with tablets cdc: add cdc_with_tablets feature flag cdc: add is_log_schema helper	2025-09-18 13:39:37 +02:00
Ernest Zaslavsky	d6aa04b88a	serialization: Eliminate `cql_serialization_format.hh` Eliminate `cql_serialization_format.hh` file by inlining it into `query-request.hh` header since the content is not used anywhere but the aforementioned header Removed files: - cql_serialization_format.hh Fixes: #22108 This is a cleanup, no need to backport Closes scylladb/scylladb#25087	2025-09-18 13:17:56 +03:00
Nadav Har'El	3afe078d24	test/alternator: fix test_health_only_works_for_root_path on DynamoDB test_health.py::test_health_only_works_for_root_path checks that while http://ourserver/ is a valid health-check URL, taking other silly strings at the end, like http://ourserver/abc - is NOT valid and results in an error. It turns out that for one of the silly strings we chose to test, "/health", DynamoDB started recently NOT to return an error, and instead return an empty but successful response. In fact, it does this for every string starting with /health - including "/healthz". Perhaps they did this for some sort of Kubernetes compatibility, but in any case this behavior isn't documented and we don't need to emulate it. For now, let's just remove the string "/health" from our test so the test doesn't fail on DynamoDB. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-18 12:48:42 +03:00
Nadav Har'El	5bd503ad43	test/alternator: reproducer tests for faux GSI range key problem In issue #5320 we noticed that when we have a GSI with a hash key only (no range key) but the implementation's MV needs to add a clustering key for the original base key columns, the output of DescribeTable wrongly lists that extra "range key" - which isn't a real range key of the GSI. It turns out that the fact that the extra attribute is not a real GSI range key has another implication: It should not be allowed in KeyConditions or KeyConditionExpression - which should allow only real key columns of the GSI. This patch adds two reproducing tests for this issue (issue #2601), both pass on DynamoDB but xfail on Alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-18 12:17:21 +03:00
Avi Kivity	f6b6312cf4	Merge 'sstables/trie: prepare for integrating BTI indexes with sstable readers and writers' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25626 Next parts: introducing the new components, Partitions.db and Rows.db This is the preparatory, uncontroversial part of https://github.com/scylladb/scylladb/pull/26039, which has been split out to a separate PR to make the main part (which, after a revision, will be posted later) smaller. This series contains several small fixes and changes to BTI-related code added earlier, which either have to be done (i.e. propagating `reader_permit` to IO calls in index reads) or just deserved to be done. There's no single theme for the changes in this PR, refer to the individual commits for details. The changes are for the sake of new and unreleased code. No backporting should be done. 
Closes scylladb/scylladb#26075 * github.com:scylladb/scylladb: sstables/mx/reader: remove mx::make_reader_with_index_reader test/boost/bti_index_test: fix indentation sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file sstables/trie: support reader_permit and trace_state properly sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header sstables/trie/bti_index_reader: support BYPASS CACHE test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate sstables/trie: change the signature of bti_partition_index_writer::finish sstables/bti_index: improve signatures of special member functions in index writers streaming/stream_transfer_task: coroutinize `estimate_partitions()` types/comparable_bytes: add a missing implementation for date_type_impl sstables: remove an outdated FIXME storage_service: delete `get_splits()` sstables/trie: fix some comment typos in bti_index_reader.cc sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone	2025-09-18 12:10:27 +03:00
Botond Dénes	edaf67edcb	tools/scylla-sstable: remove writetime-histogram command This command was written for an investigation and was used exactly once. This would have been a perfect candidate for the (also rarely used) scylla-sstable script command, but it didn't exist yet. Drop this command from the tool, such super-specific commands should be written as sstable-scripts nowadays, which is what we will do if we ever need this again. Closes scylladb/scylladb#26062	2025-09-18 12:05:54 +03:00
Nadav Har'El	0b30688641	test/alternator: fix test "test_17119a" to pass on DynamoDB As noticed in issue #26079 the Alternator test test_gsi.py::test_17119a fails on DynamoDB. The problem was that the test added to KeyConditions reading from a GSI an unnecessary attribute - one which was accidentally allowed by Alternator (Refs #26103) but not allowed by DynamoDB. This is easy to fix - just remove the unnecessary attribute from KeyConditions, and the test still works properly and passes on both DynamoDB and Alternator. Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-18 11:38:35 +03:00
Piotr Dulikowski	4ed045a15c	Merge 'db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr`' from Michał Jadwiszczak When a staging sstable is registered to view building worker, it needs to make a round trip from its original shard to shard 0 (in order to create a view building task) and back (to be eventually processed). Until now this was done using plain `sstables::shared_sstable` (= `lw_shared_ptr`) which is not safe to be moved between shards. This patch fixes this by wrapping the pointer in `foreign_ptr` and obtaining the necessary information (owner shard, last token) on the original shard (instead of on shard0). Then all of those objects are put into a freshly introduced structure `staging_sstable_task_info`, which can be safely moved between shards. Fixes https://github.com/scylladb/scylladb/issues/25859 View building coordinator isn't present in any release yet, no backport needed. Closes scylladb/scylladb#25832 * github.com:scylladb/scylladb: db/view/view_building_worker: fix indent db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr` db/view/view_building_worker: use table id in `register_staging_sstable_tasks()` db/view/view_building_worker: move helper functions higher	2025-09-18 10:24:27 +02:00
Piotr Dulikowski	b71af71ab5	Merge 'db/view/view_building_worker: change `sharded<abort_source>` to local `abort_source`' from Michał Jadwiszczak Previously the sharded abort_sources were stopped at the end of `batch::do_work()`, which works in parallel with the view building worker main loop. This leads to races because the worker may call `batch::abort()`, which accesses the abort_sources. This patch solves this by changing `sharded<abort_source>` into `abort_source`. Since `batch::do_work()` is now executed on the tasks' shard, all abort source checks are also done on the tasks' shard. The only place where shard0 uses the abort source is `batch::abort()`, but this method now does `smp::submit_to(replica.shard, [request abort])`, so the abort source is used on the tasks' shard exclusively. Fixes https://github.com/scylladb/scylladb/issues/25805 Fixes https://github.com/scylladb/scylladb/issues/26045 View building coordinator hasn't been released yet, so no backport needed. Closes scylladb/scylladb#26059 * github.com:scylladb/scylladb: db/view/view_building_worker: fix indents db/view/view_building_worker: change `sharded<abort_source>` to local `abort_source` db/view/view_building_worker: execute entire `batch::do_work` on tasks shard db/view/view_building_worker: store reference to sharded worker in batch	2025-09-18 10:11:20 +02:00
Michael Litvak	aae91330b0	nodetool: ignore repair request error of colocated tables when cluster repair is run for an entire keyspace, nodetool makes a repair api request for each table. if the keyspace contains colocated tables, then the api request for the colocated tables will fail, because currently scylla doesn't allow making repair requests for specific colocated tables, but only for base tables. if the request is to repair an entire keyspace then we can ignore this, because we will make a repair request for all base tables, and this in turn will repair also all the colocated tables in the keyspace. however if specific tables are requested and some of them are colocated then we should propagate the error to let the user know the request is invalid. Refs scylladb/scylladb#24816	2025-09-18 09:35:53 +02:00
Michael Litvak	eeaa64ca0e	storage_service: improve error message on repair of colocated tables currently repair requests can't be added or deleted on non-base colocated tables. improve the error message and comments to be more clear and detailed.	2025-09-18 09:35:53 +02:00
Andrzej Jackowski	757dca3bc8	docs: workload-prioritization: add driver service level Refs: scylladb/scylladb#24411	2025-09-18 09:29:37 +02:00
Andrzej Jackowski	452313f5a5	test: add test to verify use of `sl:driver` `sl:driver` is expected to be used for new and control connections, but other connections that run user load should not use it after the user is authenticated. Refs: scylladb/scylladb#24411	2025-09-18 09:29:37 +02:00
Andrzej Jackowski	c02535635e	transport: use `sl:driver` to handle driver's control connections Before `sl:driver` was introduced, service levels were assigned as follows: 1. New connections were processed in `main`. 2. After user authentication was completed, the connection's SL was changed to the user's SL (or `sl:default` if the user had no SL). This commit introduces `service_level_state` to `client_state` and implements the following logic in `transport/server`: 1. If `sl:driver` is not present in the system (for example, it was removed), service levels behave as described above. 2. If `sl:driver` is present, the flow is: I. New connections use `sl:driver`. II. After user authentication is completed, the connection's SL is changed to the user's SL (or `sl:default`). III. If a REGISTER (to events) request is handled, the client is processing the control connection. We mark the client_state to permanently use `sl:driver`. The aforementioned state `2.III` is represented by `_control_connection` flag in `client_state`. Fixes: scylladb/scylladb#24411	2025-09-18 09:29:37 +02:00
Andrzej Jackowski	49aa7613ae	transport: whitespace only change in update_scheduling_group The indentation is changed because it will be required in the next commit of this patch series.	2025-09-18 09:29:37 +02:00
Andrzej Jackowski	43472e8633	transport: call update_scheduling_group for non-auth connections Before this change, unauthorized connections stayed in the `main` scheduling group. This is not ideal: in such a case `sl:default` should be used instead, for consistency with the scenario where a user is authenticated but has no service level assigned. This commit adds a call to `update_scheduling_group` at the end of connection creation for an unauthenticated user, to make sure the service level is switched to `sl:default`. Fixes: scylladb/scylladb#26040	2025-09-18 09:29:37 +02:00
Andrzej Jackowski	1ad483749a	generic_server: transport: start using `sl:driver` for new connections Before this change, new connections were handled in a default scheduling group (`main`), because before the user is authenticated we do not know which service level should be used. With the new `sl:driver` service level, creation of new connections can be moved to `sl:driver`. We switch the service level as early as possible, in `do_accepts`. There is a possibility, that `sl:driver` will not exist yet, for instance, in specific upgrade cases, or if it was removed. Therefore, we also switch to `sl:driver` after a connection is accepted. Refs: scylladb/scylladb#24411	2025-09-18 09:29:29 +02:00
Andrzej Jackowski	e1b4a338ba	test: add test_desc_* for driver service level Driver service level is a special service level that is created automatically by the system. Therefore, it requires special handling in DESC SCHEMA WITH INTERNALS, and these tests verify the special behavior. Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	43a0eb7b0b	test: service_levels: add tests for sl:driver creation and removal Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	4af270a271	test: add reload_raft_topology_state() to ScyllaRESTAPIClient To encapsulate `/storage_service/raft_topology/reload` API call	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	6f678a2d1f	service_level_controller: automatically create `sl:driver` This commit: - Increases the number of allowed scheduling groups to allow the creation of `sl:driver`. - Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating `sl:driver` until all nodes have increased the number of scheduling groups. - Starts using `get_create_driver_service_level_mutations` to unconditionally create `sl:driver` on `raft_initialize_discovery_leader`. The purpose of this code path is ensuring existence of `sl:driver` in new system and tests. - Starts using `migrate_to_driver_service_level` to create `sl:driver` if it is not already present. The creation of `sl:driver` is managed by `topology_coordinator`, similar to other system keyspace updates, such as the `view_builder` migration. The purpose of this code path is handling upgrades. - Modifies related tests to pass after `sl:driver` is added. Later in this patch series, `sl:driver` will be used by `transport/server` to handle selected traffic, such as the driver's schema and topology fetches. Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	6a911bff3f	service_level_controller: methods to create driver service level This commit implements `get_create_driver_service_level_mutations` and `migrate_to_driver_service_level` in service_level_controller. Both methods create `sl:driver` with shares=200 and store this fact in `system.scylla_local`. Both methods will be used later in this patch series for automatic creation of sl:driver. Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	5cb4577800	service_level_controller: handle special sl:driver in DESC output Later in this patch series, `sl:driver` will be added as a special service level created automatically by the system. It needs special handling in `DESC SCHEMA ...` to ensure that during backup restore: 1. CREATE SERVICE LEVEL does not fail if `sl:driver` already exists 2. If `sl:driver` exists, its configuration is fully restored (emit ALTER SERVICE LEVEL). 3. If `sl:driver` was removed, the information is retained (emit DROP SERVICE LEVEL instead of CREATE/ALTER). Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	09c8f67e69	topology_coordinator: add service_level_controller reference This adds a reference to sl_controller so that, later in this patch series, topology_coordinator can manage creating `sl:driver` once group0 is fully operational. Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	dd9b4c64d2	system_keyspace: add service_level_driver_created This commit extends the system.scylla_local table with an additional key/value pair that can be used later in this patch series to record that `sl:driver` was already created. The purpose of storing this information is to ensure that `sl:driver` is not recreated after being intentionally removed. A new mutation is included in `register_raft_pull_snapshot` to keep `service_level_driver_created` in the state machine snapshot, which is required for proper propagation of the value when a new node is added to the cluster. Refs: scylladb/scylladb#24411	2025-09-18 09:28:32 +02:00
Andrzej Jackowski	d30590c1d0	test: add MAX_USER_SERVICE_LEVELS Previously, tests used the hardcoded value 7 for the maximum number of user service levels. This commit introduces a named variable that can be shared across tests to avoid cases where this magic number goes out of sync.	2025-09-18 09:28:32 +02:00
Nadav Har'El	22f88bff30	test/alternator: fix test to pass on DynamoDB As noticed in issue #26079, the Alternator test test_number.py::test_invalid_numbers failed on DynamoDB, because one of the things it did, as a "sanity check", was to check that the number 0e1000 was a valid number. But it turns out it isn't allowed by DynamoDB. So this patch removes 0e1000 from the list of valid numbers in test_invalid_numbers, and instead creates a whole new test for the case of 0e1000. It turns out that DynamoDB has a bug (it appears to be a regression, because test_invalid_numbers used to pass on DynamoDB!) where it allows 0.0e1000 (since it's just zero, really!) but forbids 0e1000 which is incorrectly considered to have a too-large magnitude. So we introduce a test that confirms that Alternator correctly allows both 0.0e1000 and 0e1000. DynamoDB fails this test (it allows the first, forbidding the second), making it the first Alternator test tagged as a "dynamodb_bug". Signed-off-by: Nadav Har'El <nyh@scylladb.com>	2025-09-18 10:28:01 +03:00
Pawel Pery	12f04edf22	vector_store_client: rename embedding into vs_vector According to the changes in Vector Store API (VECTOR-148) the `embedding` term should be changed to `vector`. As `vector` term is used for STL class the internal type or variable names would be changed to `vs_vector` (aka vector store vector). This patch changes also the HTTP ann json request payload according to the Vector Store API changes. Fixes: VECTOR-229 Closes scylladb/scylladb#26050	2025-09-18 08:45:46 +03:00
Ferenc Szili	de5dab8429	docs: add capacity based balancing explanation Capacity based balancing was introduced in 2025.1. It computes balance based on a node's capacity: the number of tablets located on a node should be directly proportional to that node's storage capacity. This change adds this explanation to the docs. Fixes: #25686 Closes scylladb/scylladb#25687	2025-09-18 08:14:04 +03:00
Yauheni Khatsianevich	adc4a4e15a	test: new test for LWT testing during tablets migration E2E test runs multi-column CAS workload (LOCAL_QUORUM/LOCAL_SERIAL) while tablets are repeatedly migrated between nodes. Uncertainty timeouts are resolved via LOCAL_SERIAL reads; guards use max(row, lower_bound). Final assertion: s{i} per (pk,i) equals the count of confirmed CAS by worker i (no lost/phantom updates) despite tablet moves. Closes scylladb/scylladb#25402	2025-09-18 08:11:12 +03:00
Ernest Zaslavsky	ddf2588985	treewide: Move replica related files to `replica` directory As requested in #22099, moved the files and fixed other includes and build system. Moved files: - cache_temperature.hh - cell_locking.hh Fixes: #22099 Closes scylladb/scylladb#25079	2025-09-18 08:00:35 +03:00
Pavel Emelyanov	65638232e8	Merge 'utils: azure: Catch system errors when probing IMDS and bump the verbosity of logs' from Nikos Dragazis This PR fixes a bug in the Azure default credential provider that would cause the `test_azure_provider_with_incomplete_creds` unit test to be flaky. The provider would assume that an unreachable IMDS endpoint would always result in a timeout, but network errors are also possible (e.g., ICMP "host unreachable"). The issue is triggered by this particular test because it sets the IMDS endpoint to a non-routable address. Some routers choose to silently drop such packets, while others return ICMP errors. To fix it, the default credential provider has been updated to catch system errors as well. This PR also raises the log level of the default credential provider from DEBUG to INFO, making it easier for operators to diagnose authentication issues. More details in the commit messages. Fixes #25641. Closes scylladb/scylladb#25696 * github.com:scylladb/scylladb: utils: azure: Catch system errors when detecting IMDS utils: azure: Bump default credential logs from DEBUG to INFO	2025-09-18 07:43:00 +03:00
Nadav Har'El	3c969e2122	cql: document and test permissions on materialized views and CDC We were recently surprised (in pull request #25797) to "discover" that Scylla does not allow granting SELECT permissions on individual materialized views. Instead, all materialized views of a base table are readable if the base table is readable. In this patch we document this fact, and also add a test to verify that it is indeed true. As usual for cqlpy tests, this test can also be run on Cassandra - and it passes showing that Cassandra also implemented it the same way (which isn't surprising, given that we probably copied our initial implementation from them). The test demonstrates that neither Scylla nor Cassandra prints an error when attempting to GRANT permissions on a specific materialized view - but this GRANT is simply ignored. This is not ideal, but it is the existing behavior in both and it's not important now to change it. Additionally, because pull request #25797 made CDC-log permissions behave the same as materialized views - i.e., you need to make the base table readable to allow reading from the CDC log, this patch also documents this fact and adds a test for it also. Fixes #25800 Closes scylladb/scylladb#25827	2025-09-18 07:41:35 +03:00
Botond Dénes	839056b648	docs/operating-scylla: scylla-sstable.rst: update write docs scylla-sstable write (and scrub) moved to UUID generations in `514f59d157`, but said patch forgot to update the docs. This is fixed here. Closes scylladb/scylladb#25965	2025-09-18 07:39:50 +03:00
Ernest Zaslavsky	c9c245c756	rest_client: set `version` on http::request to avoid invalid state Upcoming changes in Seastar cause `rest::simple_send` to move the `http::request` into `seastar::http::experimental::client::make_request` when called multiple times. This leaves the original request in an invalid state. Specifically, the `_version` field becomes empty, causing request validation to fail. This patch ensures `version` is explicitly set to prevent such failures. Fixes: https://github.com/scylladb/scylladb/issues/26018 Closes scylladb/scylladb#26066	2025-09-18 07:36:25 +03:00
Michał Jadwiszczak	1d8b41a51d	db/view/view_building_worker: fix indents	2025-09-18 03:24:43 +02:00
Michał Jadwiszczak	99db5a6c30	db/view/view_building_worker: change `sharded<abort_source>` to local `abort_source` Previously the sharded abort_source was stopped at the end of batch::do_work(), which runs in parallel with the view building worker main loop. This leads to races because the worker may call batch::abort(), which accesses the abort sources. This patch solves this by changing `sharded<abort_source>` into `abort_source`. Since now `batch::do_work()` is executed on tasks' shard, all abort source checks are also done on tasks' shard. The only place where shard0 uses the abort source is `batch::abort()`, but this method now does `smp::submit_to(replica.shard, [request abort])`, so the abort source is used on tasks' shard exclusively. Fixes scylladb/scylladb#25805 Fixes scylladb/scylladb#26045	2025-09-18 03:24:43 +02:00
Michał Jadwiszczak	7b9db335c0	db/view/view_building_worker: execute entire `batch::do_work` on tasks shard This change will allow us to get rid of problematic `sharded<abort_source>` and use local `abort_source` instead.	2025-09-18 03:24:43 +02:00
Michał Jadwiszczak	2f65af8aa7	db/view/view_building_worker: store reference to sharded worker in batch Change reference to view building worker in batch to sharded container. In next commits, I'm going to execute `do_work()` exclusively on tasks target shard and sharded reference will be more useful.	2025-09-18 03:24:43 +02:00
Michał Jadwiszczak	e4a0de53ea	db/view/view_building_worker: fix indent	2025-09-18 02:57:36 +02:00
Michał Jadwiszczak	7dfb76f9a7	db/view/view_building_worker: wrap `shared_sstable` in `foreign_ptr` When a staging sstable is registered to view building worker, it needs to make a round trip from its original shard to shard 0 (in order to create a view building task) and back (to be eventually processed). Until now this was done using plain `sstables::shared_sstable` (= `lw_shared_ptr`) which is not safe to be moved between shards. This patch fixes this by wrapping the pointer in `foreign_ptr` and obtains necessary information (owner shard, last token) on the original shard (instead of on shard0). Then all of those objects are put into freshly introduced structure `staging_sstable_task_info`, which can be safely moved between shards. Fixes scylladb/scylladb#25859	2025-09-18 02:57:36 +02:00
Michał Jadwiszczak	50678030c0	db/view/view_building_worker: use table id in `register_staging_sstable_tasks()` There is no need to pass the pointer only to get id of the table.	2025-09-18 02:57:35 +02:00
Michał Jadwiszczak	b44c223d47	db/view/view_building_worker: move helper functions higher So they can be used in `view_building_worker::register_staging_sstable_tasks()`.	2025-09-18 02:57:35 +02:00
Wojciech Mitros	f17beba834	load_balancer: include dead nodes when calculating rack load Load balancer aims to preserve a balance in rack loads when generating tablet migrations. However, this balance might get broken when dead nodes are present. Currently, these nodes aren't included in rack load calculations, even if they own tablet replicas. As a result, load balancer treats racks with dead nodes as racks with a lower load, so it generates migrations to these racks. This is incorrect, because a dead node might come back alive, which would result in having multiple tablet replicas on the same rack. It's also inefficient even if we know that the node won't come back - when it's being replaced or removed. In that case we know we are going to rebuild the lost tablet replicas so migrating tablets to this rack just doubles the work. Allowing such migrations to happen would also require adjustments in the materialized view pairing code because we'd temporarily allow having multiple tablet replicas on the same rack. So in this patch we include dead nodes when calculating rack loads in the load balancer. The dead nodes still aren't treated as potential migration sources or destinations. We also add a test which verifies that no migrations are performed by doing a node replace with a mv workload in parallel. Before the patch, we'd get pairing errors and after the patch, no pairing errors are detected. Fixes https://github.com/scylladb/scylladb/issues/24485 Closes scylladb/scylladb#26028	2025-09-17 20:49:18 +02:00
Avi Kivity	3acfc577d8	Merge 'tools/scylla-sstable: extract json mutation stream parser into own hh,cc' from Botond Dénes tools/scylla-sstable.cc has 3.5k SLOC, out of which this class alone is 1K. Extract into own hh and cc. Since this class was already using pimpl, the header remains nice and small. Code cleanup, no backport needed. Closes scylladb/scylladb#26064 * github.com:scylladb/scylladb: tools: extract json_mutation_stream_parser to its own hh,cc files tools/scylla-sstable: fix indentation tools/scylla-sstable: prepare for extracting json_mutation_stream_parser	2025-09-17 18:30:30 +03:00
Ernest Zaslavsky	54aa552af7	treewide: Move type related files to a `type` directory As requested in #22110 , moved the files and fixed other includes and build system. Moved files: - duration.hh - duration.cc - concrete_types.hh Fixes: #22110 This is a cleanup, no need to backport Closes scylladb/scylladb#25088	2025-09-17 17:32:19 +03:00
Ernest Zaslavsky	a1f18a8883	treewide: Move schema related files to a `schema` directory As requested in #22111 , moved the files and fixed other includes and build system. Moved files: - frozen_schema.hh - frozen_schema.cc - schema_mutations.hh - schema_mutations.cc - column_computation.hh Fixes: #22111 Closes scylladb/scylladb#25089	2025-09-17 17:31:05 +03:00
Botond Dénes	bde7d8ddbd	Merge 'service: pass current session_id to repair rpc' from Aleksandra Martyniuk Currently, in repair_tablet we retrieve session_id from tablet map (and throw if it isn't specified). In case of topology coordinator failover, we may end up in a situation where a node runs outdated repair, treating session of a different operation as the repair's session: - topology coordinator starts repair transition (A); - topology coordinator sends tablet repair rpc to node1; - topology coordinator is separated from the cluster; - new topology coordinator is elected; - new topology coordinator sees waiting repair request (A_2) and executes it; - new repair of the same tablet is requested (B); - new topology coordinator starts repair transition (B); - new topology coordinator sends tablet repair rpc to node2; - node2 starts repair (B) as repair master; - node1 starts repair (A), checks the current session (B), proceeds with repair (B) as repair master. Send current session_id in repair_tablet rpc. If this session_id and session id got from tablet map don't match, an exception is thrown. Fixes: https://github.com/scylladb/scylladb/issues/23318. No backport; changes in rpc signatures Closes scylladb/scylladb#25369 * github.com:scylladb/scylladb: test: check that repair with outdated session_id fails service: pass current session_id to repair rpc	2025-09-17 17:28:35 +03:00
Botond Dénes	cc5153ef8c	Merge 'db: cache: consider preempting after each partition' from Aleksandra Martyniuk Currently, during cache invalidation we check if we need to preempt only after the partition gets invalidated. This may lead to stalls if we have a chain of filtered out partitions. Check for preemption even if the partition does not get invalidated. Refs: https://github.com/scylladb/scylladb/issues/9136. Optimization; no backport Closes scylladb/scylladb#26053 * github.com:scylladb/scylladb: db: fix indentation db: cache: consider preempting after each partition	2025-09-17 17:26:29 +03:00
Botond Dénes	a8d22a66fa	Merge 'Improve Encryption at Rest documentation' from Nikos Dragazis This PR introduces a major rewrite of the EaR document. The initial motivation for this PR was to fully cover all our supported key providers with working examples, and to add instructions for key rotation. However, many other improvements were made along the way. Main changes in this PR: * Add a high-level description for every key provider. Mention limitations. * Better organize existing provider-specific instructions by placing them into clearly separated, tabbed sections. * Add instructions for the replicated key provider. Mention explicitly that it cannot be used as default option for user or system encryption, and that it does not support key rotation. * Add more examples for KMS and GCP to cover all credential types. * Document missing configuration options. * Add a new section for key rotation. Notes: * Some of the patches in this series have been cherry-picked from Laszlo's wip branch. * This PR is expected to conflict with the Azure Key Vault PR, which should be merged first. (https://github.com/scylladb/scylladb/pull/23920/) * Support for KMIP system keys in the Replicated Key Provider is currently broken. (https://github.com/scylladb/scylladb/issues/24443) Fixes scylladb/scylla-enterprise#3535. Refs scylladb/scylla-enterprise#3183. Only doc changes. No backport is needed. 
Closes scylladb/scylladb#24558 * github.com:scylladb/scylladb: encryption-at-rest.rst: add "Rotate Encryption Keys" section encryption-at-rest.rst: rewrite "Encrypt System Resources" section encryption-at-rest.rst: rewrite "Update Encryption Properties of Existing Tables" section encryption-at-rest.rst: rewrite "Encrypt a Single Table" section encryption-at-rest.rst: rewrite "Encrypt Tables" section encryption-at-rest.rst: update "Set the Azure Host" section encryption-at-rest.rst: update "Set the GCP Host" section encryption-at-rest.rst: update "Set the KMS Host" section encryption-at-rest.rst: update "Set the KMIP Host" section encryption-at-rest.rst: rewrite "Create Encryption Keys" section encryption-at-rest.rst: rewrite "Key Providers" section encryption-at-rest.rst: hoist and update "Cipher Algorithm Descriptors" encryption-at-rest.rst: rewrite/replace section "Encryption Key Types" encryption-at-rest.rst: About: describe high-level operation more precisely encryption-at-rest.rst: improve wording / formatting in About intro encryption-at-rest.rst: users (plural) typo fix encryption-at-rest.rst: rewrap encryption-at-rest.rst: strip trailing whitespace	2025-09-17 17:25:25 +03:00
Nadav Har'El	d63fdd1e8b	test/cqlpy: fix run-cassandra to run with Java 21 The script test/cqlpy/run-cassandra aims to make it easy to run any version of Cassandra using whatever version of Java the user has installed. Sadly, the fact that Java keeps changing and the Cassandra developers are very slow to adapt to new Javas makes doing this non-trivial. This patch makes it possible for run-cassandra to run Cassandra 5 on the Java 21 that is now the default on Fedora 42. Fedora 42 no longer carries antique versions of Java (like Java 8 or 11), not even as an optional package. Sadly, even with this patch it is not possible to run older versions of Cassandra (4 and 3) with Java 21, because the new Java is missing features such as Netty that the older Cassandra require. But at least it restores the ability to run our cqlpy tests against Cassandra 5. Also, this patch adds to test/cqlpy/README.md simple instructions on how to install Java 11 (in addition to the system's default Java 21) on Fedora 42. Doing this is very easy and very recommended because it restores the ability to run Cassandra 3 and 4, not just Cassandra 5. Fixes #25822. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25825	2025-09-17 17:24:47 +03:00
Botond Dénes	85f6eeda30	Merge 'compaction/scrub: register sstables for compaction before validation' from Lakshmi Narayanan Sreethar compaction/scrub: register sstables for compaction before validation When `scrub --validate` runs, it collects all candidate sstables at the start and validates them one by one in separate compaction tasks. However, scrub in validate mode does not register these sstables for compaction, which allows regular compaction to pick them up and potentially compact them away before validation begins. This leads to scrub failures because the sstables can no longer be found. This patch fixes the issue by first disabling compaction, collecting the sstables, and then registering them for compaction before starting validation. This ensures that the enqueued sstables remain available for the entire duration of the scrub validation task. Fixes #23363 This reported scrub failure occurs on all versions that have the checksum/digest validation feature for uncompressed sstables. So, backport it to older versions. Closes scylladb/scylladb#26034 * github.com:scylladb/scylladb: compaction/scrub: register sstables for compaction before validation compaction/scrub: handle exceptions when moving invalid sstables to quarantine	2025-09-17 17:22:00 +03:00
Piotr Smaron	bdb90ee15c	set ssl_* columns in system.clients Depends on https://github.com/scylladb/seastar/pull/2651 Missing columns have been present since probably forever - they were added to the schema but never assigned any value: ``` cqlsh> select * from system.clients; ------------------+------------------------ ... ssl_cipher_suite \| null ssl_enabled \| null ssl_protocol \| null ... ``` This patch sets values of these columns: - with a TLS connection, the 3 TLS-related fields are filled in, - without TLS, `ssl_enabled` is set to `false` and other columns are `null`, - if there's an error while inspecting TLS values, the connection is dropped. We want to save the TLS info of a connection just after accepting it, but without waiting for a TLS handshake to complete, so once the connection is accepted, we're inspecting it in the background for the server to be able to accept next connections immediately. Later, when we construct system.clients virtual table, the previously saved data can be instantaneously assigned to client_data, which is a struct representing a row in system.clients table. This way we don't slow down constructing this table by more than necessary, which is relevant for cases with plenty of connections. Fixes: #9216 Closes scylladb/scylladb#22961	2025-09-17 16:29:55 +03:00
Nadav Har'El	3c0032deb4	alternator: fix bug in combination of AttributeUpdates + ReturnValues In test/alternator/test_returnvalues.py we had tests for the ReturnValues feature on UpdateItem requests - but we only tested UpdateItem requests with the "modern" UpdateExpression, and forgot to test the combination of ReturnValues with the old AttributeUpdates API. It turns out this combination is buggy: when both ReturnValues=ALL_OLD and AttributeUpdates need the previous value of the item, we may wrongly std::move() the value out, and the operation will fail with a strange error: An error occurred (ValidationException) when calling the UpdateItem operation: JSON assert failed on condition 'IsObject()' The fix in this patch is trivial - just move the std::move() to the correct place, after both UpdateExpression and AttributeUpdates handling is done. This patch also includes a reproducing test, which fails before this patch and passes with it - and of course passes on DynamoDB. This test reproduces two cases where the bug happened, as well as one case where it didn't (to make sure we don't regress in what already worked). Fixes #25894 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25900	2025-09-17 16:04:01 +03:00
Michael Litvak	ef0b34ec9d	docs/dev: update CDC dev docs for tablets	2025-09-17 14:47:13 +02:00
Michael Litvak	acd0eebd54	doc: update CDC docs for tablets Now that CDC is enabled for tablets-based keyspaces, update the docs and add explanations about the differences.	2025-09-17 14:47:13 +02:00
Michael Litvak	221a687ec9	test: cluster_events: enable add_cdc and drop_cdc add_cdc and drop_cdc were skipped because CDC wasn't supported with tablets. now that CDC is supported with tablets we should unskip it.	2025-09-17 14:47:13 +02:00
Michael Litvak	73d05c7214	test/cql: enable cql cdc tests to run with tablets	2025-09-17 14:47:13 +02:00
Michael Litvak	001c55c213	test: test_cdc_with_alter: adjust for cdc with tablets previously the test set tablets to disabled because cdc wasn't supported with tablets. now we can change this to use the default to enable it to run with either tablets or vnodes.	2025-09-17 14:47:13 +02:00
Michael Litvak	778dec2630	test/cqlpy: adjust cdc tests for tablets update cdc-related tests in test/cqlpy for cdc with tablets. * test_cdc_log_entries_use_cdc_streams: this test depends on the implementation of the cdc tables, which is different for tablets, so it's changed to run for both vnodes and tablets keyspaces, and we add the implementation for tablets. * some cdc-related tests are unskipped for tablets so they will be run with both tablets and vnodes keyspaces. these are tests where the implementation may be different between tablets and vnodes and we want to have coverage of both. * other cdc-related tests do not depend on the implementation differences between tablets and vnodes, so we can just enable them to run with the default configuration. previously they were disabled for tablets keyspaces because it wasn't supported, so now we remove this.	2025-09-17 14:47:13 +02:00
Michael Litvak	5a87d0f6c9	test/cluster/test_cdc_with_tablets: introduce cdc with tablets tests Introduce basic tests creating CDC tables in tablets-enabled keyspaces, verifying we can create and drop CDC tables, write and consume CDC log entries, and consume the log while splitting streams.	2025-09-17 14:47:13 +02:00
Michael Litvak	1fc3273b27	cdc: enable cdc with tablets Allow to create CDC tables in a tablets-enabled keyspace when all nodes in the cluster support the cdc_with_tablets feature. Fixes scylladb/scylladb#22576	2025-09-17 14:47:12 +02:00
Michael Litvak	98de3f0e86	topology coordinator: change streams on tablet split/merge on tablet split/merge finalization, generate a new CDC timestamp and stream set for the table with a new stream for each tablet in the new tablet map, in order to maintain synchronization of the CDC streams with the tablets. We pick a new timestamp for the streams with a small delay into the future so that all nodes can learn about the new streams in time, in the same way it's done for vnodes. the new timestamp and streams are published by adding a mutation to the cdc_streams_history table that contains the timestamp and the sets of closed and opened streams from the current timestamp.	2025-09-17 14:47:12 +02:00
Michael Litvak	b8442c6087	cdc: virtual tables for cdc with tablets Define two new virtual tables in system keyspace: cdc_timestamps and cdc_streams. They expose the internal cdc metadata for tablets-enabled keyspace to be consumed by users consuming the CDC log. cdc_timestamps lists all timestamps for a table where a stream change occurred. cdc_streams additionally lists the current stream sets for each table and timestamp, as well as the difference - closed and opened streams - from the previous stream set.	2025-09-17 14:47:12 +02:00
Michael Litvak	67410cac4d	cdc: generate_stream_diff helper function This helper function receives two sets of streams and constructs their difference - closed and opened streams.	2025-09-17 14:47:12 +02:00
Michael Litvak	5f6bb0af9d	cdc: choose stream in tablets enabled keyspaces When choosing a CDC stream to generate CDC log writes to, if the keyspace uses tablets, we need to choose a stream according to the relevant metadata which is specific to tablets-enabled keyspaces. We define the method get_tablet_stream that given a table, write timestamp, and token, returns the stream that the log entry should be written to. The method works by looking up the stream metadata of the table, then finding the relevant stream set by timestamp, and finally finding the stream that covers the token range that contains the token.	2025-09-17 14:47:12 +02:00
Michael Litvak	28cdd81ef0	cdc: rename get_stream to get_vnode_stream the get_stream method is relevant only for vnode-based keyspaces. next we will introduce a new method to get a stream in a tablets-based keyspace. prepare for this by renaming get_stream to get_vnode_stream.	2025-09-17 14:47:12 +02:00
Michael Litvak	9ec4b6ccb1	cdc: load tablet streams metadata from tables Read the CDC stream metadata from the internal system tables, and store it in the cdc metadata data structures. The metadata is stored in the tables as diffs which is more storage efficient, but when in-memory we store it as full stream sets for each timestamp. This is more useful because we need to be able to find a stream given timestamp and token.	2025-09-17 14:47:12 +02:00
Michael Litvak	aa61f074b5	cdc: helper functions for reading metadata from tables Define functions in system_keyspace that read from the internal cdc tables and construct the data into internal cdc data structures.	2025-09-17 14:47:12 +02:00
Michael Litvak	650ae30c97	cdc: colocate cdc table with base When creating a tablet map for a CDC table, make it be co-located with its base table. We modify db::get_base_table_for_tablet_colocation to return the base table id of a CDC table, handling both cases that the base table is a new table that's created in the same operation, or is an existing table in the db. This function is used by the tablet allocator to decide whether to create a co-located tablet map or allocate new tablets.	2025-09-17 14:47:12 +02:00
Michael Litvak	9ef0862155	cdc: remove streams when dropping CDC table When dropping a CDC log table in a tablets-enabled keyspace, remove all metadata about the table's CDC streams from the internal CDC tables, since the streams can't be read anymore. Similarly, when dropping a tablets-enabled keyspace, remove metadata of all streams belonging to tables in the keyspace.	2025-09-17 14:47:12 +02:00
Michael Litvak	ed25e420f8	cdc: create streams when allocating tablets When allocating tablets for a CDC table, create the initial CDC stream set. We create one stream per each tablet, each stream covering the corresponding token range.	2025-09-17 14:47:12 +02:00
Michael Litvak	7f2cd06bdc	migration_listener: add on_before_allocate_tablet_map notification Add a new notification on_before_allocate_tablet_map that is called when creating a tablet map for a new table and passes the tablet map. This will be useful next for CDC, for example: when creating tablets for a new table we want to create CDC streams for each tablet in the same operation, and we need to have the tablet map with the tablet count and tokens for each tablet, because the CDC streams are based on that. We need to change slightly the tablet allocation code for this to work with colocated tables, because previously when we created the tablet map of a colocated table we didn't have a reference to the base tablet map, but now we do need it so we can pass it to the notification.	2025-09-17 14:47:11 +02:00
Michael Litvak	fdfe9ebb4c	cdc: notify when creating or dropping cdc table When creating a CDC table by updating an existing base table and enabling CDC, notify about the table creation so subscribers can act on it. This is needed in particular for notifying the tablet allocator when creating a CDC table so that it will allocate tablets for the CDC table. Also, when dropping a CDC table, notify about the dropped table. This is needed for the tablet allocator to remove the tablet map of the CDC table.	2025-09-17 14:47:11 +02:00
Michael Litvak	fed1048059	cdc: move cdc table creation to pre_create When creating a new table with CDC enabled, we create also a CDC log table by adding the CDC table's mutations in the same operation. Previously, it worked by the CDC log service subscribing to on_before_create_column_family and adding the CDC table's mutations there when being notified about a newly created table. The problem is that when we create the tables we also create their tablet maps in the tablet allocator, and we want to create the two tables as co-located tables: we allocate a tablet map for the base table, and the CDC table is co-located with the base table. This doesn't work well with the previous approach because the notification that creates the CDC table is the same notification that the tablet allocator creates the base tablet map, so the two operations are independent, but really we want the tablet allocator to work on both tables together, so that we have the base table's schema and tablet map when we create the CDC table's co-located tablet map. In order to achieve this, we want to create and add the CDC table's schema, and only after that notify using before_create_column_families with a vector that contains both the base table and CDC table. The tablet allocator will then have all the information it needs to create the co-located tablet map. We move the creation of the CDC log table - instead of adding the table's mutations in on_before_create_column_family, we create the table schema and add it to the new tables vector in on_pre_create_column_families, which is called by the migration manager in do_prepare_new_column_families_announcement. The migration manager will then create and add all mutations for creating the tables, and notify about the tables being created together.	2025-09-17 14:47:11 +02:00
Michael Litvak	b9ee28eaab	cdc: add internal tables for cdc with tablets Add new group0-based tables in system keyspace to be used for cdc with tablets: * cdc_streams_state - describing "base" state of CDC streams for each table - an initial timestamp and a stream set. * cdc_streams_history - describing following committed stream sets by diffs (opened / closed streams) from the previous set.	2025-09-17 14:47:11 +02:00
Michael Litvak	5f1caebcc7	cdc: add cdc_with_tablets feature flag add a new feature flag cdc_with_tablets to protect the schema changes that are required for the CDC with tablets feature. we will also use it to allow start using CDC in tablets-based keyspaces only once all nodes are upgraded and support this feature.	2025-09-17 14:47:11 +02:00
Michael Litvak	daf200facb	cdc: add is_log_schema helper In a few places we need to check whether a schema represents a CDC log table, and we do so by checking whether the table's partitioner is the CDC partitioner. Extract this logic to a new utility function to reduce code duplication and allow reuse.	2025-09-17 14:47:11 +02:00
Piotr Dulikowski	6a90a1fd29	Merge 'db/view/view_building_worker: split batch's data preparation and execution' from Michał Jadwiszczak The view building batch lives on shard0 but it might be doing work on shard which owns the tablet replica. Until now the batch data was accessed from multiple shards (shard0 and where the batch was executed). This patch fixes this by splitting tasks execution into: - preparation which is always happening on shard0 - actual execution of the tasks on relevant shard, but all necessary data is copied to the shard and batch object isn't accessed. Fixes https://github.com/scylladb/scylladb/issues/25804 View building coordinator hasn't been released yet, so no backport needed. Closes scylladb/scylladb#26058 * github.com:scylladb/scylladb: db/view/view_building_worker: move try-catch outside `invoke_on()` db/view/view_building_worker: split batch's data preparation and execution	2025-09-17 14:17:25 +02:00
Botond Dénes	30a3f61fa0	Merge 'compaction: handle exception in expected_total_workload' from Aleksandra Martyniuk expected_total_workload methods of scrub compaction tasks create a vector of table_info based on table names. If any table was already dropped, then the exception is thrown. It leaves table_info in corrupted state and node crashes with `free(): invalid size`. Return std::nullopt if an exception was thrown to indicate that total workload cannot be found. Fixes: #25941. No release branches affected Closes scylladb/scylladb#25944 * github.com:scylladb/scylladb: tasks: get progress of failed task based on children compaction: handle exception in expected_total_workload	2025-09-17 15:10:19 +03:00
Nadav Har'El	e322902506	Merge 'index, metrics: add per-index metrics' from Michał Hudobski This patch adds the possibility to track metrics per secondary index. Currently, only a histogram of query latencies is tracked, but more metrics can be added in the future. To add a new metric, it needs to be added to the index_metrics struct in index/secondary_index_manager.hh and then initialized in index/secondary_index_manager.cc in the constructor of the index_metrics struct. The metrics are created when the index is created and removed when the index is dropped. First lines of the new metric: # HELP scylla_index_query_latencies Index query latencies # TYPE scylla_index_query_latencies histogram scylla_index_query_latencies_sum{idx="test_i_idx",ks="test"} 640 scylla_index_query_latencies_count{idx="test_i_idx",ks="test"} 1 scylla_index_query_latencies_bucket{idx="test_i_idx",ks="test",le="640.000000"} 1 scylla_index_query_latencies_bucket{idx="test_i_idx",ks="test",le="768.000000"} 1 Fixes: https://github.com/scylladb/scylladb/issues/25970 Closes scylladb/scylladb#25995 * github.com:scylladb/scylladb: test: verify that the index metric is added index, metrics: add per-index metrics	2025-09-17 14:54:12 +03:00
Michał Chojnowski	b7afda5030	sstables/mx/reader: remove mx::make_reader_with_index_reader When `mx::make_reader` is used to construct an sstable reader, it constructs its own index reader internally. `mx::make_reader_with_index_reader` was originally added as a variant of `mx::make_reader` which can be used to inject a custom `index_reader` for testing that the mx Data reader tolerates inexact indexes. But now we want the ability to choose between BIG index readers and BTI index readers if both are present. And at this point, it seems to me that it makes sense to just construct the index reader in the caller and pass it via argument to `mx::make_reader` instead of putting the index selection inside it. So that's what we do in this patch. And we remove `mx::make_reader_with_index_reader` because it's no longer different from `mx::make_reader`.	2025-09-17 12:22:41 +02:00
Michał Chojnowski	f7d7722baa	test/boost/bti_index_test: fix indentation Fix an indentation mishap. Purely cosmetic patch.	2025-09-17 12:22:41 +02:00
Michał Chojnowski	191405fc51	sstables/trie/bti_index_reader: in last_block_offset(), return offset from the beginning of partition, not file Before this patch, `bti_index_reader::last_block_offset()` returns the offset of the last block within the file. But the old `index_reader::last_block_offset()` returns the offset within the partition, and that's what the callers (i.e. reversed sstable reader) expect. Fix `bti_index_reader::last_block_offset()` (and the corresponding comment and test) to match `index_reader::last_block_offset()`.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	1f85069389	sstables/trie: support reader_permit and trace_state properly Before this patch, the `reader_permit` taken by `bti_index_reader` wasn't actually being passed down to disk reads. In this patch, we fix this FIXME by propagating the permit down to the I/O operations on the `cached_file`. Also, it didn't take `trace_state_ptr` at all. In this patch, we add a `trace_state_ptr` argument and propagate it down to disk reads. (We combine the two changes because the permit and the trace state are passed together everywhere anyway).	2025-09-17 12:22:40 +02:00
Michał Chojnowski	c9b0dbc580	sstables/trie/bti_node_reader: avoid calling into `cached_file` if the target position is already cached Small optimization. In some places we call `load` on a position that is in the currently-held page. In those cases we are needlessly calling `cached_file::get_shared_page` for the same page again, adding some work and some noise in CQL tracing. This patch adds an `if` against that.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	caecc02d75	sstables/trie/bti_index_reader: get rid of the seastar::file wrapper in read_row_index_header Before this patch, the "stack" of wrappers in `read_row_index_header` is: - row_index_header_parser - continuous_data_consumer - input_stream - file_data_source_impl - cached_file_impl - cached_file::stream - cached_file The `cached_file_impl` and `file_data_source_impl` are superfluous. We don't need to pretend the `cached_file` is a `seastar::file`, in particular we don't need random reads. We can use `cached_file::stream` to provide buffers for `input_stream` directly. Note: we use the `cached_file::stream` without any size hints (readahead). This means that parsing a large partition key -- which spans many pages -- might require multiple serialized disk fetches. This could be improved (e.g. if the first two bytes of the entry, which contain the partition key, are in the cached page, we could make a size hint out of them) but we ignore this for now, under the assumption that multi-page partition keys are a fringe use case.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	98b7655d2b	sstables/trie/bti_index_reader: support BYPASS CACHE Before this patch, `bti_index_reader` doesn't have a good way to implement BYPASS CACHE. In this patch we add a way, similar to what `index_reader` does: we allow the caller to pass in the `cached_file` via a shared pointer. If the caller wants the loads done by the index reader to remain cached, he can pass in the `cached_file` owned by the `sstable`, shared by all caching index readers. If the caller doesn't want the loads to remain cached, he can pass in a fresh `cached_file` which will be privately owned by the index reader, and will be evicted when the index reader dies.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	95c93568f7	test/boost/bti_index_test: use read_bti_partitions_db_footer where appropriate Use the helper instead of reading the needed footer field "manually".	2025-09-17 12:22:40 +02:00
Michał Chojnowski	5934639c4b	sstables/trie: change the signature of bti_partition_index_writer::finish Let's return `bti_partitions_db_footer` so that it can be directly saved to `sstables::shareable_components` after the index write is finished, without re-reading the footer from the file. Let's take `const sstables::key&` arguments instead of `disk_string_view<uint16_t>`, that's more natural.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	0b0f6a1cc4	sstables/bti_index: improve signatures of special member functions in index writers Just a bit of constructor boilerplate: noexcept, move constructors, `bool` operators for `optimized_optional`.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	421fb8e722	streaming/stream_transfer_task: coroutinize `estimate_partitions()` In preparation for making `sstable::estimated_keys_for_range` asynchronous.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	bf90018b8e	types/comparable_bytes: add a missing implementation for date_type_impl date_type_impl is like timestamp_type_impl, but unsigned.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	46d8fd5bbd	sstables: remove an outdated FIXME Bloom filters were implemented 10 years ago. We can remove this FIXME now.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	6cb6c1e400	storage_service: delete `get_splits()` Dead code. Thrift API leftovers.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	e8f86315c7	sstables/trie: fix some comment typos in bti_index_reader.cc Spell checkers are complaining.	2025-09-17 12:22:40 +02:00
Michał Chojnowski	416f5d64d4	sstables/mx/writer: rename _pi_write_m.tomb to partition_tombstone With `tomb` it's not obvious enough what tombstone this field is about.	2025-09-17 12:22:40 +02:00
Szymon Malewski	776f90e2f8	alternator/expressions.g: Fix antlr3 missing token leak This patch overrides the antlr3 function that allocates the missing tokens that would eventually leak. The override stores these tokens in a vector, ensuring memory is freed whenever the parser is destroyed. Solution is copied from CQL implementation. A unit test to reproduce the issue is added - leak would be reported by ASAN, when running this test in debug mode - the test passed but the leak is discovered when the test file exits. Fixes #25878 Closes scylladb/scylladb#25930	2025-09-17 13:05:24 +03:00
Abhinav Jha	43656371cf	raft_topology: Modify the conditional logic in remove node operation to enhance concurrency for raft enabled clusters. In the current scenario, the shard receiving the remove node REST api request performs a conditional lock depending on whether raft is enabled or not. Since a non-zero shard returns false for `raft_topology_change_enabled()`, the requests routed to non-zero shards are prone to this lock, which is unnecessary and hampers the ability to perform concurrent operations, which is possible for raft enabled nodes. This pr modifies the conditional lock logic and orchestrates the remove node execution logic directly to the shard0, hence the `raft_topology_change_enabled()` is now checked on the shard0 and execution is performed accordingly. A test is also added to confirm the new behaviour, where concurrent remove node operations are now being performed seamlessly. This pr doesn't fix a critical bug. No need to backport it. Fixes: scylladb/scylladb#24737	2025-09-17 15:23:32 +05:30
Botond Dénes	2fa0f82910	tools: extract json_mutation_stream_parser to its own hh,cc files tools/scylla-sstable.cc has 3.5k SLOC, out of which this class alone is 1K. Extract into own hh and cc, just a copy-paste after the preparation commit.	2025-09-17 12:18:07 +03:00
Botond Dénes	ffe8918522	tools/scylla-sstable: fix indentation Left broken by previous patch.	2025-09-17 12:16:22 +03:00
Botond Dénes	8c36a983cc	tools/scylla-sstable: prepare for extracting json_mutation_stream_parser Make methods out-of-line, so class declaration stands on its own, without definition of impl. Move auxiliary structures, used only by impl, out of the class scope. Move parser to tools namespace, and auxiliaries to anonymous namespace within the tools one. Pass down logger ref to parser impl and below, to prepare for sst_log not being available in scope. Add comment to parser class explaining what it does.	2025-09-17 12:16:21 +03:00
Benny Halevy	3a6208b319	utils: stall_free: clear_gently: release wrapped objects As discussed in https://github.com/scylladb/scylladb/pull/24606#discussion_r2281870939 clear_gently of shared pointers should release the wrapped object reference and when the object's use_count reaches 1, the object itself would be cleared_gently, before it's destroyed. This behavior is similar to the way we clear gently containers like arrays or vectors, and so it is extended in this patch to smart pointers like unique_ptr and foreign_ptr. The unit tests are adjusted respectively to expect the smart pointers to be reset after clear_gently, plus the use of `reset()` for `foreign_ptr<shared_ptr<>>` was replaced by `clear_gently().get()` which now ensures the reference to a shared object is released, and awaited for, if it happens on a foreign owner shard, unlike reset of a foreign_ptr that kicks off destroy of that shared object in the background on the owner shard - causing flakiness. Fixes #25723 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25759	2025-09-17 11:44:26 +03:00
Patryk Jędrzejczak	454eb08cb4	Merge 'group0: remove obsolete "stop_before_becoming_raft_voter" error injection' from Emil Maskovsky The Raft topology workflow was changed by the limited voters feature: nodes no longer request votership themselves. As a result, the "stop_before_becoming_raft_voter" error injection is now obsolete and has been removed. Fixes: scylladb/scylladb#23418 No backport: This re-enables a test, only needed for master. Closes scylladb/scylladb#26042 * https://github.com/scylladb/scylladb: group0: remove obsolete "stop_before_becoming_raft_voter" error injection test/random_failures: preserve test repeatability when removing error injections	2025-09-17 10:38:32 +02:00
Michał Jadwiszczak	d98237b33c	db/view/view_building_worker: move try-catch outside `invoke_on()` It's just a stylistic change; to me, doing `invoke_on()` in a try-catch block looks better than the other way.	2025-09-16 23:15:44 +02:00
Michał Jadwiszczak	9458ceff8f	db/view/view_building_worker: split batch's data preparation and execution The view building batch lives on shard0 but it might be doing work on shard which owns the tablet replica. Until now the batch data was accessed from multiple shards (shard0 and where the batch was executed). This patch fixes this by splitting tasks execution into: - preparation which is always happening on shard0 - actual execution of the tasks on relevant shard, but all necessary data is copied to the shard and batch object isn't accessed. Fixes scylladb/scylladb#25804	2025-09-16 23:13:36 +02:00
Patryk Jędrzejczak	368d70ee15	Merge 'LWT: implement fencing' from Petr Gusev This PR consists of three parts: * Small refactoring of the fencing APIs in storage_proxy (renames + comments + some functions were extracted) * Implement the fencing for LWT verbs itself. This includes checking the fencing token before and after local replica data accesses. * Two new `test.py` tests in `test_fencing.py`, which check the fencing in some real-world scenarios. Backport: no need -- fencing for LWT requests is needed primarily for LWT over tablets, which is not released yet. Fixes scylladb/scylladb#22332 Closes scylladb/scylladb#25550 * https://github.com/scylladb/scylladb: test_tablets_lwt: eliminate redundant disable_tablet_balancing test_fencing: add test_lwt_fencing_upgrade pylib: extract upgrade helpers from test_sstable_compression_dictionaries_upgrade.py test_fencing: add test_fenced_out_on_tablet_migration_while_handling_paxos_verb test_fencing: test_fence_lwt_during_bootstap pylib/rest_client.py: encode injection name storage_proxy_stats: add fenced_out_requests metric storage_proxy: add fencing to Paxos verbs storage_proxy::apply_fence: add overload that throws on failure storage_proxy: extract apply_fence_result sp::apply_fence: rename to apply_fence_on_ready sp::apply_fence: rename to check_fence sp::apply_fence: make non-generic	2025-09-16 23:40:48 +03:00
Ernest Zaslavsky	d624413ddd	treewide: Move query related files to a new `query` directory As requested in #22120, moved the files and fixed other includes and build system. Moved files: - query.cc - query-request.hh - query-result.hh - query-result-reader.hh - query-result-set.cc - query-result-set.hh - query-result-writer.hh - query_id.hh - query_result_merger.hh Fixes: #22120 This is a cleanup, no need to backport Closes scylladb/scylladb#25105	2025-09-16 23:40:47 +03:00
Pavel Emelyanov	6fb66b796a	s3: Add metrics to show S3 prefetch bytes The chunked download source sends large GET requests and then consumes data as it arrives. Sometimes it can stop reading from socket early and drop the in-flight data. The existing read-bytes metrics show only the number of consumed bytes, but we also want to know the number of requested bytes. Refs #25770 (accounting of read-bytes) Fixes #25876 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25877	2025-09-16 23:40:47 +03:00
Sergey Zolotukhin	2640b288c2	raft: disable caching for raft log. This change disables caching for raft log table due to the following reasons: * Immediate reason is a deficiency in handling emerging range tombstones in the cache, which causes stalls. * Long-term reason is that sequential reads from the raft log do not benefit from the cache, making it better to bypass it to free up space and avoid stalls. Fixes scylladb/scylladb#26027 Closes scylladb/scylladb#26031	2025-09-16 23:40:47 +03:00
Pavel Emelyanov	d69a51f42a	compaction: Use function when filtering compaction tasks for stopping The compaction_manager::stop_compaction() method internally walks the list of tasks and compares each task's compacting_table (which is compaction group view pointer) with the given one. In case this stop_compaction() method is called via API for a specific table, the method walks the list of tasks for every compaction group from the table, thus resulting in nr_groups * nr_tasks complexity. Not terrible, but not nice either. The proposal is to pass filtering function into the inner do_stop_ongoing_compactions() method. Some users will pass a simple "return true" lambda, but those that need to stop compactions for a specific table (e.g. -- the API handler) will effectively walk the list of tasks once comparing the given compaction group's schema with the target table one (spoiler: eventually this place will also be simplified not to mess with replica::table at all). One ugliness with the change is the way "scope" for logging message is collected. If all tasks belong to the same table, then "for table ..." is printed in logs. With the change the scope is no longer known instantly and is evaluated dynamically while walking the list of tasks. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25846	2025-09-16 23:40:47 +03:00
Michał Chojnowski	68e6141211	scylla-gdb: add `scylla prepared-statements` Add a helper which prints all prepared statements currently present in the query processor. Example output: ``` (gdb) scylla prepared-statements (cql3::cql_statement*)(0x600003d71050): SELECT * FROM ks.ks WHERE pk = ? (cql3::cql_statement*)(0x600003972b50): SELECT pk FROM ks.ks WHERE pk = ? ``` Closes scylladb/scylladb#26007	2025-09-16 23:40:47 +03:00
Botond Dénes	0cf6a648bb	Merge 'Default create keyspace syntax' from Dario Mirovic Allow for the following CQL syntax: ``` CREATE KEYSPACE [IF NOT EXISTS] <name>; ``` for example: ``` CREATE KEYSPACE test_keyspace; ``` With this syntax all the keyspace's parameters would be defaulted to: replication strategy = `NetworkTopologyStrategy`; replication factor = number of racks, but excluding racks that only have arbiter nodes; storage options, durable writes = defaults we normally would use; tablets enabled if they are enabled in the db configuration, e.g. scylla.yaml or db/config.cc by default. Options besides `replication` already have defaults. `replication` had to be specified, but it could be an empty set, where defaults for sub-options (replication strategy and replication factor) would be used - `replication = {}`. Now there is no need for specifying an empty set - omitting `replication = {}` has the same effect as `replication = {}`. Since all the options now have defaults, `WITH` is optional for `CREATE KEYSPACE` statement. Fixes #25145 This is an improvement, no backport needed. Closes scylladb/scylladb#25872 * github.com:scylladb/scylladb: docs: cql: default create keyspace syntax test: cqlpy: add test for create keyspace with no options specified cql: default `CREATE KEYSPACE` syntax	2025-09-16 23:40:47 +03:00
Emil Maskovsky	943af1ef1c	topology_coordinator: consistently rethrow `raft::request_aborted` for direct/global commands Ensure all direct and global topology commands rethrow the `raft::request_aborted` exception when aborted, typically due to leadership changes. This makes abortion explicit to callers, enabling proper handling such as retries or workflow termination. This change completes the work started in PR scylladb/scylladb#23962, covering all remaining cases where the exception was not rethrown. Fixes: scylladb/scylladb#23589 No backport: No related issues observed in previous versions; backport not required. Closes scylladb/scylladb#26021	2025-09-16 23:40:47 +03:00
Emil Maskovsky	87bd328873	group0: remove obsolete "stop_before_becoming_raft_voter" error injection The Raft topology workflow was changed by the limited voters feature: nodes no longer request votership themselves. As a result, the "stop_before_becoming_raft_voter" error injection is now obsolete and has been removed. Fixes: scylladb/scylladb#23418	2025-09-16 18:24:27 +02:00
Emil Maskovsky	0453052d66	test/random_failures: preserve test repeatability when removing error injections The order of entries in the ERROR_INJECTIONS list determines test repeatability for a given random seed. To allow removing error injections without affecting the order of the remaining ones, removed injections are now renamed with a "REMOVED_" prefix instead of being deleted. This ensures they are ignored by the tests, while the sequence of active injections—and thus test reproducibility—remains unchanged.	2025-09-16 18:22:45 +02:00
Michał Hudobski	3364cc96f5	test: verify that the index metric is added This commit adds a test that performs a sanity check that the implemented metric is actually being added to Scylla's metrics and has the correct value.	2025-09-16 18:10:01 +02:00
Aleksandra Martyniuk	3324f08e9c	tasks: get progress of failed task based on children Currently, for failed tasks task_manager::task::impl::get_progress attempts to find expected_total_workload. However, if the task has finished a long time ago, the state might have totally changed, e.g. some tables might have been dropped or have changed their sizes. Due to that, the result of expected_total_workload might be irrelevant. Count the progress of a finished task based on children only, regardless of whether the task has succeeded or failed.	2025-09-16 17:15:01 +02:00
Aleksandra Martyniuk	17e9ec11d7	db: fix indentation	2025-09-16 14:49:54 +02:00
Aleksandra Martyniuk	0024339a71	db: cache: consider preempting after each partition Currently, during cache invalidation we check if we need to preempt only after the partition gets invalidated. This may lead to stalls if we have a chain of filtered out partitions. Check for preemption even if the partition does not get invalidated. Refs: #9136.	2025-09-16 14:45:28 +02:00
Cezar Moise	e9be1e7b35	test: cleanup big mutation commitlog tests - fix typos - improve comments - remove false and misleading comments - remove `disableautocompaction` as it did nothing for the test and the comment with it was false	2025-09-16 15:33:23 +03:00
Cezar Moise	492b4cf71c	test: fix test_one_big_mutation_corrupted_on_startup The commitlog in the tests with big mutations was corrupted by overwriting 10 chunks of 1KB with random data, which might not be enough due to randomness and the big size of the commitlog (~65MB). Change `corrupt_file` to overwrite based on a percentage of the file's size instead of a fixed number of chunks. refs: #25627	2025-09-16 15:32:44 +03:00
Nikos Dragazis	58e8142a06	utils: azure: Catch system errors when detecting IMDS When the default credential provider probes IMDS to check its availability, it assumes that application-level connection timeouts are the only error that can occur when the node is not an Azure VM, i.e., the packets will be silently dropped somewhere in the network. However, this has proven not always true for the `test_azure_provider_with_incomplete_creds` unit test, which overrides the default IMDS endpoint with a non-routeable IP from TEST-NET-1 [1]. This test has been reported to fail in some local setups where routers respond with ICMP "host unreachable" errors instead of silently dropping the packets. This error propagates to user space as an EHOSTUNREACH system error, which is not caught by the default credential provider, causing the test to fail. The reason we use a non-routeable address in this test is to ensure that IMDS probing will always fail, even if running the test on an Azure VM. Theoretically, the same problem applies to the default IMDS endpoint as well (169.254.169.254). The RFC 3927 [2] mandates that packets targeting link-local addresses (169.254/16) must not be forwarded, but the exact behavior is left to implementation. Since we cannot predict how routers will behave, fix this by catching all relevant system errors when probing IMDS. [1] https://datatracker.ietf.org/doc/html/rfc5735 [2] https://datatracker.ietf.org/doc/html/rfc3927 Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-16 15:27:59 +03:00
Nikos Dragazis	78bcecd570	utils: azure: Bump default credential logs from DEBUG to INFO The default credential provider produces diagnostic logs on each step as it walks through the credential chain. These logs are useful for operators to diagnose authentication problems as they expose information about which credential sources are being evaluated, in which order, why they fail, and which source is eventually selected. Promote them from DEBUG to INFO level. Additionally, concatenate the logs for environment credentials into a single log statement to avoid interleaving with other logs. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-09-16 15:20:52 +03:00
Michał Hudobski	b09d1f0a98	index, metrics: add per-index metrics This patch adds the possibility to track metrics per secondary index. Currently, only a histogram of query latencies is tracked, but more metrics can be added in the future. To add a new metric, it needs to be added to the index_metrics struct in index/secondary_index_manager.hh and then initialized in index/secondary_index_manager.cc in the constructor of the index_metrics struct. The metrics are created when the index is created and removed when the index is dropped. First lines of the new metric: # HELP scylla_index_query_latencies Index query latencies # TYPE scylla_index_query_latencies histogram scylla_index_query_latencies_sum{idx="test_i_idx",ks="test"} 640 scylla_index_query_latencies_count{idx="test_i_idx",ks="test"} 1 scylla_index_query_latencies_bucket{idx="test_i_idx",ks="test",le="640.000000"} 1 scylla_index_query_latencies_bucket{idx="test_i_idx",ks="test",le="768.000000"} 1	2025-09-16 14:03:43 +02:00
Lakshmi Narayanan Sreethar	7cdda510ee	compaction/scrub: register sstables for compaction before validation When `scrub --validate` runs, it collects all candidate sstables at the start and validates them one by one in separate compaction tasks. However, scrub in validate mode does not register these sstables for compaction, which allows regular compaction to pick them up and potentially compact them away before validation begins. This leads to scrub failures because the sstables can no longer be found. This patch fixes the issue by first disabling compaction, collecting the sstables, and then registering them for compaction before starting validation. This ensures that the enqueued sstables remain available for the entire duration of the scrub validation task. Fixes #23363 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-16 15:29:57 +05:30
Abhinav Jha	00327bdfd0	storage_service: remove assumptions and checks for ignore_nodes to be normal. The ignore nodes param should not be required to contain only normal nodes. This commit removes such assumptions and checks. The check that ignore nodes are present in the topology remains, however, and an error is thrown if an unrelated node is added to ignore_nodes.	2025-09-16 15:00:11 +05:30
Patryk Jędrzejczak	9efe250a8f	Merge 'gossiper: ensure gossiper operations are executed in gossiper scheduling group' from Sergey Zolotukhin Sometimes gossiper operations invoked from storage_service and other components run under a non-gossiper scheduling group. If these operations acquire gossiper locks, priority inversion can occur: higher-priority gossiper tasks may wait behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness or even failures. This patch ensures that gossiper operations requiring locks on gossiper structures are explicitly executed in the gossiper scheduling group. To help detect similar issues in the future, a warning is logged whenever a gossiper lock is acquired under a non-gossiper scheduling group. Fixes scylladb/scylladb#25907 Refs: scylladb/scylladb#25702 Backport: this patch fixes an issue with gossiper operations scheduling group, that might affect topology operations, therefore backport is needed to 2025.1, 2025.2, 2025.3 Closes scylladb/scylladb#25981 * https://github.com/scylladb/scylladb: gossiper: ensure gossiper operations are executed in gossiper scheduling group gossiper: fix wrong gossiper instance used in `force_remove_endpoint`	2025-09-16 10:14:15 +02:00
Asias He	54162a026f	scylla-nodetool: Add --incremental-mode option to cluster repair The `--incremental-mode` option specifies the incremental repair mode. Can be 'disabled', 'regular', or 'full'. 'regular': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to regular. Fixes #25931 Closes scylladb/scylladb#25969	2025-09-16 10:23:22 +03:00
Botond Dénes	ee7c85919e	Revert "treewide: seastar module update and fix broken rest client" This reverts commit `44d34663bc` of PR https://github.com/scylladb/scylladb/pull/25915. Breaks artifact tests on ARM, blocking us from building new images from master.	2025-09-16 08:31:08 +03:00
Lakshmi Narayanan Sreethar	84f2e99c05	compaction/scrub: handle exceptions when moving invalid sstables to quarantine In validate mode, scrub moves invalid sstables into the quarantine folder. If validation fails because the sstable files are missing from disk, there is nothing to move, and the quarantine step will throw an exception. Handle such exceptions so scrub can return a proper compaction_result instead of propagating the exception to the caller. This will help the testcase for #23363 to reliably determine if the scrub has failed or not. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-09-15 23:10:23 +05:30
Sergey Zolotukhin	6c2a145f6c	gossiper: ensure gossiper operations are executed in gossiper scheduling group Sometimes gossiper operations invoked from storage_service and other components run under a non-gossiper scheduling group. If these operations acquire gossiper locks, priority inversion can occur: higher-priority gossiper tasks may wait behind lower-priority tasks (e.g. streaming), which can cause gossiper slowness or even failures. This patch ensures that gossiper operations requiring locks on gossiper structures are explicitly executed in the gossiper scheduling group. To help detect similar issues in the future, a warning is logged whenever a gossiper lock is acquired under a non-gossiper scheduling group. Fixes scylladb/scylladb#25907	2025-09-15 12:49:07 +02:00
Petr Gusev	1d270020f2	test_tablets_lwt: eliminate redundant disable_tablet_balancing This is a refactoring commit.	2025-09-15 12:40:10 +02:00
Petr Gusev	7060265d5f	test_fencing: add test_lwt_fencing_upgrade This test verifies that upgrading to a Scylla version with LWT fencing does not disrupt existing LWT workloads.	2025-09-15 12:34:45 +02:00
Petr Gusev	49b036cf2b	pylib: extract upgrade helpers from test_sstable_compression_dictionaries_upgrade.py We want to reuse them to test the upgrade for LWT fencing.	2025-09-15 12:34:45 +02:00
Petr Gusev	82f0235e4b	test_fencing: add test_fenced_out_on_tablet_migration_while_handling_paxos_verb This test verifies that the fencing token is checked on replicas after the local Paxos state is updated. This ensures that if we failed to drain an LWT request during topology changes, the replicas where Paxos verbs got stuck won't contribute to the target CLs.	2025-09-15 12:34:45 +02:00
Petr Gusev	0156850605	test_fencing: test_fence_lwt_during_bootstap	2025-09-15 12:09:08 +02:00
Dawid Mędrek	18cb748268	docs/snitch: Document default DC and rack The existing article is already extensive and covers pretty much all of the details useful to the user. However, the document lacked minute information like the default names of the DC and rack in case of SimpleSnitch or it didn't explicitly specify the behavior of RackInferringSnitch (though arguably the existing example was more than sufficient). Fixes scylladb/scylladb#23528 Closes scylladb/scylladb#25700	2025-09-15 11:47:22 +02:00
Petr Gusev	92b165b8c0	pylib/rest_client.py: encode injection name Sometimes it's convenient to use slashes in injection names, for example my_component/my_method/my_condition. Without quote() we get 'handler not found' error from Scylla.	2025-09-15 11:24:53 +02:00
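The encoding described in the commit above can be sketched as follows. This is a minimal illustration, not the actual pylib code: the helper name `injection_url` and the path prefix are hypothetical, and `safe=""` is an assumption made here because `urllib.parse.quote` leaves `/` unescaped by default.

```python
from urllib.parse import quote

# Hypothetical path prefix, used for illustration only.
INJECTION_PREFIX = "/v2/error_injection/injection"


def injection_url(name: str) -> str:
    """Build a REST path with the injection name percent-encoded.

    safe="" makes quote() escape "/" as well, so a name such as
    "my_component/my_method/my_condition" stays a single path segment.
    """
    return f"{INJECTION_PREFIX}/{quote(name, safe='')}"


if __name__ == "__main__":
    print(injection_url("my_component/my_method/my_condition"))
```

With the slashes escaped, the server sees one path segment instead of three, which is the situation the "handler not found" error points to.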
Petr Gusev	819d59eeba	storage_proxy_stats: add fenced_out_requests metric We have to drop const qualifiers because now check_fence needs to mutate this metric.	2025-09-15 11:24:53 +02:00
Petr Gusev	6d7af84fed	storage_proxy: add fencing to Paxos verbs This commit adds fencing support to all Paxos verbs: * Pass an optional (for backward compatibility) fencing_token as a parameter to the prepare, accept, learn, and prune verbs. * Call apply_fence twice — before and after accessing local data. This ensures that if the coordinator is fenced out mid-request, the replica does not return success, which would otherwise incorrectly contribute to achieving the target CL. Without this, a user might observe successful writes that become unreadable after the topology operation completes. * For prune, call apply_fence only once because it does not return a response to the LWT coordinator. Fixes scylladb/scylladb#22332	2025-09-15 11:24:53 +02:00
Petr Gusev	ab750af711	storage_proxy::apply_fence: add overload that throws on failure This new apply_fence overload checks the fence and reports a failure by throwing a regular exception.	2025-09-15 11:24:53 +02:00
Petr Gusev	a2bde28efe	storage_proxy: extract apply_fence_result This commit refactors a repeated pattern that applies the fence and embeds the exception into the exception_variant class by extracting it into a separate method.	2025-09-15 11:24:53 +02:00
Petr Gusev	bdfea2fa4c	sp::apply_fence: rename to apply_fence_on_ready This overload performs the fence check only when the future is ready. In this commit, we give it a more descriptive name to better reflect its behavior. Additionally, we add extensive comments explaining the overall fencing scheme and the motivation behind this specific overload.	2025-09-15 11:24:53 +02:00
Petr Gusev	4a5c856d44	sp::apply_fence: rename to check_fence We plan to introduce several additional apply_fence overloads in upcoming commits. To avoid ambiguity, this change renames the existing base function to check_fence.	2025-09-15 10:56:20 +02:00
Petr Gusev	7fb5b2006b	sp::apply_fence: make non-generic It's simpler and more consistent to always use locator::host_id for caller_address. We also slightly reformulate the comment for sp.apply_fence here.	2025-09-15 10:56:20 +02:00
Michał Jadwiszczak	dc1ffd2c10	service/storage_service: drain `view_building_worker` earlier Similarly to view builder, view building worker needs to be drained in `storage_service::do_drain()`. Storage service drain happens at the very beginning of the shutdown procedure. Before this patch, the worker was still building views after the storage service was drained and this caused errors like: `Error applying view update to (named_gate_closed_exception)` and `locator::no_such_tablet_map`. Fixes scylladb/scylladb#25908 Closes scylladb/scylladb#25984	2025-09-15 11:29:19 +03:00
Gleb Natapov	d3badf7406	storage_service: change node_ops_info::ignore_nodes to host id This drops a useless translation from ID to IP during removenode through the topology coordinator. Closes scylladb/scylladb#25958	2025-09-15 10:18:24 +02:00
Sergey Zolotukhin	340413e797	gossiper: fix wrong gossiper instance used in `force_remove_endpoint` `gossiper::force_remove_endpoint` is always executed on shard 0 using `invoke_on`. Since each shard has its own `gossiper` instance, if `force_remove_endpoint` is called from a shard other than shard 0, `my_host_id()` may be invoked on the wrong `gossiper` object. This results in undefined behavior due to unsynchronized access to resources on another shard.	2025-09-15 08:54:59 +02:00
Aleksandra Martyniuk	55fde70f8d	api: tasks: task_manager: keep children identities in chunked_{array,vector} task_status contains a vector of children identities. If the number of children is large, we may hit oversized allocation. Change all types of children-related containers to chunked_vector. Modify the children type returned from task manager API. Fixes: scylladb#25795. Closes scylladb/scylladb#25923	2025-09-15 08:44:16 +03:00
Nadav Har'El	b4e3d4ac2f	alternator: nicer error message for integer overflow in list index In the DynamoDB API, when "a" is a list attribute, a[999] returns the 1000th element. But if the list isn't that long (e.g., it only has 5 elements), a[999] returns nothing - it's not an error. But it turns out that when the index is so large that it can't even be parsed as an integer, e.g., 99999999999999, DynamoDB does report an error: Invalid ProjectionExpression: List index is not within the allowable range; index: [99999999999999] Before this patch, Alternator also returned an error in this case, with the right type (ValidationException), but with a strange low-level error text: Failed parsing ProjectionExpression 'a[99999999999999]': std::out_of_range (stoi) The problem was that the code (in alternator/expressions.g) ran stoi() without converting its std::out_of_range exception to a better user-facing message. We do this in this patch, and the error message now looks like: Failed parsing ProjectionExpression 'a[99999999999999]': list index out of integer range This patch also includes a test reproducing this error. The test passes on DynamoDB; on Alternator it fails before this patch and passes with it. Fixes #25947 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25951	2025-09-15 08:43:00 +03:00
Nadav Har'El	208d3986a7	alternator: add explanation of internal tags Alternator needs to store a few pieces of information for each table that it can't store in the existing CQL schema. We decided to store this information in hidden tags - tags named with the prefix "system:" - and we already have four of those: Provisioned RCU and WCU, table creation time, and TTL's expiration-time attribute. This patch moves the definition of all four tags to one place in executor.cc, adds a short comment about the content of each tag, and adds a longer comment explaining why we have these hidden tags at all. It is expected that more hidden tags will follow - e.g., to solve issue #5320. So we expect more tags to be added later in the same place in the code. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25980	2025-09-15 08:41:39 +03:00
Emil Maskovsky	99db980899	gossiper: eliminate duplicate code in do_shadow_round Remove a redundant code block inadvertently introduced in commit `4b3d160f34`. While the duplicate did not affect functionality, its presence could cause confusion and maintenance issues. This change does not alter behavior and is purely a cleanup. Fixes: scylladb/scylladb#25999 Backport: The issue exists in all 2025 branches, so it should be backported accordingly. Closes scylladb/scylladb#26001	2025-09-15 08:35:04 +03:00
Jenkins Promoter	c63b335819	Update pgo profiles - aarch64	2025-09-15 05:17:07 +03:00
Jenkins Promoter	e97a0c8b42	Update pgo profiles - x86_64	2025-09-14 21:23:37 -04:00
Aleksandra Martyniuk	75b772adfb	db: optimize cache invalidation following repair/streaming Currently, if a new sstable is created during repair/streaming, we invalidate its whole token range in cache. If the sstable is sparse, we unnecessarily clear too much data. Modify cache invalidation, so that only the partitions present in the sstable are cleared. To check whether a partition is present in the sstable, we use bloom filters. Bloom filters may return false positives and show that an sstable contains a partition, even though it does not. Due to that we may invalidate a bit more than we need to, but the cache will be in valid state. An issue arises when we do not invalidate two consecutive partitions that are continuous. The sstable may contain a token that falls between these partitions, breaking the continuity. To check that, we would need to scan sstable index. However, such a change would noticeably complicate the invalidation, both performance and code. In this change, sstable index reader isn't used. Instead, the continuity flag is unset for all scanned partitions. This comes at a cost of heavier reads, as we will need to verify continuity when reading more than one partition from cache. Fixes: https://github.com/scylladb/scylladb/issues/9136. Closes scylladb/scylladb#25996	2025-09-14 19:48:14 +03:00
Lakshmi Narayanan Sreethar	1d1e572962	sstables: skip bloom filter rebuilds with minimal savings If a bloom filter was built with a bad partition estimate, it is rebuilt right before the sstable is sealed. The rebuild is already skipped if the current bitset size results in a false-positive rate within 75%–125% of the configured value. This patch adds additional conditions to prevent rebuilds when the savings are minimal. It also skips rebuilding for garbage collected sstables, since they will be dropped soon anyway. Also updated and added more test cases to cover these new criteria for bloom filter rebuilds. Fixes #25464 Fixes #25468 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#25968	2025-09-14 18:19:50 +03:00
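The pre-existing skip rule mentioned in the commit above (a false-positive rate within 75%–125% of the configured value) can be sketched as below. The function name, the inclusive bounds, and the sample rates are assumptions for illustration; the additional savings-based conditions added by the patch are not modeled here.

```python
def within_tolerance(current_fp_rate: float, configured_fp_rate: float) -> bool:
    """Return True when the current false-positive rate lies within
    75%-125% of the configured rate, i.e. a rebuild would be skipped.

    Inclusive bounds are an assumption of this sketch.
    """
    lower = 0.75 * configured_fp_rate
    upper = 1.25 * configured_fp_rate
    return lower <= current_fp_rate <= upper


if __name__ == "__main__":
    configured = 0.01
    for current in (0.005, 0.01, 0.02):
        print(current, within_tolerance(current, configured))
```

A filter that is too small (rate above the upper bound) or too large (rate below the lower bound) falls outside the band and remains a rebuild candidate.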
Nadav Har'El	5307d1b9a8	Merge 'vector_index: add version to index options' from Dawid Pawlik Since creating the vector index does not lead to creation of a view table [#24438] (whose version info had been logged in `system_schema.scylla_tables`) we lacked the information about the version of the index. The solution we arrived at is to add the version as a field in options column of `system_schema.indexes`. It requires few changes and seems unintrusive for existing infrastructure. This patch implements the solution described above. Refs: VECTOR-142 Closes scylladb/scylladb#25614 * github.com:scylladb/scylladb: cqlpy/test_vector_index: add vector index version test vector_index, index_prop_defs: add version to index options create_index_statement: rename `validator` to `custom_index_factory` custom index: rename `custom_index_option_name` vector_index: rename `supported_options` to `vector_index_options`	2025-09-14 15:35:53 +03:00
Radosław Cybulski	30306c3375	Remove const & from tags_extension constructor `tags_extension` constructor unnecessarily takes `std::map` by const ref, forcing a copy. This patch removes const ref for performance reasons. Closes scylladb/scylladb#25977	2025-09-14 13:32:21 +03:00
Ernest Zaslavsky	44d34663bc	treewide: seastar module update and fix broken rest client start using `write_body` in `rest/client` to properly set headers due to changes applied to seastar's http client Seastar module update ``` b6be384e Merge 'http: generalize Content-Type setting' from Nadav Har'El 74472298 http: generalize request's Content-Type setting 9fd5a1cc http: generalize reply's Content-Type setting a2665f38 memory: Remove deprecated enable_abort_on_allocation_failure() d2a5a8a9 resource.cc: Remove some dead code 7ad9f424 http: Add support of multiple key repetitions for the request a636baca task: Move task::get_backtrace() definition in its class a0101efa Fixed "doxygen" spelling in error message db969482 Merge 'http/reply: introduce set_cookie()' from Botond Dénes 5357b434 http/reply: introduce set_cookie() 1ddcf05f http/reply: make write_reply*() public 4b782d73 http/connection: start_response(): fix indentation 720feca0 http/reply: encapsulate reply writing in write_reply() 3e19917d Merge 'exceptions: log thrown and propagated exception with distinct log levels' from Botond Dénes db9aea93 Merge 'Correctly wrap up abandoned yielding directory lister' from Pavel Emelyanov dbb2bf3f test: Add test for input_stream::read_exactly() a5308ec9 file/directory_lister: Correctly wrap up fallback generator 4f0811f4 file/directory_lister: Convert on-stack queue to shared pointer 59801da7 tests: Add directory lister early drop cases 33233032 http/reply: s/write_reply_to_connection/write_reply/ 69b93620 http/reply: write_reply_{to_connection,headers}(): pass output stream 56e9bda7 test: Convert directory_test into seastar test 96782358 Merge 'Improve io_tester's seqwrite and append workloads' from Pavel Emelyanov 8b46e3d4 SEASTAR_ASSERT: assert to stderr and flush stream 3370e22a tutorial.md: use current_exception_as_future() e977453a Add fixture support for seastar::testing 3e70d7f7 io_tester: Do not set append_is_unlikely unconditionally 2a4ae7b4 io_tester: Count 
file size overflows 5e678bb5 io_tester: Tuneup size overflow check d5dad8ce io_tester: Move position management code to io_class_data 5586a056 io_tester: Rename seqwrite -> overwrite 92df2fb2 io_tester: Relax return value of create_and_fill_file() 03d9500d io_tester: Dont fill file for APPEND d6844a7b io_tester: Indentation fix after previous patch fb9e0088 io_tester: Coroutinize create_and_fill_file() 2f802f57 exceptions: log thrown and propagated exception with distinct log levels 4971fa70 util: move log-level into own header 39448fc1 Merge 'Fix and tune http::request setup by client' from Pavel Emelyanov 52d0c4fb iostream: Move output_stream::write(scattered_message) lower 7a52f734 Merge 'read_first_line: Missing pragma and licence' from Ernest Zaslavsky d0881b7e read_first_line: Add missing license boilerplate 988a0e99 read_first_line:: Add missing `#pragma once` 42675266 http: Make client::make_request accept const request& c7709fb5 http: Make request making API return exceptional future not throw b68ed89b http: Move request content length header setup 1d96dac6 http: Move request version configuration 072e86f6 http: Setup request once ``` Closes scylladb/scylladb#25915	2025-09-13 17:14:28 +03:00
Asias He	9bca90be0d	repair: Fix repair_row_level_stop verb idl The version keyword is missing for the optional mark_as_repaired parameter. This causes the new node to expect more data to come: INFO 2025-09-01 19:23:05,332 [shard 0:strm] rpc - client 127.0.7.6:50116 msg_id 8: caught exception while processing a message: std::out_of_range (deserialization buffer underflow) When the sender is an old node in a mixed cluster, the data will never come. To fix, add the missing version keyword. Our idl-compiler.py should have caught the typo since the keyword was missing in the [[]] tag. Fixes #25666 Closes scylladb/scylladb#25782	2025-09-12 15:58:19 +03:00
Avi Kivity	ef7babda3d	Merge 'test: deflake test_restart_leaving_replica_during_cleanup' from Patryk Jędrzejczak The test started hitting #21779 recently. We deflake it in this commit by disabling the tablet load balancing before dropping the keyspace at the end of the test. We still have to understand why the test started hitting #21779, so we keep #25938 open. Refs #25938 The test was flaky only on master, so no backport needed. Closes scylladb/scylladb#25975 * github.com:scylladb/scylladb: test: enable load balancing on a single node in test_restart_leaving_replica_during_cleanup test: deflake test_restart_leaving_replica_during_cleanup	2025-09-12 15:58:19 +03:00
Sayanta Banerjee	6092520631	Small grammatical changes Closes scylladb/scylladb#24667	2025-09-12 15:58:19 +03:00
Radosław Cybulski	436150eb52	treewide: fix spelling errors Fix spelling errors reported by copilot on github. Remove single use namespace alias. Closes scylladb/scylladb#25960	2025-09-12 15:58:19 +03:00
Patryk Jędrzejczak	aaab71c14e	test: enable load balancing on a single node in test_restart_leaving_replica_during_cleanup Doing it on more than one node is redundant.	2025-09-11 13:19:56 +02:00
Patryk Jędrzejczak	4c9efc08d8	test: deflake test_restart_leaving_replica_during_cleanup The test started hitting #21779 recently. We deflake it in this commit by disabling the tablet load balancing before dropping the keyspace at the end of the test. We still have to understand why the test started hitting #21779, so we keep #25938 open. Refs #25938	2025-09-11 13:19:51 +02:00
Patryk Jędrzejczak	eae12c1717	test: cluster: add a test for restarts with no group 0 quorum We don't have such a test, and we could add a group 0 quorum requirement on the restart path by mistake. A new test, no backport. Closes scylladb/scylladb#25623	2025-09-11 08:56:34 +03:00
Raphael S. Carvalho	b607b1c284	compaction: Fix stop of sstable cleanup The interface suggests the whole sstable cleanup is aborted with 'nodetool stop CLEANUP', but it is currently stopping only the ongoing cleanup task, and the compaction manager will retry the task since the error is not propagated all the way back to the caller. With raft topology, the coordinator should retry it though since cleanup became mandatory with automatic cleanup. So it's only fixing the usage where cleanup is issued manually. The stop exception is only propagated to the caller of cleanup. When stopping tasks during shutdown, the exception is swallowed and the error only returned to the caller. Fixes #20823. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#24996	2025-09-11 08:55:10 +03:00
Yaron Kaikov	902d139c80	tools: toolchain: dbuild: add setuptools_scm as dependency this package was added as a dependency to `cqlsh` in `216d8b0658` Fixes: https://github.com/scylladb/scylladb/issues/25613 [Yaron: regenerate frozen toolchain with optimized clang from https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz ] Closes scylladb/scylladb#25932	2025-09-11 08:51:28 +03:00
Cezar Moise	20ba8d4e8c	test: skip flaky test test_one_big_mutation_corrupted_on_startup The test is flaky since it tries to corrupt the commitlog in a non-deterministic way that sometimes allows the tested mutation to escape and be replayed anyhow. refs: #25627 Closes scylladb/scylladb#25950	2025-09-11 08:39:24 +03:00
Avi Kivity	c91b326d5a	Merge 'transport: replace throwing protocol_exception with returns' from Dario Mirovic Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance. Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_eptr`. This change is then propagated to the callers. The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_eptr_throw_policy`. This means that the behavior of commitlog module stays the same. transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_eptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved. cql3 module changes do the same as transport server module. Benchmark that is not yet merged has commit `67fbe35833e2d23a8e9c2dcb5e04580231d8ec96`, [GitHub diff view](https://github.com/scylladb/scylladb/compare/master...nuivall:scylladb:perf_cql_raw). It uses either read or write query. 
Command line used: ``` ./build/release/scylla perf-cql-raw --workdir ~/tmp/scylladir --smp 1 --developer-mode 1 --workload write --duration 300 --concurrency 1000 --username cassandra --password cassandra 2>/dev/null ``` The only thing changed across runs is `--workload write`/`--workload read`. Built and run on `release` target. <details> ``` throughput: mean= 36946.04 standard-deviation=1831.28 median= 37515.49 median-absolute-deviation=1544.52 maximum=39748.41 minimum=28443.36 instructions_per_op: mean= 108105.70 standard-deviation=965.19 median= 108052.56 median-absolute-deviation=53.47 maximum=124735.92 minimum=107899.00 cpu_cycles_per_op: mean= 70065.73 standard-deviation=2328.50 median= 69755.89 median-absolute-deviation=1250.85 maximum=92631.48 minimum=66479.36 ⏱ real=5:11.08 user=2:00.20 sys=2:25.55 cpu=85% ``` ``` throughput: mean= 40718.30 standard-deviation=2237.16 median= 41194.39 median-absolute-deviation=1723.72 maximum=43974.56 minimum=34738.16 instructions_per_op: mean= 117083.62 standard-deviation=40.74 median= 117087.54 median-absolute-deviation=31.95 maximum=117215.34 minimum=116874.30 cpu_cycles_per_op: mean= 58777.43 standard-deviation=1225.70 median= 58724.65 median-absolute-deviation=776.03 maximum=64740.54 minimum=55922.58 ⏱ real=5:12.37 user=27.461 sys=3:54.53 cpu=83% ``` ``` throughput: mean= 37107.91 standard-deviation=1698.58 median= 37185.53 median-absolute-deviation=1300.99 maximum=40459.85 minimum=29224.83 instructions_per_op: mean= 108345.12 standard-deviation=931.33 median= 108289.82 median-absolute-deviation=55.97 maximum=124394.65 minimum=108188.37 cpu_cycles_per_op: mean= 70333.79 standard-deviation=2247.71 median= 69985.47 median-absolute-deviation=1212.65 maximum=92219.10 minimum=65881.72 ⏱ real=5:10.98 user=2:40.01 sys=1:45.84 cpu=85% ``` ``` throughput: mean= 38353.12 standard-deviation=1806.46 median= 38971.17 median-absolute-deviation=1365.79 maximum=41143.64 minimum=32967.57 instructions_per_op: mean= 117270.60 
standard-deviation=35.50 median= 117268.07 median-absolute-deviation=16.81 maximum=117475.89 minimum=117073.74 cpu_cycles_per_op: mean= 57256.00 standard-deviation=1039.17 median= 57341.93 median-absolute-deviation=634.50 maximum=61993.62 minimum=54670.77 ⏱ real=5:12.82 user=4:10.79 sys=11.530 cpu=83% ``` This shows ~240 instructions per op increase for reads and ~180 instructions per op increase for writes. Tests have been run multiple times, with almost identical results. Each run lasted 300 seconds. Number of operations executed is roughly 38k per second 300 seconds = 11.4m ops. Update: I have repeated the benchmark with clean state - reboot computer, put in performance mode, rebuild, closed other apps that might affect CPU and disk usage. run count: 5 times before and 5 times after the patch duration: 300 seconds Average write throughput median before patch: 41155.99 Average write throughput median after patch: 42193.22 Median absolute deviation is also lower now, with values in range 350-550, while the previous runs' values were in range 750-1350. </details> Built and run on `release` target. 
<details> ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null ``` throughput: mean= 14910.90 standard-deviation=477.72 median= 14956.73 median-absolute-deviation=294.16 maximum=16061.18 minimum=13198.68 instructions_per_op: mean= 659591.63 standard-deviation=495.85 median= 659595.46 median-absolute-deviation=324.91 maximum=661184.94 minimum=658001.49 cpu_cycles_per_op: mean= 213301.49 standard-deviation=2724.27 median= 212768.64 median-absolute-deviation=1403.85 maximum=225837.15 minimum=208110.12 ⏱ real=5:19.26 user=5:00.22 sys=15.827 cpu=98% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null ``` throughput: mean= 93345.45 standard-deviation=4499.00 median= 93915.52 median-absolute-deviation=2764.41 maximum=104343.64 minimum=79816.66 instructions_per_op: mean= 65556.11 standard-deviation=97.42 median= 65545.11 median-absolute-deviation=71.51 maximum=65806.75 minimum=65346.25 cpu_cycles_per_op: mean= 34160.75 standard-deviation=803.02 median= 33927.16 median-absolute-deviation=453.08 maximum=39285.19 minimum=32547.13 ⏱ real=5:03.23 user=4:29.46 sys=29.255 cpu=98% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null ``` throughput: mean= 206982.18 standard-deviation=15894.64 median= 208893.79 median-absolute-deviation=9923.41 maximum=232630.14 minimum=127393.34 instructions_per_op: mean= 35983.27 standard-deviation=6.12 median= 35982.75 median-absolute-deviation=3.75 maximum=36008.24 minimum=35952.14 cpu_cycles_per_op: mean= 17374.87 standard-deviation=985.06 median= 17140.81 median-absolute-deviation=368.86 maximum=26125.38 minimum=16421.99 ⏱ real=5:01.23 user=4:57.88 sys=0.124 cpu=98% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null ``` throughput: mean= 16198.26 
standard-deviation=902.41 median= 16094.02 median-absolute-deviation=588.58 maximum=17890.10 minimum=13458.74 instructions_per_op: mean= 659752.73 standard-deviation=488.08 median= 659789.16 median-absolute-deviation=334.35 maximum=660881.69 minimum=658460.82 cpu_cycles_per_op: mean= 216070.70 standard-deviation=3491.26 median= 215320.37 median-absolute-deviation=1678.06 maximum=232396.48 minimum=209839.86 ⏱ real=5:17.33 user=4:55.87 sys=18.425 cpu=99% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null ``` throughput: mean= 97067.79 standard-deviation=2637.79 median= 97058.93 median-absolute-deviation=1477.30 maximum=106338.97 minimum=87457.60 instructions_per_op: mean= 65695.66 standard-deviation=58.43 median= 65695.93 median-absolute-deviation=37.67 maximum=65947.76 minimum=65547.05 cpu_cycles_per_op: mean= 34300.20 standard-deviation=704.66 median= 34143.92 median-absolute-deviation=321.72 maximum=38203.68 minimum=33427.46 ⏱ real=5:03.22 user=4:31.56 sys=29.164 cpu=99% ``` ./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null ``` throughput: mean= 223495.91 standard-deviation=6134.95 median= 224825.90 median-absolute-deviation=3302.09 maximum=234859.90 minimum=193209.69 instructions_per_op: mean= 35981.41 standard-deviation=3.16 median= 35981.13 median-absolute-deviation=2.12 maximum=35991.46 minimum=35972.55 cpu_cycles_per_op: mean= 17482.26 standard-deviation=281.82 median= 17424.08 median-absolute-deviation=143.91 maximum=19120.68 minimum=16937.43 ⏱ real=5:01.23 user=4:58.54 sys=0.136 cpu=99% ``` </details> Fixes: #24567 This PR is a continuation of #24738 [transport: remove throwing protocol_exception on connection start](https://github.com/scylladb/scylladb/pull/24738). This PR does not solve a burning issue, but is rather an improvement in the same direction. As it is just an enhancement, it should not be backported. 
Closes scylladb/scylladb#25408 * github.com:scylladb/scylladb: test/cqlpy: add protocol exception tests test/cqlpy: `test_protocol_exceptions.py` refactor message frame building test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code transport: replace `make_frame` throw with return result cql3: remove throwing `protocol_exception` transport: replace throw in validate_utf8 with result_with_exception_ptr return transport: replace throwing protocol_exception with returns utils: add result_with_exception_ptr test/cqlpy: add unknown compression algorithm test case	2025-09-10 21:54:15 +03:00
Yaron Kaikov	0a025d121f	packaging: Add `adduser` as dependency As the `adduser` command is used by `/var/lib/dpkg/info/scylla-server.postinst` and similar during rpm post-install. Fixes: https://github.com/scylladb/scylladb/issues/23722 Closes scylladb/scylladb#25928	2025-09-10 21:51:25 +03:00
Avi Kivity	fc64333040	Merge 'sstables/trie: add BTI index readers and writers' from Michał Chojnowski This is yet another part in the BTI index project. Overarching issue: https://github.com/scylladb/scylladb/issues/19191 Previous part: https://github.com/scylladb/scylladb/pull/25506/ Next part: plugging the BTI index readers and writers into sstable readers and writers. The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability. This series implements, on top of the key translation logic, and abstract trie writing and traversal logic, a writer and a reader of sstable index files (which map primary keys to positions in Data.db), as described in `f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md`. Caveats: 1. I think the added test has reasonable coverage, but that depends on running it multiple times. (Though it shouldn't need more than a few runs to catch any bug it covers). It's somewhat awkward as a test meant for running in CI, it's better as something you run many times after a relevant change. 2. These readers and writers are intended to be compatible with Cassandra, but I did NOT do any compatibility testing. The writers and readers added here have only been tested against each other, not against Cassandra's readers and writers. 3. This didn't undergo any proper benchmarking and optimization work. I was doing some measurements in the past, but everything was rewritten so much since then that my old measurements are effectively invalidated. Frankly I have no idea what the performance of all this branchy-branchy logic is now. No backports needed, new functionality. 
Closes scylladb/scylladb#25626 * github.com:scylladb/scylladb: test/manual: add bti_cassandra_compatibility_test test/lib/random_schema: add some constraints for generated uuid and time/date values test/lib/random_utils: add a variant of get_bytes which takes an `engine&` test/boost: add bti_index_test sstables/writer: add an accessor for the current write position in Data.db sstables/trie: introduce bti_index_reader sstables/trie: add bti_partition_index_writer.cc sstables/trie: add bti_row_index_writer.cc utils/bit_cast: add a new overload of write_unaligned() sstables/trie: add trie_writer::add_partial() sstables/consumer: add read_56() sstables/trie: make bti_node_reader::page_ptr copy-constructible sstables: extract abstract_index_reader from index_reader.hh to its own header sstables/trie: add an accessor to the file_writer under bti_node_sink sstables/types: make `deletion_time::operator tombstone()` const sstables/types: add sstables::deletion_time::make_live() sstables/trie: fix a special case in max_offset_from_child sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding sstables/trie: rewrite lcb_mismatch to handle fragment invalidation test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr`	2025-09-10 21:48:52 +03:00
Pavel Emelyanov	88a01308e7	api: Move /storage_service/keyspaces handler to database module The handler uses database service, not storage_service, and should belong to the corresponding API module from column_family.cc Once moved, the handler can use captured sharded<database> reference and forget about http_context::db. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25834	2025-09-10 17:01:11 +02:00
Nadav Har'El	ce4592d8fc	Merge 'test: cluster: deflake consistency checks after decommission' from Patryk Jędrzejczak In the Raft-based topology, a decommissioning node is removed from group 0 after the decommission request is considered finished (and the token ring is updated). Therefore, `check_token_ring_and_group0_consistency` called just after decommission might fail when the decommissioned node is still in group 0 (as a non-voter). We deflake all tests that call `check_token_ring_and_group0_consistency` after decommission in this PR. Fixes #25809 This PR improves CI stability and changes only tests, so it should be backported to all supported branches. Closes scylladb/scylladb#25927 * github.com:scylladb/scylladb: test: cluster: deflake consistency checks after decommission test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency	2025-09-10 17:57:02 +03:00
Dawid Pawlik	1ce76a6ca2	cqlpy/test_vector_index: add vector index version test Test if the index version is the same as the base table version before the index was created. Test if recreating the index with the same parameters changes the version. Test if altering the base table does not change the version. Test if the user cannot specify the index version option by themselves.	2025-09-10 15:19:36 +02:00
Dawid Pawlik	909a51e524	vector_index, index_prop_defs: add version to index options Since creating the vector index does not lead to creation of a view table [#24438] (whose version info had been logged in `system_schema.scylla_tables`) we lack the information about the version of the index. The mentioned version is used to recognize the quick-drop-create index with the same parameters that needs to be rebuilt. The case is mainly experienced while testing, benchmarking or experimenting with Vector Search. Nevertheless it is important to have it considered, as it is really weird to see that DROP and CREATE commands did not change anything. Although reusing the same old index is a nice "optimization", the rebuild feels more natural for the get-to-know-VS-users. Should not change anything in a real production environment. The solution we arrived at is to add the version as a field in options column of `system_schema.indexes`. The version of vector index is a base table's schema version on which the index was created. The table's schema version changes every time a table is changed, meaning that CREATE INDEX or DROP INDEX statements also change it. Every index has a different index version, so it allows to identify them easily. This patch implements the solution described above.	2025-09-10 15:16:54 +02:00
Michał Chojnowski	47c2d09c22	test/manual: add bti_cassandra_compatibility_test Adds a heavy test which tests compatibility of BTI index files between Cassandra and Scylla. It's composed of a C++ part, used to read and write BTI files with Scylla's readers and writers, and a Python part which uses a Cassandra node and the C++ executable to make them read and write each other's files. The stages of the test are: 1. Use the C++ part to generate a random BIG sstable, and matching BTI index files. 2. Import the BIG files into Cassandra, let it generate its own BTI index files. 3. Read both Scylla's BTI and Cassandra's BTI index files using the C++ part. Check that they return the right positions and tombstones for each partition and row. 4. Sneakily swap Cassandra's BTI files for Scylla's BTI files, and query Cassandra (via CQL) for each row. Check that each query returns the right result. Not much can be inferred about the index via CQL queries, so the check we are doing on Cassandra is relatively weak. But in conjunction with the checks done on the Scylla part, it's probably good enough. The test is weird enough, and with heavy-enough dependencies (it uses a podman container to run the Cassandra) that it has been put in test/manual. To run the test, build `build/$build_mode/test/manual/bti_cassandra_compatibility_test_g`, and run `python test/manual/bti_cassandra_compatibility_test.py`. Note: there's a lot of things that could go wrong in this test. (E.g. file permission issues or port mapping issues due to the container usage, incompatibilities between the Python driver and the random CQL values generated by generate_random_mutations, etc). I hope it works everywhere, but I only tested it on my machine, running it inside the dbuild container.	2025-09-10 13:04:42 +02:00
Botond Dénes	514f59d157	tools/scylla-sstable: write: move to UUID generation We are moving away from integer generations, so stop using them. Also drop the --generation command-line parameter, UUID generations don't have to be provided by the caller, because random UUIDs will not collide with each other. To help the caller still know what generation the output sstable has (previously they provided it via --generation), print the generation to stdout. Closes scylladb/scylladb#25166	2025-09-10 13:47:26 +03:00
Aleksandra Martyniuk	20f55ea1b8	compaction: handle exception in expected_total_workload expected_total_workload methods of scrub compaction tasks create a vector of table_info based on table names. If any table was already dropped, then the exception is thrown. It leaves table_info in corrupted state and node crashes with `free(): invalid size`. Return std::nullopt if an exception was thrown to indicate that total workload cannot be found. Fixes: #25941.	2025-09-10 12:13:37 +02:00
Nadav Har'El	5e7251cd40	secondary index: fix xfailing test to pass on Cassandra We have an xfailing test test_secondary_index.py::test_limit_partition which reproduces a Scylla bug in LIMIT when scanning a secondary index (Refs #22158). The point of such a reproducer is to demonstrate the bug by passing on Cassandra but failing on Scylla - yet this specific test doesn't pass on Cassandra because it expects the wrong 3 out of 4 results to be returned: The test begins with LIMIT 1 and sees the first result is (2,1), so we expect when we increase the LIMIT to 3 to see more results from the same partition (2) - and yet the test mistakenly expected the next results to come from partition 1, which is not a reasonable expectation, and doesn't happen in Cassandra (I checked both Cassandra 5 and 4). After this patch, the test passes on Cassandra (I tried 4 and 5), and continues to fail on Scylla - which returns 4 rows despite the LIMIT 3. Note that it is debatable whether this test should insist at all on which 3 items are returned by "LIMIT 3" - In Cassandra the ordering of a SELECT with a secondary index is not well defined (see discussion in Refs #23392). So an alternative implementation of this test would be to just check that LIMIT 3 returns 3 items without insisting which: # In Cassandra the ordering of a SELECT with a secondary index is not # defined (see discussion in #23392), so we don't know which three # results to expect - just that it must be a 3-item subset. rows = list(rs) assert len(rows) == 3 assert set(rows).issubset({(1,1), (1,2), (2,1), (2,2)}) However, as of yet, I did not modify this test to do this. I still believe there is value in secondary index scans having the same order as a scan without a secondary index has - and not an undefined order, and if both Scylla and Cassandra implement that in practice, it's useful for tests to validate this so we'll know if this guarantee is ever broken. 
Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25676	2025-09-10 08:48:52 +03:00
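The alternative check quoted in the commit message above (accept any 3-item subset rather than a specific ordering) can be sketched as a standalone helper. This is only an illustration: the name `check_limit_subset` and the sample rows are assumptions, not code from the cqlpy test.

```python
def check_limit_subset(rows, limit, all_rows):
    """Return True if `rows` holds exactly `limit` distinct rows from `all_rows`.

    Hypothetical helper: it checks the size of the result and that every
    returned row exists in the table, without insisting on which rows
    (or which order) the LIMIT query picked.
    """
    rows = list(rows)
    distinct = set(rows)
    return (
        len(rows) == limit
        and len(distinct) == limit
        and distinct.issubset(all_rows)
    )


# The four rows inserted by the scenario described in the commit message.
ALL_ROWS = {(1, 1), (1, 2), (2, 1), (2, 2)}

# Any three distinct rows satisfy the relaxed check.
ok = check_limit_subset([(2, 1), (2, 2), (1, 1)], 3, ALL_ROWS)

# Four rows for LIMIT 3 (the behaviour the commit attributes to Scylla) fail.
too_many = check_limit_subset(sorted(ALL_ROWS), 3, ALL_ROWS)
```

The stricter, order-sensitive assertion kept by the commit catches more regressions; the relaxed form above only verifies the LIMIT itself.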
Wojciech Mitros	1f9be235b8	mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the view. Ghost rows are rows in the view with no corresponding row in the base table. Before this patch, only rows whose primary key columns of the base table had different values than any of the base rows were treated as ghost rows by the PRUNE statement. However, view rows which have a column in their primary key that's not in the base primary can also be ghost rows if this column has a different value than the base row with the same values of remaining primary key columns. That's because these rows won't be deleted unless we change value of this column in the base table to this specific value. In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic. If this column isn't the same in the base table and the view, these rows are also deleted. Fixes https://github.com/scylladb/scylladb/issues/25655 Closes scylladb/scylladb#25720	2025-09-10 07:35:00 +02:00
Patryk Jędrzejczak	520cc0eeaa	Merge 'test: fix race condition in test_long_join' from Emil Maskovsky The test could trigger gossiper API calls before the API was properly registered, causing intermittent 404 errors. Previously the test waited for the "init - starting gossiper" log, but this appears before API registration completes. Add explicit wait for gossiper API registration to ensure the endpoint is available before making requests, eliminating test flakiness. Fixes: scylladb/scylladb#25582 No backport needed: Issue only observed in master so far. Closes scylladb/scylladb#25583 * https://github.com/scylladb/scylladb: test: improve async execution in test_long_join test: fix race condition in test_long_join	2025-09-09 19:12:59 +02:00
Patryk Jędrzejczak	bb9fb7848a	test: cluster: deflake consistency checks after decommission In the Raft-based topology, a decommissioning node is removed from group 0 after the decommission request is considered finished (and the token ring is updated). Therefore, `check_token_ring_and_group0_consistency` called just after decommission might fail when the decommissioned node is still in group 0 (as a non-voter). We deflake all tests that call `check_token_ring_and_group0_consistency` after decommission in this commit. Fixes #25809	2025-09-09 19:01:12 +02:00
Patryk Jędrzejczak	e41fc841cd	test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency In the Raft-based topology, a decommissioning node is removed from group 0 after the decommission request is considered finished (and the token ring is updated). `wait_for_token_ring_and_group0_consistency` doesn't handle such a case; it only handles cases where the token ring is updated later. We fix this in this commit. We rely on the new implementation of `wait_for_token_ring_and_group0_consistency` in the following commit to fix flakiness of some tests. We also update the obsolete docstring in this commit.	2025-09-09 19:01:09 +02:00
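The idea of waiting until the token ring and group 0 agree, regardless of which of the two is updated last, can be sketched as a polling loop with a deadline. This is a minimal sketch under stated assumptions: `wait_until_consistent`, its parameters and the callables passed to it are hypothetical and do not reproduce the real `wait_for_token_ring_and_group0_consistency` helper.

```python
import asyncio
import time


async def wait_until_consistent(get_token_ring, get_group0, timeout=5.0, period=0.01):
    """Poll until both views contain the same set of hosts, or time out.

    `get_token_ring` and `get_group0` are hypothetical callables returning
    iterables of host IDs. Comparing the two sets on every iteration means
    it does not matter which side lags behind the other.
    """
    deadline = time.monotonic() + timeout
    while True:
        ring, group0 = set(get_token_ring()), set(get_group0())
        if ring == group0:
            return ring
        if time.monotonic() >= deadline:
            raise TimeoutError(f"token ring {ring} != group 0 {group0}")
        await asyncio.sleep(period)


def make_lagging_group0(polls_before_removal):
    """Simulate group 0 still listing a decommissioned node 'c' for a few polls."""
    state = {"polls": 0}

    def get_group0():
        state["polls"] += 1
        if state["polls"] < polls_before_removal:
            return ["a", "b", "c"]  # decommissioned node not removed yet
        return ["a", "b"]

    return get_group0, state
```

A test would then call the waiting helper after decommission instead of asserting consistency immediately.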
Avi Kivity	5237a20993	Merge 'replica: Fix split compaction when tablet boundaries change' from Raphael Raph Carvalho Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. All 2025.* versions are vulnerable, so fix must be backported to them. Closes scylladb/scylladb#25690 * github.com:scylladb/scylladb: replica: Fix split compaction when tablet boundaries change replica: Futurize split_compaction_options()	2025-09-09 17:05:32 +03:00
Dawid Mędrek	789a4a1ce7	test/perf: Adjust tablet_load_balancing.cc to RF-rack-validity We modify the logic to make sure that all of the keyspaces that the test creates are RF-rack-valid. For that, we distribute the nodes across two DCs and as many racks as the provided replication factor. That may have an effect on the load balancing logic, but since this is a performance test and since tablet load balancing is still taking place, it should be acceptable. This commit also finishes work in adjusting perf tests to pass with the `rf_rack_valid_keyspaces` configuration option enabled. The remaining tests either don't attempt to create keyspaces or they already create RF-rack-valid keyspaces. We don't need to explicitly enable the configuration option. It's already enabled by default by `cql_test_config`. The reason why we haven't run into any issue because of that is that performance tests are not part of our CI. Fixes scylladb/scylladb#25127 Closes scylladb/scylladb#25728	2025-09-09 12:46:46 +03:00
Botond Dénes	a89d0a747b	Merge 'test.py: add different levels of verbosity for output' from Andrei Chekun Add another level of verbosity: quiet. Before this it was used as a default one, but it does not provide enough information. These changes should be coupled with pytest-sugar plugin to have an intended information for each level. Invoke the pytest as a module, instead of a separate process, to get access to the terminal to be able to use it interactively. Framework change only, so backporting to 2025.3 Fixes: #25403 Closes scylladb/scylladb#25698 * github.com:scylladb/scylladb: test.py: add additional level of verbosity for output test.py: start pytest as a module instead of subprocess	2025-09-09 11:49:51 +03:00
Asias He	cb7db47ae1	repair: Add incremental_mode option for tablet repair This patch introduces a new `incremental_mode` parameter to the tablet repair REST API, providing more fine-grained control over the incremental repair process. Previously, incremental repair was on and could not be turned off. This change allows users to select from three distinct modes: - `regular`: This is the default mode. It performs a standard incremental repair, processing only unrepaired sstables and skipping those that are already repaired. The repair state (`repaired_at`, `sstables_repaired_at`) is updated. - `full`: This mode forces the repair to process all sstables, including those that have been previously repaired. This is useful when a full data validation is needed without disabling the incremental repair feature. The repair state is updated. - `disabled`: This mode completely disables the incremental repair logic for the current repair operation. It behaves like a classic (pre-incremental) repair, and it does not update any incremental repair state (`repaired_at` in sstables or `sstables_repaired_at` in the system.tablets table). The implementation includes: - Adding the `incremental_mode` parameter to the `/storage_service/repair/tablet` API endpoint. - Updating the internal repair logic to handle the different modes. - Adding a new test case to verify the behavior of each mode. - Updating the API documentation and developer documentation. Fixes #25605 Closes scylladb/scylladb#25693	2025-09-09 06:50:21 +03:00
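The three `incremental_mode` values described above can be summarised as a small decision function. This is a sketch of the documented behaviour only: the `SSTable` type and `plan_repair` function are hypothetical, and treating `disabled` as "process everything" is an assumption drawn from the phrase "behaves like a classic (pre-incremental) repair".

```python
from dataclasses import dataclass
from enum import Enum


class IncrementalMode(Enum):
    REGULAR = "regular"
    FULL = "full"
    DISABLED = "disabled"


@dataclass(frozen=True)
class SSTable:
    # Hypothetical stand-in for an sstable and its repaired marker.
    name: str
    repaired: bool


def plan_repair(mode, sstables):
    """Return (sstables to process, whether incremental repair state is updated)."""
    if mode is IncrementalMode.REGULAR:
        # Default: skip sstables that are already repaired, update state.
        return [s for s in sstables if not s.repaired], True
    if mode is IncrementalMode.FULL:
        # Process everything, including repaired sstables, and update state.
        return list(sstables), True
    # DISABLED: classic repair (assumed to cover all sstables), state untouched.
    return list(sstables), False


tables = [SSTable("repaired-1", True), SSTable("unrepaired-1", False)]
regular_plan = plan_repair(IncrementalMode.REGULAR, tables)
full_plan = plan_repair(IncrementalMode.FULL, tables)
disabled_plan = plan_repair(IncrementalMode.DISABLED, tables)
```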
Avi Kivity	c4ed7dd814	Merge 'gossiper: fix issues in processing gossip status during the startup and when messages are delayed to avoid empty host ids' from Emil Maskovsky Populate the local state during gossiper initialization in start_gossiping, preventing an empty state from being added to _endpoint_state_map and returned in get_endpoint_states responses, that was causing an 'empty host id issue' on the other nodes during nodes restart. Check for a race condition in do_apply_state_locally In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked. During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash. This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map. Fixes https://github.com/scylladb/scylladb/issues/25831 Fixes https://github.com/scylladb/scylladb/issues/25803 Fixes https://github.com/scylladb/scylladb/issues/25702 Fixes https://github.com/scylladb/scylladb/issues/25621 Ref https://github.com/scylladb/scylla-enterprise/issues/5613 Backport: The issue affects all current releases(2025.x), therefore this PR needs to be backported to all 2025.1-2025.3. Closes scylladb/scylladb#25849 * github.com:scylladb/scylladb: gossiper: fix empty initial local node state gossiper: add test for a race condition in start_gossiping gossiper: check for a race condition in `do_apply_state_locally` test/gossiper: add reproducible test for race condition during node decommission	2025-09-08 20:51:01 +03:00
Andrei Chekun	ea4cd431c9	test.py: add pytest-sugar plugin to the dependencies This plugin allows having better terminal output with progress bar for the tests. Closes scylladb/scylladb#25845 [avi: regenerate frozen toolchain] Closes scylladb/scylladb#25860	2025-09-08 20:50:02 +03:00
Radosław Cybulski	6d150e2d0c	Fix oversized allocation in paxos under pressure When cpu pressured, `_locks` structure in paxos might grow and cause oversized allocations and performance drops. We reserve memory ahead of time. Fixes #25559 Closes scylladb/scylladb#25874	2025-09-08 20:49:00 +03:00
Yaron Kaikov	d57741edc2	build_docker.sh: enable debug symbols installation Adding the latest scylla.repo location to our docker container, this will allow installation of the scylla-debuginfo package in case it's needed Fixes: https://github.com/scylladb/scylladb/issues/24271 Closes scylladb/scylladb#25646	2025-09-08 18:39:27 +03:00
Emil Maskovsky	f8c297ca27	test: improve async execution in test_long_join Replace list comprehensions with asyncio.gather() to await the injection API calls in fully concurrent manner.	2025-09-08 17:14:37 +02:00
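The difference between awaiting inside a list comprehension and using `asyncio.gather()` can be sketched as follows. This assumes the original comprehension awaited each call in turn; `fake_injection_call` is a hypothetical stand-in for the injection API request, not the test's real helper.

```python
import asyncio


async def fake_injection_call(node, delay=0.02):
    """Hypothetical stand-in for one injection API request to a node."""
    await asyncio.sleep(delay)
    return f"{node}: ok"


async def sequential(nodes):
    # Awaiting inside the comprehension finishes each call before the next
    # one starts, so total time grows with the number of nodes.
    return [await fake_injection_call(n) for n in nodes]


async def concurrent(nodes):
    # gather() schedules all calls at once and returns the results as a
    # list in the same order as its arguments.
    return await asyncio.gather(*(fake_injection_call(n) for n in nodes))


nodes = ["node1", "node2", "node3"]
sequential_results = asyncio.run(sequential(nodes))
concurrent_results = asyncio.run(concurrent(nodes))
```

Both variants return the same results in the same order; only the scheduling differs.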
Emil Maskovsky	a86bd06f08	test: fix race condition in test_long_join The test could trigger gossiper API calls before the API was properly registered, causing intermittent 404 errors. Previously the test waited for the "init - starting gossiper" log, but this appears before API registration completes. Add explicit wait for gossiper API registration to ensure the endpoint is available before making requests, eliminating test flakiness. Fixes: scylladb/scylladb#25582	2025-09-08 17:14:37 +02:00
Dario Mirovic	ef83d6b970	docs: cql: default create keyspace syntax This patch updates the create keyspace statement docs. It explains how the `replication` option in the create keyspace statement is now optional, and behaves the same as if we specified an empty set as following: `WITH replication = {}`. An example with no `replication` option specified has also been added. Refs #25145	2025-09-08 15:25:30 +02:00
Dario Mirovic	d92ceed19a	test: cqlpy: add test for create keyspace with no options specified This patch introduces one new test case. It tests that a keyspace can be created without specifying replication options. Since other options already had defaults, this test assures a keyspace can be created with no options specified at all, with the following query: `CREATE KEYSPACE ks;` Refs #25145	2025-09-08 15:25:23 +02:00
Pavel Emelyanov	34d1648d21	main: Properly handle zero allocation warning threshold The --help text says about --large-memory-allocation-warning-threshold: "Warn about memory allocations above this size; set to zero to disable." That's half-true: setting the value to zero spams logs with warnings of allocation of any size, as seastar treats zero threshold literally. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25850	2025-09-08 12:41:19 +02:00
Asias He	451e1ec659	streaming: Fix use after move in the tablet_stream_files_handler The files object is moved before the log when stream finishes. We've logged the files when the stream starts. Skip it in the end of streaming. Fixes #25830 Closes scylladb/scylladb#25835	2025-09-08 11:59:52 +02:00
Sergey Zolotukhin	b34d543f30	gossiper: fix empty initial local node state This change removes the addition of an empty state to `_endpoint_state_map`. Instead, a new state is created locally and then published via replicate, avoiding the issue of an empty state existing in `_endpoint_state_map` before the preemption point. Since this resolves the issue tested in `test_gossiper_empty_self_id_on_shadow_round`, the `xfail` mark has been removed. Fixes: scylladb/scylladb#25831	2025-09-08 11:38:31 +02:00
Sergey Zolotukhin	775642ea23	gossiper: add test for a race condition in start_gossiping This change adds a test for a race condition in `start_gossiping` that can lead to an empty self state sent in `gossip_get_endpoint_states_response`. Test for scylladb/scylladb#25831	2025-09-08 11:38:30 +02:00
Sergey Zolotukhin	f08df7c9d7	gossiper: check for a race condition in `do_apply_state_locally` In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked. During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash. This change 1. adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map. 2. Removes xfail from the test_gossiper_race test since the issue is now fixed. 3. Adds exception handling in `do_shadow_round` to skip responses from nodes that sent an empty host ID. This re-applies the commit `13392a40d4` that was reverted in `46aa59fe49`, after fixing the issues that caused the CI to fail. Fixes: scylladb/scylladb#25702 Fixes: scylladb/scylladb#25621 Ref: scylladb/scylla-enterprise#5613	2025-09-08 11:38:30 +02:00
Emil Maskovsky	28e0f42a83	test/gossiper: add reproducible test for race condition during node decommission This change introduces a targeted test that simulates the gossiper race condition observed during node decommissioning. The test delays gossip state application and host ID lookup to reliably reproduce the scenario where `gossiper::get_host_id()` is called on a removed endpoint, potentially triggering an abort in `apply_new_states`. There is a specific error injection added to widen the race window, in order to increase the likelihood of hitting the race condition. The error injection is designed to delay the application of gossip state updates, for the specific node that is being decommissioned. This should then result in the server abort in the gossiper. This re-applies the commit `5dac4b38fb` that was reverted in `dc44fca67c`, but modified to relax the check from "on_internal_error" to just a warning log. The stricter check can be re-introduced later once we are sure that all remaining problems are resolved and it will not break the CI. Refs: scylladb/scylladb#25621 Fixes: scylladb/scylladb#25721	2025-09-08 11:38:30 +02:00
Dawid Mędrek	bb0255b2fb	tools/scylla-sstable: Enable rf_rack_valid_keyspaces Enabling the configuration option should have no negative impact on how the tool behaves. There is no topology and we do not create any keyspaces (except for trivial ones using `SimpleStrategy` and RF=1), only their metadata. Thanks to that, we don't go through validation logic that could fail in presence of an RF-rack-invalid keyspace. On the other hand, enabling `rf_rack_valid_keyspaces` lets the tool access code hidden behind that option. While that might not be of any consequence right now, in the future it might be crucial (for instance, see: scylladb/scylladb#23030). Note that other tools don't need an adjustment: * scylla-types: it uses schema_builder, but it doesn't reuse any other relevant part of Scylla. * nodetool: it manages Scylla instances but is not an instance itself, and it does not reuse any codepaths. * local-file-key-generator: it has nothing to do with Scylla's logic. Other files in the `tools` directory are auxiliary and are instructed with an already created instance of `db::config`. Hence, no need to modify them either. Fixes scylladb/scylladb#25792 Closes scylladb/scylladb#25794	2025-09-08 11:52:43 +03:00
Yaron Kaikov	b07505a314	auto-backport.py: sync P0 and P1 labels when applied When triggering the backport process, adding a check for P0 and P1 labels, if available add them to backport PR together with force_on_cloud label Implementing first in pkg to test the process, then will move it to scylladb Fixes: PKG-62 Closes scylladb/scylladb#25856	2025-09-08 11:42:36 +03:00
Yaron Kaikov	407b7b0e18	Fix label parsing logic in backport check script Previously, the script attempted to assign GitHub Actions expressions directly within a Bash string using '${{ ... }}', which is invalid syntax in shell scripts. This caused the label JSON to be treated as a literal string instead of actual data, leading to parsing failures and incorrect backport readiness checks. This update ensures the label data is passed correctly via the LABELS_JSON environment variable, allowing jq to properly evaluate label names and conditions. Fixes: PKG-74 Closes scylladb/scylladb#25858	2025-09-08 11:42:16 +03:00
Dario Mirovic	20c173e958	cql: default `CREATE KEYSPACE` syntax Since all the options except `REPLICATION` already have defaults, and `REPLICATION` has defaults for all the fields inside, this patch makes `REPLICATION` optional. More specifically, there is no need for `WITH` in create keyspace statement anymore. This allows for the following syntax: `CREATE KEYSPACE [IF NOT EXISTS] <name>;` For example: `CREATE KEYSPACE test_keyspace;` Fixes #25145	2025-09-08 10:07:40 +02:00
Pawel Pery	61ee630f42	vector_store_client: add timeouts to tests Sometimes `vector_store_client_test_ann_request` test hangs up. It is hard to reproduce. It seems that the problem is that tests are unreliable in case of stalled requests. This patch attaches a timer to the abort_source to ensure that the test will finish with a timeout at least. Fixes: VECTOR-150 Fixes: #25234 Closes scylladb/scylladb#25301	2025-09-08 10:20:48 +03:00
Wojciech Mitros	10b8e1c51c	storage_proxy: send hints to pending replicas Consider the following scenario: - Current replica set is [A, B, C] - write succeeds on [A, B], and a hint is logged for node C - before the hint is replayed, D bootstraps and the token migrates from C to D - hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done - C is cleaned up, replayed data is lost, and D has a stale copy until next repair. In the scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets, as it can happen for every tablet migration. This issue is particularly detrimental to materialized views. View updates use hints by default and a specific view update may be sent to just one view replica (when a single base replica has a different row state due to reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due to a lost row update. Such inconsistencies can't be fixed by repairing either the view or the base table. To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original target is still alive. This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't be too common either. The scenarios for them are: 1. managing to send the hint to the source of a migrating replica before streaming that its token - the write will arrive on the pending replica anyway in streaming 2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to the actual source of the migration, the pending replica will get it during streaming 3. 
sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending replica for the hint so we'll send it multiple times 4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the pending replica, so we need to retry the entire write This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid sending the hint to pending replicas if the hint target is not the source of the migration, which would allow us to avoid the redundant writes 2 and 3. For rack-aware RF, this will be as simple as checking whether the replicas are in the same rack. We also add a test case reproducing the issue. Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com> Fixes https://github.com/scylladb/scylladb/issues/19835 Closes scylladb/scylladb#25590	2025-09-08 09:18:20 +02:00
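The target selection described in the commit above can be illustrated with a minimal sketch. This is not the storage_proxy code: `hint_targets` and the node names are hypothetical, and the only behaviour modelled is "original target plus every pending replica, without duplicates".

```python
def hint_targets(original_target, pending_replicas):
    """Return where a hint is sent: the original target first, then each
    pending replica, with duplicates dropped and order preserved.

    Hypothetical helper for illustration only; the real logic lives in
    storage_proxy and works on replica/host identifiers.
    """
    targets = [original_target]
    for replica in pending_replicas:
        if replica not in targets:
            targets.append(replica)
    return targets


# Replica set [A, B, C]; the hint was logged for C, and D is pending
# because the token is migrating from C to D.
print(hint_targets("C", ["D"]))
```

When there is no pending replica, the list contains only the original target, so nothing changes for a hint replayed outside a migration.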
Pavel Emelyanov	9deea3655f	s3: Fix chunked download source metrics calculations In S3 client both read and write metrics have three counters -- number of requests made, number of bytes processed and request latency. In most of the cases all three counters are updated at once -- upon response arrival. However, in case of chunked download source this way of accounting metrics is misleading. In this code the request is made once, and then the obtained bytes are consumed eventually as the data arrive. Currently, each time a new portion of data is read from the socket the number of read requests is incremented. That's wrong, the request is made once, and this counter should also be incremented once, not for every data buffer that arrived in response. Same for read request latency -- it's "added" for every data buffer that arrives, but it's a lengthy process, the _request_ latency should be accounted once per response. Maybe later we'll want to have "data latency" metrics as well, but for what we have now it's request latency. The number of read bytes is accounted properly, so not touched here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25770	2025-09-08 09:49:03 +03:00
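The accounting rule from the commit above, sketched under assumptions: `ReadMetrics` and `account_chunked_download` are hypothetical names, not the S3 client's API. The request count and request latency are updated once per request, while bytes are added for every buffer that arrives.

```python
from dataclasses import dataclass


@dataclass
class ReadMetrics:
    # Hypothetical stand-in for the three read counters named in the commit.
    requests: int = 0
    bytes: int = 0
    latency_seconds: float = 0.0


def account_chunked_download(metrics, request_latency_seconds, buffer_sizes):
    """Account one chunked download: a single request, many data buffers."""
    # The request is made once, so count it and its latency exactly once.
    metrics.requests += 1
    metrics.latency_seconds += request_latency_seconds
    # Bytes are accounted per buffer as the data arrives.
    for size in buffer_sizes:
        metrics.bytes += size
    return metrics


m = account_chunked_download(ReadMetrics(), 0.25, [4096, 4096, 1024])
print(m.requests, m.bytes, m.latency_seconds)
```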
Avi Kivity	03ee862b50	cql3: statement_restrictions: forbid querying a single-column or token restriction on a multi-column restriction In `41880bc893` ("cql3: statement_restrictions: forbid querying a single-column inequality restriction on a multi-column restriction"), we removed the ability to constrain a single column on a tuple inequality, on the grounds that it isn't used and can't be used. Here, we extend this to remove the ability to constrain a single column on a tuple equality, on the grounds that it isn't used and hampers further refactoring. CQL supports multi-column equality restrictions in the form (ck1, ck2, ck3) = (:v1, :v2, :v3) This restriction shape is only allowed on clustering keys, and is translated into a partition_slice allowing the primary index to efficiently select the part of the partition that satisfies the restriction. The possible_lhs_values() function allows extracting single-column restrictions from this and similar tuple restrictions. For example, the multi-column restriction (ck1, ck2, ck3) = (:v1, :v2, :v3) implies that ck2 = :v2. If we have an index on ck2, and if we don't further have a restriction on the partition key, then it is advantageous to use the index to select rows, and then filter on ck1 and ck3 to satisfy the full restriction. However, we never actually do that. The following sequence ```cql CREATE TABLE ks.t1 ( pk int, ck1 int, ck2 int, PRIMARY KEY (pk, ck1, ck2) ); CREATE INDEX ON ks.t1(ck1); SELECT * FROM ks.t1 WHERE (ck1, ck2) = (1, 2); ``` could have been used to query a single partition via the index, but instead is used for a full table scan, using the partition slice to skip through unselected rows. We can't easily start using a new query plan via the index, since switching plans mid-query (due to paging and moving from one coordinator to another during upgrade) would cause the sort order to change, therefore causing some rows to be omitted and some rows to be returned twice. 
Similarly, we cannot extract a token restriction from a tuple, since the grammar doesn't allow for ```cql WHERE (token(pk)) = (:var1) ``` Since it's not used, remove it. This code was first introduced in `d33053b841` ("cql3/restrictions: Add free functions over new classes") It does not directly correspond to pre-expression code. Closes scylladb/scylladb#25757 Closes scylladb/scylladb#25821	2025-09-07 18:36:05 +03:00
Nadav Har'El	040d6e2245	Merge 'interval: specialize for trivially copyable types' from Avi Kivity Interval's copy and move constructors are full of branches since the two payload T:s are optional and therefore have to be optionally-constructed. This can be eliminated for trivially copyable types (like dht::token) by eliminating interval's user-defined special member functions (constructors etc) in that special case. In turn, this enables optimizations in the standard library (and our own containers) that convert moves/copies of spans of such types into memcpy(). Minor optimization, not a candidate to backport. Closes scylladb/scylladb#25841 * github.com:scylladb/scylladb: test: nonwrapping_interval_test: verify an interval of tokens is trivial interval: specialize interval_data<T> for trivial types interval: split data members into new interval_data class	2025-09-07 17:10:32 +03:00
Raphael S. Carvalho	68f23d54d8	replica: Fix split compaction when tablet boundaries change Consider the following: 1) balancer emits split decision 2) split compaction starts 3) split decision is revoked 4) emits merge decision 5) completes merge, before compaction in step 2 finishes After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged. This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart. To fix this, let's make split compaction always work with the state when it started, not a global state. Fixes #24153. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-09-07 05:20:23 -03:00
Raphael S. Carvalho	0c1587473c	replica: Futurize split_compaction_options() Preparation for the fix of #24153. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-09-07 05:19:09 -03:00
Michał Chojnowski	77dcb2bcda	test/lib/random_schema: add some constraints for generated uuid and time/date values I want to write a test which generates a random table (random schema, random data) and uses the Python driver to query it. But it turns out that some values generated by test/lib/random_schema can't be deserialized by the Python driver. For example, it doesn't support unknown uuid versions, dates before year 1 or after year 9999, or `time` values greater than or equal to the number of nanoseconds in a day. AFAIK those "driver-illegal" values aren't particularly interesting for tests which use `random_schema`, so we can just not generate them.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	3ce7b761ce	test/lib/random_utils: add a variant of get_bytes which takes an `engine&` I will need it in a test later.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	927c7ecbb0	test/boost: add bti_index_test Adds a fat unit test (integration test?) for BTI index writers and readers. The test generates a small random dataset for the index writer, writes it to a BTI file, and then attempts to run all possible (and legal) sequences (up to a certain length) of index reader operations and check that the results match the expectation (dictated by a "simple" reference index reader). It is currently the only test for this new feature, but I think it's reasonably good. (I validated the coverage by looking at LLVM coverage reports and by manually adding bugs in various places and checking that the test catches it after a reasonably small number of runs). (Note that in a later series, when we hook up BTI indexes to Data.db readers/writers, the feature will also be indirectly tested by the corresponding integration tests). This is a randomized test. As such, its power grows with the number of runs. In particular, a single run has a decently high probability of not exercising parts of the code at all. (E.g. the generated dataset might have no clustering keys). Also this is a slow test. (With the default parameters, ~1s in release mode on my PC, several seconds in debug mode. (And note that this is after a custom parameter downsizing for debug mode, because this test is slowed extremely badly by debug mode, due to the forced preemption after every future)). For those two reasons, I'm not glad to include it in the test suite that runs in CI. Instead of running it once in every CI run, I'd very much rather have it run 10000 times after the tested feature is touched, and before releases. Unfortunately we don't have a precedent for something like that.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	94088f0a41	sstables/writer: add an accessor for the current write position in Data.db It will be used by index tests to know the ground truth for where each partition and row are written, so that this can be checked against the index.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	11a5eec272	sstables/trie: introduce bti_index_reader Implements an index reader (implementing abstract_index_reader, which is the interface between Data.db readers and index readers) for the BTI index written by bti_partition_index_writer and bti_row_index_writer. The underlying trie walking logic is a thin wrapper around the logic added in `trie_traversal.hh` in an earlier patch series.	2025-09-07 00:32:02 +02:00
Michał Chojnowski	0cbf6d8d24	sstables/trie: add bti_partition_index_writer.cc Implements the Partition.db writer, which maps the (BTI-encoded) decorated keys to the position of the partition header in either Rows.db (iff the partition possesses an intra-partition index at all) or Data.db. The format of Partition.db is supposed to be as described in: `f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md (partition-index)` I didn't actually test it for compatibility with Cassandra (yet?). (I followed the docs linked above and Cassandra's source code, but it could be that I have made a mistake somewhere).	2025-09-07 00:32:02 +02:00
Michał Chojnowski	7e8fd13208	sstables/trie: add bti_row_index_writer.cc Implements the Rows.db writer, which for each partition that possesses an intra-partition index writes a trie of separators between clustering key blocks, and a header (footer?) with enough metadata (partition key, partition tombstone) to allow for a direct jump to a clustering row in Data.db. The format of Rows.db is supposed to be as described in: `f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md (row-index)` and the arbitrary details about the choice of clustering block separators follows Cassandra's choices. I didn't actually test it for compatibility with Cassandra (yet?). (I followed the docs linked above and Cassandra's source code, but it could be that I have made a mistake somewhere).	2025-09-07 00:32:02 +02:00
Michał Chojnowski	a800fef633	utils/bit_cast: add a new overload of write_unaligned() Does the same thing as the existing overload, but this one takes `std::byte` instead of `void`, and it additionally returns the pointer to the end position.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	9a4b739b8d	sstables/trie: add trie_writer::add_partial() `trie_writer` has an `add()` method which can add a new branch (key) with a payload at the end. In a later patch, we will want a way to split this addition into several steps, just a way to more naturally deal with fragmented keys. So this commit adds a method which adds new nodes but doesn't attach a payload at the end. This allows for a situation where leaf nodes are added without a payload, which is not supposed to happen. It's the user's responsibility to avoid that. Note: this might be overengineered. Maybe we should force the user to linearize the key. Maybe caring so much about fragmented keys as first-class citizens is the wrong thing to do.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	62843ceac9	sstables/consumer: add read_56() In a later commit, we want to use sstables/consumer.hh to implement a parser of BTI row index headers read from Rows.db. Partition tombstones in this file have an encoding which uses the first bit of the first byte to determine if the tombstone is live or not. If yes, then the timestamp is in the remaining 7 bits of the first byte, and the next 7 bytes, and the deletion_time is in the 4 bytes after that. So I need some way to read 1 byte, and then, depending on its value, maybe read the next 7 bytes and then 4 bytes. This commit adds a helper for reading a 7-byte int. Now that I'm typing this out, maybe that's not the smartest idea. Maybe I should just "manually" read the 11 bytes as 8, 2, 1. But I've already written this, so I might as well post it, it can always be replaced later.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	c6ad5511db	sstables/trie: make bti_node_reader::page_ptr copy-constructible In later commits, a trie cursor will be holding a `page_ptr`. Sometimes we will want to copy a cursor, in particular to reset the upper bound of the index reader with `_lower = _upper`. But currently `page_ptr` is non-copyable -- it's a shared pointer, but with an explicit `share()` method -- so a default operator= won't work for this. Let's make `page_ptr` copy-assignable for this purpose.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	25e98f1ff7	sstables: extract abstract_index_reader from index_reader.hh to its own header In later parts of the series, we add a second implementation of `abstract_index_reader`. To do that, we want a header with `abstract_index_reader`, but we don't need to pull in everything else from `index_reader.hh`. So let's extract `abstract_index_reader` to its own header.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	475cb18e90	sstables/trie: add an accessor to the file_writer under bti_node_sink The row index writer will need both a trie_writer and direct access to its underlying file_writer (to write partition headers, which are "outside" of the trie). And it would be weird to keep a separate reference to the file_writer if the trie_writer already does that. So let's add accessors needed to get to the file_writer&.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	d2fd0f9592	sstables/types: make `deletion_time::operator tombstone()` const No reason not to make it const.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	1fdac048d3	sstables/types: add sstables::deletion_time::make_live() A small helper, will be useful in later commits.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	0684dbb5bd	sstables/trie: fix a special case in max_offset_from_child A `writer_node` contains a chain of multiple BTI nodes. `writer_node::_node_size` is the size (in bytes) of the entire chain. But the parent of the `writer_node` wants to know the offset to the rootmost node in the chain. This can be deduced from the `writer_node::_transition_length` and the `writer_node::_node_size`. But there's an error in the logic for that, for the special case when there are two nodes in the chain. The rootmost node will be SINGLE_NOPAYLOAD_12 if and only if the leafmost node is smaller than 16 bytes, which is true if and only if `_node_size` is smaller than 19 bytes. But the current logic compares `_node_size` against 16. That's incorrect. This patch fixes that. There was a test for this branch, but it was not good enough. It only tested payloads of size 1 and 20, but the problem is only revealed by payloads of size 13-14. The test has been extended to cover all sizes between 1 and 20.	2025-09-07 00:30:15 +02:00
Michał Chojnowski	b2d793552f	sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding As far as I know, positions outside the clustering region are never passed to sstable indexes. But since they are representable within the argument type of lazy_comparable_bytes_from_clustering_position, let's make it handle them.	2025-09-07 00:30:08 +02:00
Avi Kivity	49b0751980	test: nonwrapping_interval_test: verify an interval of tokens is trivial Since dht::token is trivial, an interval<dht::token> ought to be trivial too. Verify that.	2025-09-06 18:41:00 +03:00
Avi Kivity	ed483647a4	interval: specialize interval_data<T> for trivial types C++ data movement algorithms (std::uninitialized_copy() and friends) and the containers that use them optimize for trivially copyable and destructible types by calling memcpy instead of using a loop around constructors/destructors. Make intervals of trivially copyable and destructible types also trivially copyable and destructible by specializing interval_data<T> not to have user-defined special member functions. This requires that T have a default constructor since we can't skip construction when !_start_exists or !_end_exists. To choose whether we specialize or not, we look at default constructibility (see above) and trivial destructibility. This is wider than trivial copyability (a user-defined copy constructor can exist) but is still beneficial, since the generated copy constructor for interval_data<T> will be branch-free. We don't implement the poison words in debug mode; nor are they necessary, since we no longer manage the lifetime of _start_value and _end_value manually but let the compiler do that for us. Note [1] prevents full conversion to memcpy for now, but we still get branch-free code. [1] https://gcc.gnu.org/bugzilla/show_bug.cgi?id=121789	2025-09-06 18:38:24 +03:00
Avi Kivity	20751517a4	interval: split data members into new interval_data class Prepare for specialized handling of trivial types by extracting the data members of wrapping_interval<T> and the special member functions (constructors/destructors/assignment) into a new interval_data<T> template. To avoid having to refer to data members with a this-> prefix, add using declarations in wrapping_interval<T>.	2025-09-06 18:31:58 +03:00
Pavel Emelyanov	b26816f80d	s3: Export memory usage gauge (metrics) The memory usage is tracked with the help of a semaphore, so just export its "consumed" units. One tricky place here is the need to skip metrics registration for the scylla-sstable tool. The thing is that the tool starts the storage manager and sstables manager on start and then some of the tool's operations may want to start both managers again (via cql environment), causing a double metrics registration exception. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25769	2025-09-05 18:25:34 +03:00
Botond Dénes	a96d31e684	Merge 'Update workflow trigger to pull_request_target - fixing fork PR bug' from Dani Tweig The previous version had a problem: Fork PRs didn't pass the Jira credentials to the main code, which updates the Jira key status. No need for backport. This is not the Scylla code, but a fix to GitHub Actions. Closes scylladb/scylladb#25833 * github.com:scylladb/scylladb: Change pull_request event to pull_request_target - ready for merge Update workflow to use pull_request_target event - in review Change pull_request event to pull_request_target - in progress	2025-09-05 18:23:19 +03:00
Anna Stuchlik	f66580a28f	doc: add support for i7i instances This commit adds currently supported i7i and i7ie instances to the list of instance recommendations. Fixes https://github.com/scylladb/scylladb/issues/25808 Closes scylladb/scylladb#25817	2025-09-05 14:14:58 +02:00
Andrei Chekun	da4990e338	test.py: add additional level of verbosity for output Add another level of verbosity: quiet. Before this it was used as the default one, but it does not provide enough information. These changes should be coupled with the pytest-sugar plugin to get the intended information for each level.	2025-09-05 11:54:49 +02:00
Andrei Chekun	7e34d5aa28	test.py: start pytest as a module instead of subprocess Invoke pytest as a module, instead of as a separate process, to get access to the terminal and be able to use it interactively.	2025-09-05 11:54:49 +02:00
Pavel Emelyanov	dc44fca67c	Revert "test/gossiper: add reproducible test for race condition during node decommission" This reverts commit `5dac4b38fb` as per request from #25803	2025-09-05 09:56:46 +03:00
Pavel Emelyanov	46aa59fe49	Revert "gossiper: check for a race condition in `do_apply_state_locally`" This reverts commit `13392a40d4` as per request from #25803	2025-09-05 09:56:21 +03:00
Anna Mikhlin	21ee24f7cd	trigger-scylla-ci: ignore comment from scylladbbot ignore comments posted by scylladbbot, to allow adding instruction in CI completion report of how to re-trigger CI Closes scylladb/scylladb#25838	2025-09-05 06:18:51 +03:00
dependabot[bot]	862f965196	build(deps): bump sphinx-scylladb-theme from 1.8.7 to 1.8.8 in /docs Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.8.7 to 1.8.8. - [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases) - [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.8.7...1.8.8) --- updated-dependencies: - dependency-name: sphinx-scylladb-theme dependency-version: 1.8.8 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Closes scylladb/scylladb#25823	2025-09-04 18:24:09 +03:00
Nadav Har'El	a1ed2c9d4b	Merge 'Allow users to SELECT from CDC log tables they created.' from Dawid Pawlik Before the patch, a user with CREATE access could create a table with CDC or alter the table enabling CDC, but could not run a SELECT on the CDC table they created. This was because the SELECT permission was checked on the CDC log, and later on its "parent" - the keyspace, but not on the base table, on which the user had SELECT permission automatically granted on CREATE. This patch matches the behavior of querying the CDC log to the one implemented for Materialized Views: 1. No new permissions are granted on CREATE. 2. When querying SELECT, the permissions on base table SELECT are checked. Fixes: https://github.com/scylladb/scylladb/issues/19798 Fixes: VECTOR-151 Closes scylladb/scylladb#25797 * github.com:scylladb/scylladb: cqlpy/test_permissions: run the reproducer tests for #19798 select_statement: check for access to CDC base table	2025-09-04 16:56:52 +03:00
Piotr Wieczorek	db8b7d1c89	alternator: Store LSI keys in :attrs for newly created tables Before this patch, LSI keys were stored as separate, top-level columns in the base table. This patch changes this behavior for newly created tables, so that the key columns are stored inside the `:attrs` map. Then, in the LSI's materialized view, we create a computed column for each LSI's range key that is not a key in the base table. This makes LSI storage consistent with GSIs and allows the use of a collection tombstone on `:attrs` to delete all fields in a row, except for keys, in new tables. Refs https://github.com/scylladb/scylladb/pull/24991 Refs https://github.com/scylladb/scylladb/issues/6930	2025-09-04 15:02:37 +02:00
Michał Chojnowski	b3c20e8cc3	sstables/trie: rewrite lcb_mismatch to handle fragment invalidation When cleaning up `lcb_mismatch` for review, I managed to forget that in the follow-up series I want to use it with iterators for which `it` points to data which is invalidated by `++it`. (The data in `lazy_comparable_bytes_` generators is kept in a vector of `bytes`, so `it` can point to the internal storage of `bytes`. Generating a new fragment via `++it` can resize the vector, move the `bytes`, and invalidate the `it`.) So during the pre-review cleanup, `lcb_mismatch` ended up in a shape which isn't prepared for that. This commit shuffles the control flow around so that `++it` is delayed until after the span obtained with `*it` is exhausted.	2025-09-04 15:02:29 +02:00
Michał Chojnowski	3c3ed867e6	test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr` The branch inside the `if constexpr (debug)` contains a piece of template code that doesn't typecheck properly. (I used this code before committing it, but apparently I let it become outdated when some changes around it happened). Fix that.	2025-09-04 15:02:29 +02:00
Piotr Wieczorek	089591a3db	alternator/test: Add LSI tests based mostly on the existing GSI tests This patch adds LSI tests that correspond to already existing tests for GSI, filling up some gaps in LSI: - `_wrong_type_attribute_`, - `test_lsi_duplicate_with_different_name`, - `test_lsi_empty_value_in_batch_write`, - `test_lsi_null_index`. As well as some new ones: - `test_lsi_empty_value_binary`, - `test_lsi_put_overwrites_lsi_column`, - `test_lsi_*_modifies_index`.	2025-09-04 15:02:22 +02:00
Dani Tweig	ddac32b656	Change pull_request event to pull_request_target - ready for merge Fix the fork PRs bug	2025-09-04 12:47:25 +03:00
Dani Tweig	eb0bb0f3a0	Update workflow to use pull_request_target event - in review Fix a fork PRs bug.	2025-09-04 12:42:52 +03:00
Dani Tweig	4c460464b8	Change pull_request event to pull_request_target - in progress Fix fork PRs bug.	2025-09-04 12:41:29 +03:00
Botond Dénes	db72430d82	Merge 'Don't leave pre-scrub snapshot on API error' from Pavel Emelyanov The pre-scrub snapshot is taken in the middle of parsing options from the request. In case the post-snapshot part of the parsing throws (it can do so if the "quarantine_mode" value is not recognized), the snapshot remains on disk, but the API call fails. The fix is to move snapshot taking out of the parse_scrub_options() helper. It could be moved to the end of it, but the helper name doesn't tell that it also takes a snapshot, so no. After the fix the helper in question can be simplified further. The issue exists in older versions, but likely doesn't reveal itself for real, so it doesn't look worthwhile to backport it. Closes scylladb/scylladb#25824 * github.com:scylladb/scylladb: api: Simplify parse_scrub_options() helper api: Take snapshot after parsing scrub options	2025-09-04 12:13:16 +03:00
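The ordering fix from the merge above can be sketched as follows. This is an illustration under assumptions, not the API handler: the request is modelled as a plain dict, `handle_scrub` and `take_snapshot` are hypothetical, and the set of accepted `quarantine_mode` values is assumed for the example.

```python
# Assumed set of accepted values, for illustration only.
VALID_QUARANTINE_MODES = {"include", "exclude", "only"}


def parse_scrub_options(request):
    """Pure parsing: validates the request and has no side effects."""
    mode = request.get("quarantine_mode", "include")
    if mode not in VALID_QUARANTINE_MODES:
        raise ValueError(f"unknown quarantine_mode: {mode}")
    return {"quarantine_mode": mode}


def handle_scrub(request, take_snapshot):
    # Parse first: if this raises, nothing has been written to disk yet.
    options = parse_scrub_options(request)
    # Only a fully validated request gets a pre-scrub snapshot.
    take_snapshot()
    return options


snapshots = []
try:
    handle_scrub({"quarantine_mode": "bogus"}, lambda: snapshots.append("pre-scrub"))
except ValueError as err:
    print("rejected:", err)
print("snapshots taken:", len(snapshots))
```

Keeping the parser free of side effects is also what lets the helper drop its snapshot-related argument, as the follow-up commit in the series describes.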
Avi Kivity	169092b340	Merge 'pgo: add auth connections stress workload' from Marcin Maliszkiewicz This series improves the pgo workloads by enabling authentication and authorization and adding new stress scenarios. - Enables auth in training clusters All training workloads now run with auth enabled, following best practices and avoiding config proliferation. - Adds auth connections stress workload Introduces a workload that uses derived roles and permissions, stressing auth code paths while also creating a new connection per request to exercise server transport handling. - Enables counters workload The counters workload is re-enabled without introducing extra dependencies on cqlsh. Instead, a lightweight exec_cql.py wrapper (shared with the auth workload) handles preparation statements. Backport: no, it's not a bug fix ---------------------------------------------------------- Performance results for auth PGO: there seems to be no difference, or it is too small to measure: scylladb pgo_auth ≡ ◦ ⤖ python3 ./pgo/auth_conns_stress.py localhost cassandra cassandra 10000 100 & scylladb pgo_auth ≡ ◦ ⤖ perf stat -e instructions --timeout 5000 -p 51591 on both before and after instructions counter varies from 179,818,558,011 to 180,664,528,198. 
---------------------------------------------------------- Performance results for counters PGO is notably improved with write workload 16-22% and read 4-5%: scylladb pgo_auth ≡ ◦ ⤖ ./scylla_master perf-simple-query 2> /dev/null --counters --write random-seed=3839439576 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=yes} Disabling auto compaction 2413435.37 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 51167 insns/op, 33157 cycles/op, 0 errors) 2413009.40 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 51053 insns/op, 33009 cycles/op, 0 errors) 2403794.31 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 50867 insns/op, 32899 cycles/op, 0 errors) 2384572.52 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 50562 insns/op, 32811 cycles/op, 0 errors) 2195388.31 tps (122.6 allocs/op, 8.0 logallocs/op, 14.1 tasks/op, 51818 insns/op, 34504 cycles/op, 0 errors) throughput: mean= 2362039.98 standard-deviation=93892.61 median= 2403794.31 median-absolute-deviation=50969.42 maximum=2413435.37 minimum=2195388.31 instructions_per_op: mean= 51093.44 standard-deviation=465.18 median= 51052.98 median-absolute-deviation=226.61 maximum=51818.04 minimum=50562.30 cpu_cycles_per_op: mean= 33275.85 standard-deviation=698.65 median= 33008.58 median-absolute-deviation=377.16 maximum=34504.13 minimum=32811.18 scylladb pgo_auth ≡ ◦ ⤖ ./scylla_master perf-simple-query 2> /dev/null --counters random-seed=1134551638 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=yes} Disabling auto compaction Creating 10000 partitions... 
5499534.56 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 21463 insns/op, 14902 cycles/op, 0 errors) 5478913.87 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 21385 insns/op, 14839 cycles/op, 0 errors) 5346525.04 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.1 tasks/op, 21454 insns/op, 15082 cycles/op, 0 errors) 5467947.74 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 21275 insns/op, 14775 cycles/op, 0 errors) 5454894.98 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 21250 insns/op, 14766 cycles/op, 0 errors) throughput: mean= 5449563.24 standard-deviation=59878.80 median= 5467947.74 median-absolute-deviation=29350.63 maximum=5499534.56 minimum=5346525.04 instructions_per_op: mean= 21365.28 standard-deviation=98.95 median= 21384.65 median-absolute-deviation=90.57 maximum=21463.17 minimum=21250.33 cpu_cycles_per_op: mean= 14872.93 standard-deviation=129.26 median= 14838.65 median-absolute-deviation=97.52 maximum=15082.44 minimum=14766.13 scylladb pgo_auth ≡ ◦ ⤖ ./scylla_pgo_counters perf-simple-query 2> /dev/null --counters --write random-seed=437758611 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=yes} Disabling auto compaction 2950968.10 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 41540 insns/op, 27097 cycles/op, 0 errors) 2923325.10 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 41366 insns/op, 27017 cycles/op, 0 errors) 2928666.67 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 41274 insns/op, 26929 cycles/op, 0 errors) 2918378.39 tps (122.1 allocs/op, 8.0 logallocs/op, 14.0 tasks/op, 41165 insns/op, 26880 cycles/op, 0 errors) 2209053.17 tps (128.4 allocs/op, 8.0 logallocs/op, 14.6 tasks/op, 48176 insns/op, 34726 cycles/op, 0 errors) throughput: mean= 2786078.28 standard-deviation=322807.25 median= 2923325.10 median-absolute-deviation=142588.38 maximum=2950968.10 minimum=2209053.17 instructions_per_op: mean= 42704.41 
standard-deviation=3062.05 median= 41366.40 median-absolute-deviation=1430.69 maximum=48176.45 minimum=41165.23 cpu_cycles_per_op: mean= 28529.92 standard-deviation=3464.99 median= 27016.51 median-absolute-deviation=1601.02 maximum=34726.49 minimum=26880.18 scylladb pgo_auth ≡ ◦ ⤖ ./scylla_pgo_counters 2> /dev/null --counters random-seed=4277130772 enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=yes} Disabling auto compaction Creating 10000 partitions... 5691320.62 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 20656 insns/op, 14279 cycles/op, 0 errors) 5708878.25 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 20486 insns/op, 14104 cycles/op, 0 errors) 5727060.22 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 20439 insns/op, 14044 cycles/op, 0 errors) 5700157.92 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 20416 insns/op, 14054 cycles/op, 0 errors) 5610730.84 tps ( 73.0 allocs/op, 0.0 logallocs/op, 18.0 tasks/op, 20459 insns/op, 14195 cycles/op, 0 errors) throughput: mean= 5687629.57 standard-deviation=44972.99 median= 5700157.92 median-absolute-deviation=21248.68 maximum=5727060.22 minimum=5610730.84 instructions_per_op: mean= 20491.27 standard-deviation=95.74 median= 20459.35 median-absolute-deviation=52.13 maximum=20656.27 minimum=20415.95 cpu_cycles_per_op: mean= 14135.25 standard-deviation=100.27 median= 14104.47 median-absolute-deviation=81.48 maximum=14278.97 minimum=14043.76 Closes scylladb/scylladb#25651 * github.com:scylladb/scylladb: pgo: add links to issues about tablet missing features pgo: enable counters workload pgo: add auth connections stress workload pgo: enable auth in training clusters	2025-09-04 11:46:39 +03:00
Pavel Emelyanov	b86b4fc251	api: Simplify parse_scrub_options() helper It no longer needs to be a coroutine, nor does it need the snapshot_ctl reference argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-03 19:06:31 +03:00
Pavel Emelyanov	ee4197fa80	api: Take snapshot after parsing scrub options Parsing scrub options may throw after a snapshot is taken, thus leaving it on disk even though the operation is reported as "failed". Probably not critical, but not nice either. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-09-03 19:05:50 +03:00
Marcin Maliszkiewicz	2109110037	pgo: add links to issues about tablet missing features	2025-09-03 15:43:52 +02:00
Marcin Maliszkiewicz	8aa2825caa	pgo: enable counters workload It was not enabled due to some cqlsh dependency missing. After 3 years it's hard to say if the thing is fixed or not, but anyway we don't need another big dependency while we already have the Python driver used extensively in tests. We use a simple wrapper file, exec_cql.py, shared with the auth_conns workload, to conveniently read the needed preparation statements from the file. Additionally, we switch tablets off as counters don't support them yet.	2025-09-03 15:43:51 +02:00
Marcin Maliszkiewicz	09476a4df8	pgo: add auth connections stress workload It uses some derived roles and permissions to exercise auth code paths and also creates new connection with each stress request to exercise also transport/server.cc connection handling code.	2025-09-03 15:43:51 +02:00
Marcin Maliszkiewicz	f2270034ec	pgo: enable auth in training clusters As it's best practice to use auth and we don't want to have 2^n configs to train we just enable auth for every workload.	2025-09-03 15:29:27 +02:00
Dawid Mędrek	d2c5268196	cql3: Produce CREATE MATERIALIZED VIEW statement when describing MV of index Before this change, executing `DESCRIBE MATERIALIZED VIEW` on the underlying materialized view of a secondary index would produce a `CREATE INDEX` statement. It was not only confusing, but it also prevented from learning about the definition of the view. The only way to do so was to query system tables. We change that behavior and produce a `CREATE MATERIALIZED VIEW` statement instead. The statement is printed as a comment to implicitly convey that the user should not attempt to execute it to restore the view. A short comment is provided to make it clearer. Before this commit: ``` cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int); cqlsh> CREATE INDEX i ON ks.t(v); cqlsh> DESCRIBE MATERIALIZED VIEW ks.i; CREATE INDEX i ON ks.t(v); ``` After this commit: ``` cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int); cqlsh> CREATE INDEX i ON ks.t(v); cqlsh> DESCRIBE MATERIALIZED VIEW ks.i; /* Do NOT execute this statement! It's only for informational purposes. This materialized view is the underlying materialized view of a secondary index. It can be restored via restoring the index. 
CREATE MATERIALIZED VIEW ks.i_index [...]; */ ``` Note that describing the base table has not been affected and still works as follows: ``` cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int); cqlsh> CREATE INDEX i ON ks.t(v); cqlsh> DESCRIBE TABLE ks.t; CREATE TABLE ks.t ( p int, v int, PRIMARY KEY (p) ) WITH bloom_filter_fp_chance = 0.01 AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'} AND comment = '' AND compaction = {'class': 'IncrementalCompactionStrategy'} AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'} AND crc_check_chance = 1 AND default_time_to_live = 0 AND gc_grace_seconds = 864000 AND max_index_interval = 2048 AND memtable_flush_period_in_ms = 0 AND min_index_interval = 128 AND speculative_retry = '99.0PERCENTILE' AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'}; CREATE INDEX i ON ks.t(v); ``` We also provide two reproducers of scylladb/scylladb#24610. Fixes scylladb/scylladb#24610 Closes scylladb/scylladb#25697	2025-09-03 15:21:37 +02:00
Piotr Dulikowski	f95808cbe7	Merge 'cdc/generation: Clone `topology_description` asynchronously' from Dawid Mędrek An instance of `cdc::topology_description` can be quite big. The vector it consists of stores as many `token_range_description`s as there are vnodes, and the size of each `token_range_description` is O(#shards). Because of that, copying an instance of the type can lead to reactor stalls. To prevent that, we introduce an asynchronous function copying the contents of the object. Reactor stalls were detected in the call to `map_reduce` in `generation_service::legacy_do_handle_cdc_generation`, so let's start using the new function there. A similar scenario occurs in `generation_service::handle_cdc_generation`, so we modify it too. Unfortunately, it doesn't seem viable to provide a reproducer of said problem. Fixes scylladb/scylladb#24522 Backport: none. Reactor stalls are not critical. Closes scylladb/scylladb#25730 * github.com:scylladb/scylladb: cdc/generation: Delete copy constructors of topology_description cdc/generation: Clone topology_description asynchronously	2025-09-03 13:41:58 +02:00
Dawid Pawlik	5e72d71188	cqlpy/test_permissions: run the reproducer tests for #19798 Since the previous commit fixes the issue, we can remove the xfail mark. The tests should pass now.	2025-09-03 13:20:39 +02:00
Dawid Pawlik	be54346846	select_statement: check for access to CDC base table Before the patch, a user with CREATE access could create a table with CDC or alter the table enabling CDC, but could not run a SELECT on the CDC table they created. This was because the SELECT permission was checked on the CDC log, and later on its "parent" - the keyspace, but not on the base table, on which the user had SELECT permission automatically granted on CREATE. This patch matches the behaviour of querying the CDC log to the one implemented for Materialized Views: 1. No new permissions are granted on CREATE. 2. When querying SELECT, the permissions on base table SELECT are checked. Fixes: #19798	2025-09-03 13:20:39 +02:00
Botond Dénes	6116f9e11b	Merge 'Compaction tasks progress' from Aleksandra Martyniuk Determine the progress of compaction tasks that have children. The progress of a compaction task is calculated using the default get_progress method. If the expected_total_workload method is implemented, the default progress is computed as: (sum of child task progresses) / (expected total workload) If expected_total_workload is not defined, progress is estimated based on children progresses. However, in this case, the total progress may increase over time as the task executes. All compaction tasks, except for reshape tasks, implement the expected_children_number method. To compute expected_total_workload, iterate over all SSTables covered by the task and sum their sizes. Note that expected_total_workload is just an approximation and the real workload may differ if SStables set for the keyspace/table/compaction group changes. Reshape tasks are an exception, as their scope is determined during execution. Hence, for these tasks expected_total_workload isn't defined and their progress (both total and completed) is determined based on currently created children. Fixes: https://github.com/scylladb/scylladb/issues/8392. Fixes: https://github.com/scylladb/scylladb/issues/6406. Fixes: https://github.com/scylladb/scylladb/issues/7845. 
New feature, no backport needed Closes scylladb/scylladb#15158 * github.com:scylladb/scylladb: test: add compaction task progress test compaction: set progress unit for compaction tasks compaction: find expected workload for reshard tasks compaction: find expected workload for global cleanup compaction tasks compaction: find expected workload for global major compaction tasks compaction: find expected workload for keyspace compaction tasks compaction: find expected workload for shard compaction tasks compaction: find expected workload for table compaction tasks compaction: return empty progress when compaction_size isn't set compaction: update compaction_data::compaction_size at once tasks: do not check expected workload for done task	2025-09-03 13:23:42 +03:00
Pavel Emelyanov	b0aa2d61d9	Merge 'cql3: add default replication factor to `create_keyspace_statement`' from Dario Mirovic When creating a new keyspace, replication factor must be stated. For example: `CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy', 'replication_factor': 3 };` This patch changes it in the following way - if there is no replication factor specified, use default replication factor. Default replication factor is equal to the number of racks that are not arbiter-only, i.e. racks that have at least one non-arbiter node. The following syntax is now valid: `CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy' };` `CREATE KEYSPACE ks WITH REPLICATION { };` Fixes #16028 Backport is not needed. This is an enhancement for future releases. Closes scylladb/scylladb#25570 * github.com:scylladb/scylladb: docs/cql: update documentation for default replication factor test/cqlpy: add keyspace creation default replication factor tests cql3: add default replication factor to `create_keyspace_statement`	2025-09-03 12:31:53 +03:00
Pavel Emelyanov	c0808c90b0	api: Use validate_table() helper in /storage_service/tokens_endpoint handler The handler validates if the given ks:cf pair exists in the database, then finds the table id to process further. There's a helper that does both. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25669	2025-09-03 11:44:50 +03:00
Pavel Emelyanov	b5610050a1	api: Make GET/storage_service/drain handler work on storage service POSTing on the same URL launches storage_service::drain(), so GETing on it should (not that it's restricted somehow, but still) work on the same service. This change removes one more user of http_context::database, which in turn will allow removing the database reference from the context eventually. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25677	2025-09-03 11:40:39 +03:00
Radosław Cybulski	7b3d42f83e	Remove unused boost macro definitions Closes scylladb/scylladb#25742	2025-09-03 10:06:33 +03:00
Radosław Cybulski	c242234552	Revert "build: add precompiled headers to CMakeLists.txt" This reverts commit `01bb7b629a`. Closes scylladb/scylladb#25735	2025-09-03 09:46:00 +03:00
Calle Wilund	bc20861afb	system_keyspace: Prune dropped tables from truncation on start/drop Fixes #25683 Once a table drop is complete, there should be no reason to retain truncation records for it, as any replay should skip mutations anyway (no CF), and iff we somehow resurrect a dropped table, this replay-resurrected data is the least problem anyway. Adds a prune phase to the startup drop_truncation_rp_records run, which ignores updating, and instead deletes records for non-existent tables (which should patch any existing servers with lingering data as well). Also does an explicit delete of records on actual table DROP, to ensure we don't grow this table more than needed even in long uptime nodes. Small unit test included. Closes scylladb/scylladb#25699	2025-09-03 07:25:34 +03:00
Sergey Zolotukhin	13392a40d4	gossiper: check for a race condition in `do_apply_state_locally` In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked. During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash. This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map. Fixes scylladb/scylladb#25702 Fixes scylladb/scylladb#25621 Ref scylladb/scylla-enterprise#5613 Closes scylladb/scylladb#25727	2025-09-02 20:44:21 +02:00
Piotr Dulikowski	78ef334333	Merge 'Move "cache" API endpoints registration closer to column_family ones ' from Pavel Emelyanov These two "blocks" of endpoints have different URL prefixes, but work with the same "service", which is sharded<replica::database>. The latter block had already been fixed to carry the sharded<database>& around (#25467), now it's the "cache" turn. However, since these endpoints also work with the database, there's no need in dedicated top-level set/unset machinery (similarly, gossiper has two API set/unset blocks that come together, see #19425), it's enough to just set/unset them next to each other. Ongoing http_context dependency cleanup, no need to backport Closes scylladb/scylladb#25674 * github.com:scylladb/scylladb: api: Capture and use db in cache_service handlers api: Add sharded<database>& arg to set_cache_service() api: Squash (un)set_cache_service into ..._column_family api: Coroutinize set_server_column_family()	2025-09-02 13:59:02 +02:00
Avi Kivity	7ed261fc52	Merge 'Inital GCP object storage support' from Calle Wilund Adds infrastructure and client for interaction with GCP object storage services. Note: this is just a client object usable for creating, listing, deleting and up/downloading of objects to/from said storage service. It makes no attempt at actually inserting it into the sstable storage flow. That can come later. This PR breaks out GCP auth and some general REST call functionality into shared routines. Not all code is 100% reused, but at least some. Test is added, though could be more comprehensive (feel free to suggest a test vector). Test can run in either local mock server mode (default), or against actual GCP. See `test/boost/gcp_object_storage_test.cc` for explanation on the config environment vars. Default is to run the test against a temporary docker daemon. Closes scylladb/scylladb#24629 * github.com:scylladb/scylladb: test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage proc-utils: Re-export waiting types from seastar proc-utils: Inherit environment from current process utils::gcp::object_storage: Add client for GCP object storage utils::http: Add optional external credentials to dns_connection_factory init utils::rest: Break out request wrapper and send logic encryption::gcp_host: Use shared gcp credentials + REST helpers utils::gcp: Move/add gcp credentials management to shared file utils::rest::client: Add formatter for seastar::http::reply utils::rest::client: Add helper routines for simple REST calls utils::http: Make shared system trust certificates public	2025-09-02 14:38:09 +03:00
Avi Kivity	fe308de8df	Merge 'treewide: Add missing `#pragma once`' from Ernest Zaslavsky Add missing #pragma once and license boilerplate to include headers. Consider adding a CI step to catch missing header guards early. It can be done easily by running `cpplint` like below ``` find . -path ./seastar -prune -o -path ./venv -prune -o -path ./idl -prune -o -type f \( -name "*.h" -o -name "*.hh" -o -name "*.hpp" \) -print0 | xargs -0 cpplint 2>&1 | grep "header guard found" ``` No backport is needed, the change is not "functional" Closes scylladb/scylladb#25768 * github.com:scylladb/scylladb: treewide: Add missing license boilerplate treewide: Add missing `#pragma once`	2025-09-02 13:18:04 +03:00
Piotr Dulikowski	762d9ef68f	Merge 'cdc: Set tombstone_gc when creating log table' from Dawid Mędrek Normally, when we create a table, MV, etc., we apply `cf_prop_defs` to the schema builder via the function `cf_prop_defs::apply_to_builder`. Unfortunately, that didn't happen when creating CDC log tables, and so we might have missed some of the properties that would normally be set to some value, even if the default one. One particular example of that phenomenon was `tombstone_gc`. For better or worse, it's not a "standalone property" of a table, but rather part of `extensions`. [Somewhat related issue: scylladb/scylladb#9722] That may have and did cause trouble. Consider this scenario: 1. A CDC log table is created. 2. The table does NOT have any value of `tombstone_gc` set. 3. The user edits the table via `ALTER TABLE`. That statement treats the log table just like any other one (at least as far as the relevant portion of the logic is concerned). Among other things, it uses `cf_prop_defs::apply_to_builder`, and as a result, the `tombstone_gc` property is set to some value: * the default one if the user doesn't specify it in the statement, * a custom one if they do. Why is that a problem? First of all, it's confusing. When we perform a schema backup and a table uses CDC, we include an ALTER statement for its corresponding CDC log table (for more context, see issue scylladb/scylladb#18467 or commit scylladb/scylladb@f12edbdd95). There are two consequences for the user here: 1. If the log table had NOT been altered ever since it was created, the statement will miss the `tombstone_gc` property as if it couldn't be set for it at all. That's confusing! 2. If the log table HAD in fact been altered after its creation, the statement will include the `tombstone_gc` property. That's even more confusing (why was it not present the first time, but it is now?). 
The `tombstone_gc` property should always be set to avoid confusion and problematic edge cases in tests and to simply be consistent with how other schema entities work. The solution we employ is that we always set the property to the default value. That includes the case when we reattach the log table to the base; consider the following scenario: 1. Create a table with CDC enabled. 2. Detach the log table by performing `ALTER TABLE ... WITH cdc = {'enabled': false}`. 3. Change the `tombstone_gc` property of the log table. 4. Reattach the log table to the base in the same way as in step 2. The expected result would be that the new value of `tombstone_gc` would be preserved after reattaching the log table. However, that's not what will happen. We decide to stay consistent with how other properties of a log table behave, and we reset them after every reattachment. We might change that in the future: see issue scylladb/scylladb#25523. Two reproducer tests of scylladb/scylladb#25187 are included in the changes. Backport: The problem is not critical, so it may not be necessary to backport the changes. That's to be discussed. Closes scylladb/scylladb#25521 * github.com:scylladb/scylladb: cdc: Set tombstone_gc when creating log table tombstone_gc: Add overload of get_default_tombstone_gc_mode tombstone_gc: Rename get_default_tombstonesonte_gc_mode	2025-09-02 10:20:11 +02:00
Tomasz Grabiec	a7f10b585e	Merge 'drop table: fix crash on drop table with concurrent cleanup' from Ferenc Szili Consider the following scenario: - A tablet is migrated away from a shard - The tablet cleanup stage closes the storage group's async_gate - A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate - Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash Fixes: #25706 This needs to be backported to all supported versions with tablets Closes scylladb/scylladb#25708 * github.com:scylladb/scylladb: test: reproducer and test for drop with concurrent cleanup truncate: check for closed storage group's gate in discard_sstables	2025-09-02 00:02:14 +02:00
Calle Wilund	21adfd8a60	test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage Allows testing using either local mock server (installed or using docker), or real GCP project (not tested as of writing this). v2: Try podman if docker unavailable v3: Ensure we check log output on fake-gcs, because when using podman, the published port will be connectible even though the actual server is not up yet. v4: Use ephemeral port forward in docker/podman to allow us running parallel instances. Also adjust credentials and port finding in test. v5: Re-ensure no parallel tests for this: We seem to time out in podman trying to fetch image for X parallel tests v6: Remove the ephemeral port stuff. Because of course this does not work with our podman-in-podman. Do brute-force port speculation instead. v7: Up timeout for server start to allow docker pull. v8: Fix string check error v9: Add explicit docker image version	2025-09-01 18:14:20 +00:00
Calle Wilund	5ead6ec420	proc-utils: Re-export waiting types from seastar Just to make directly accessible from wrapper type	2025-09-01 18:03:44 +00:00
Calle Wilund	8169327553	proc-utils: Inherit environment from current process In most cases, when launching a process from tests, we will want to inherit our own env. Add option (default true) to do so.	2025-09-01 18:03:44 +00:00
Calle Wilund	4a5b547a86	utils::gcp::object_storage: Add client for GCP object storage Adds a minimal client for GCP object storage operations: * Create buckets * Delete buckets * List bucket content * Copy/move bucket content * Delete bucket content * Upload bucket content * Download bucket content	2025-09-01 18:03:44 +00:00
Calle Wilund	8f54b709ce	utils::http: Add optional external credentials to dns_connection_factory init Also allow creating the object using an endpoint expression. Note: this moves code to the .cc file, because it introduces a few more lines, and I feel we have too much stuff in headers as is.	2025-09-01 18:03:44 +00:00
Calle Wilund	0e9e1f7738	utils::rest: Break out request wrapper and send logic Allows sharing some of the wrapping and logic outside the single-call object/routine paths, using it also with an external seastar::http::client, i.e. caching resources across several calls.	2025-09-01 18:03:44 +00:00
Calle Wilund	fe4ab7f7bf	encryption::gcp_host: Use shared gcp credentials + REST helpers Removes code in favour of transplanted shared util code.	2025-09-01 18:03:44 +00:00
Calle Wilund	2b7ad605b3	utils::gcp: Move/add gcp credentials management to shared file Copied from encryption::gcp_host. Light-weight impl of gcp credentials management.	2025-09-01 18:03:44 +00:00
Calle Wilund	f6d7c7e300	utils::rest::client: Add formatter for seastar::http::reply	2025-09-01 18:03:44 +00:00
Calle Wilund	cc1e659abd	utils::rest::client: Add helper routines for simple REST calls Packing headers and unpacking response to json. Usable for esp. gcp interaction.	2025-09-01 18:03:43 +00:00
Calle Wilund	886fcf1759	utils::http: Make shared system trust certificates public So other clients/factories can share.	2025-09-01 18:03:43 +00:00
Karol Nowacki	3086d15999	cql3: Fix crash on ANN OF query when TRACING ON is enabled Executing a vector search (SELECT with ANN OF ordering) query with `TRACING ON` enabled caused a node to crash due to a null pointer dereference. This occurred because a vector index does not have an associated view table, making its `_view_schema` member null. The implementation attempted to enable tracing on this null view schema, leading to the crash. The fix adds a null check for `_view_schema` before attempting to enable tracing on the view (index) table. A regression test is included to prevent this from happening again. Fixes: VECTOR-179 Closes scylladb/scylladb#25500	2025-09-01 17:26:54 +03:00
Avi Kivity	41880bc893	cql3: statement_restrictions: forbid querying a single-column inequality restriction on a multi-column restriction CQL supports multi-column inequality restrictions in the form (ck1, ck2, ck3) >= (:v1, :v2, :v3) This restriction shape is only allowed on clustering keys, and is translated into a partition_slice allowing the primary index to efficiently select the part of the partition that satisfies the restriction. The possible_lhs_values() function allows extracting single-column restrictions from this and similar tuple restrictions. For example, the multi-column restriction (ck1, ck2, ck3) = (:v1, :v2, :v3) implies that ck2 = :v2. If we have an index on ck2, and if we don't further have a restriction on the partition key, then it is advantageous to use the index to select rows, and then filter on ck1 and ck3 to satisfy the full restriction. For the inequality restriction, we can only infer a restriction on the first column due to lexicographical comparison. We can see that, given (ck1, ck2, ck3) >= (:v1, :v2, :v3) then ck1 >= :v1 ck2 = unbounded ck3 = unbounded and possible_lhs_values() indeed computes this. However, this is never used in practice, and it makes further refactoring difficult. If we want to convert a boolean factor of the where clause to a predicate on a column or tuple of columns, we cannot do so because we can actually generate two predicates: one on the tuple and one on the first column. Since it's not used, remove it. This code was first introduced in `d33053b841` ("cql3/restrictions: Add free functions over new classes") (search for "if (column_index_on_lhs > 0) {"). It does not directly correspond to pre-expression code. Closes scylladb/scylladb#25757	2025-09-01 17:21:26 +03:00
Artsiom Mishuta	5910ad3c6d	test.py: apply the nightly label on test_topology_recovery_basic This test is for the old gossip-based recovery procedure, which is an almost obsolete feature that won't change anymore. Closes scylladb/scylladb#25694	2025-09-01 14:16:29 +02:00
Emil Maskovsky	5dac4b38fb	test/gossiper: add reproducible test for race condition during node decommission This change introduces a targeted test that simulates the gossiper race condition observed during node decommissioning. The test delays gossip state application and host ID lookup to reliably reproduce the scenario where `gossiper::get_host_id()` is called on a removed endpoint, potentially triggering an abort in `apply_new_states`. There is a specific error injection added to widen the race window, in order to increase the likelihood of hitting the race condition. The error injection is designed to delay the application of gossip state updates, for the specific node that is being decommissioned. This should then result in the server abort in the gossiper. Refs: scylladb/scylladb#25621 Fixes: scylladb/scylladb#25721 Backport: The test is primarily for an issue found in 2025.1, so it needs to be backported to all the 2025.x branches. Closes scylladb/scylladb#25685	2025-09-01 13:59:47 +02:00
Ernest Zaslavsky	0e4292adb4	treewide: Add missing license boilerplate Add missing license boilerplate to include headers	2025-09-01 14:58:32 +03:00
Ernest Zaslavsky	19345e539f	treewide: Add missing `#pragma once` Add missing `#pragma once` to include headers	2025-09-01 14:58:21 +03:00
Petr Gusev	2e757d6de4	cas: pass timeout_if_partially_accepted := write to accept_proposal() Write requests cannot be safely retried if some replicas respond with accepts and others with rejects. In this case, the coordinator is uncertain about the outcome of the LWT: a subsequent LWT may either complete the Paxos round (if a quorum observed the accept) or overwrite it (if a quorum did not). If the original LWT was actually completed by later rounds and the coordinator retried it, the write could be applied twice, potentially overwriting effects of other LWTs that slipped in between. Read requests do not have this problem, so they can be safely retried. Before this commit, handler->accept_proposal was called with timeout_if_partially_accepted := true. This caused both read and write requests to throw an "uncertainty" timeout to the user in the case of the contention described above. After this commit, we throw an "uncertainty" timeout only for write requests, while read requests are instead retried in the loop in sp::cas. Closes scylladb/scylladb#25602	2025-09-01 14:31:04 +03:00
Pavel Emelyanov	840cdab627	api: Move /load and /metrics/load handlers code to column_family.cc Both handlers need database to proceed and thus need to be registered (and unregistered) in a group that captures database for its handlers. Once moved, the used get_cf_stats() method can be marked local to column_family.cc file. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#25671	2025-09-01 08:11:00 +02:00
Dawid Mędrek	fc50e9d0a4	test/perf: Require smp=1 in perf_cache_eviction Trying to run the test with more than one shard results in a failure when generating sharding metadata: ``` ERROR 2025-08-27 16:00:17,551 [shard 0:main] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /tmp/scylla-c9fa42fe/ks/cf-2938a030834e11f0a561ffa33feb022d/me-3gt6_12wh_1gifk2ijgeu1ovc1m5-big-Data.db). Aborting ``` Let's require that the test be run with a single shard. Closes scylladb/scylladb#25703	2025-09-01 08:59:35 +03:00
Nadav Har'El	6d1abc5b2c	utils/base64: fix misleading code and comment (no functional change) utils/base64.cc had some strange code with a strange comment in base64_begins_with(). The code had base.substr(operand.size() - 4, operand.size()) The comment claims that this is "last 4 bytes of base64-encoded string", but this comment is misleading - operand is typically shorter than base (this is the whole point of base64_begins_with()), so the real intention of the code is not to find the last 4 bytes of base, but rather the next four bytes after the (operand.size() - 4) which we already copied. These four bytes may need the full power of base64_decode_string() because they may or may not contain padding. But, if we really want the next 4 bytes, why pass operand.size() as the length of the substring? operand.size() is at least 4 (it's a multiple of 4, and if it's 0 we returned earlier), but it could be more. We don't need more, we just need 4. It's not really wrong to take more than 4 (so this patch doesn't fix any bug), but it can be wasteful. So this code should be: base.substr(operand.size() - 4, 4) We already have in test/boost/alternator_unit_test.cc a test, test_base64_begins_with, that takes encoded base64 strings up to 12 characters in length (corresponding to decoded strings up to 8 chars), and substrings from length 0 to the base string's length, and checks that test_base64_begins_with succeeds. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25712	2025-09-01 08:57:50 +03:00
Andrei Chekun	e55c8a9936	test.py: modify run to use different junit output filenames Currently, run executes pytest twice without modifying the path of the JUnit XML report. As a result, the second execution of pytest overwrites the report. This PR fixes the issue so that both reports are stored. Closes scylladb/scylladb#25726	2025-09-01 08:56:48 +03:00
Ernest Zaslavsky	05154e131a	cleanup: Add missing `#pragma once` Add missing `#pragma once` to include header Closes scylladb/scylladb#25761	2025-09-01 06:41:57 +03:00
Botond Dénes	fbff8d3b2d	Merge 'vector_store_client: disable Nagle's algorithm on the http client' from Pawel Pery Nagle’s algorithm and Delayed ACK’s algorithm are enabled by default on sockets in Linux. As a result we can experience 40ms latency on simply waiting for ACK on the client side. Disabling the Nagle’s algorithm (using TCP_NODELAY) should fix the issue (client won’t wait 40ms for ACKs). This change sets `TCP_NODELAY` on every socket created by the `http_client`. Checking for dead peers or network is helpful in maintaining a lifetime of the http client. This change also sets TCP_KEEPALIVE option on the http client's socket. Fixes: VECTOR-169 Closes scylladb/scylladb#25401 * github.com:scylladb/scylladb: vector_store_client: set keepalive for the http client's socket vector_store_client: disable Nagle's algorithm on the http client	2025-09-01 06:26:06 +03:00
Jenkins Promoter	619b4102bd	Update pgo profiles - x86_64	2025-09-01 05:08:56 +03:00
Jenkins Promoter	783f866bd3	Update pgo profiles - aarch64	2025-09-01 05:05:17 +03:00
Dario Mirovic	8e994b3890	test/cqlpy: add protocol exception tests Add protocol exception tests that check errors and exceptions. `test_process_startup_invalid_string_map`: `STARTUP` (0x01) with declared map count, but missing entries - `read_string_map` out-of-range. `test_process_query_internal_malformed_query`: `QUERY` (0x07) long string declared larger than available bytes - `read_long_string_view`. `test_process_query_internal_fail_read_options`: `QUERY` (0x07) with `PAGE_SIZE` flag, but truncated page_size - `read_options` path. `test_process_prepare_malformed_query`: `PREPARE` (0x09) long string declared larger than available bytes - `read_long_string_view` in prepare. `test_process_execute_internal_malformed_cache_key`: `EXECUTE` (0x0A) cache key short bytes declared larger than provided bytes - `read_short_bytes`. `test_process_register_malformed_string_list`: `REGISTER` (0x0B) string list with truncated element - `read_string_list`/`read_string`. Each test asserts an `ERROR` frame is returned and `protocol_error` metrics increase, without causing C++ exceptions. Refs: #24567	2025-08-31 23:40:03 +02:00
Dario Mirovic	84e6979adf	test/cqlpy: `test_protocol_exceptions.py` refactor message frame building Frame building is repetitive and increases verbosity, reducing code readability. This patch solves it by extracting common functionality of frame building into `_build_frame`. Also, helpers `_send_frame` and `_recv_frame` are introduced. While `_recv_frame` is not really useful, it goes well in pair with `_send_frame`. Refs: #24567	2025-08-31 23:40:01 +02:00
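A malformed frame of the kind these tests send can be sketched as follows. This is not the test file's `_build_frame`; the `build_frame` and `malformed_query_body` helpers are hypothetical, and the 9-byte header layout (version, flags, stream, opcode, body length) is assumed from the CQL binary protocol v4.

```python
import struct

OPCODE_QUERY = 0x07  # opcode quoted in the commit message above

def build_frame(opcode, body, stream=0, version=0x04, flags=0x00):
    """Hypothetical helper: prepend a CQL v4 request header to a body."""
    # version(1) flags(1) stream(2, signed) opcode(1) length(4), big-endian
    header = struct.pack(">BBhBI", version, flags, stream, opcode, len(body))
    return header + body

def malformed_query_body(declared_len, actual):
    """A [long string] whose declared length exceeds the bytes that follow."""
    return struct.pack(">i", declared_len) + actual

# Declare a 100-byte query string but provide only 6 bytes.
body = malformed_query_body(100, b"SELECT")
frame = build_frame(OPCODE_QUERY, body)
```

A server that parses this body should notice that the declared length runs past the end of the frame and answer with an `ERROR` frame.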
Dario Mirovic	19c610d9f7	test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code The code that measures errors and exceptions in `test_protocol_exceptions.py` tests is repetitive. This patch refactors common functionality in a separate `_test_impl` function, improving readability. Refs: #24567	2025-08-31 23:39:58 +02:00
Avi Kivity	dfc7957a73	Merge 'test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints' from Benny Halevy Following up on `6129411a5e`, improve test_vnode_keyspace_describe_ring by verifying that the endpoints listed by describe_ring match those returned by the `natural_endpoints` api (for random tokens). The latter are calculated using an independent code path directly from the effective_replication_map. * test exists currently only on master, no backport required Closes scylladb/scylladb#25610 * github.com:scylladb/scylladb: test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints test/pylib/rest_client: add natural_endpoints function	2025-08-31 20:36:15 +03:00
Avi Kivity	bae66cc0d8	Merge 'types: add byte-comparable format support for collections' from Lakshmi Narayanan Sreethar This PR builds on the byte comparable support introduced in #23541 to add byte comparable support for all the collection types. This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md Refs https://github.com/scylladb/scylladb/issues/19407 New feature - backport not required. Closes scylladb/scylladb#25603 * github.com:scylladb/scylladb: types/comparable_bytes: add compatibility testcases for collection types types/comparable_bytes: update compatibility testcase to support collection types types/comparable_bytes: support empty type types/comparable_bytes: support reversed types types/comparable_bytes: support vector cql3 type types/comparable_bytes: support tuple and UDT cql3 type types/comparable_bytes: support map cql3 type types/comparable_bytes: support set and list cql3 types types/comparable_bytes: introduce encode/decode_component types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes	2025-08-31 15:53:27 +03:00
Avi Kivity	600349e29a	Merge 'tasks: return task::impl from make_and_start_task ' from Aleksandra Martyniuk Currently, make_and_start_task returns a pointer to task_manager::task that hides the implementation details. If we need to access the implementation (e.g. because we want a task to "return" a value), we need to make and start task step by step openly. Return task_manager::task::impl from make_and_start_task. Use it where possible. Fixes: https://github.com/scylladb/scylladb/issues/22146. Optimization; no backport Closes scylladb/scylladb#25743 * github.com:scylladb/scylladb: tasks: return task::impl from make_and_start_task compaction: use current_task_type repair: add new param to tablet_repair_task_impl repair: add new params to shard_repair_task_impl repair: pass argument by value	2025-08-31 15:44:37 +03:00
Nadav Har'El	ff91027eac	utils, alternator: fix detection of invalid base-64 This patch fixes an error-path bug in the base-64 decoding code in utils/base64.cc, which among other things is used in Alternator to decode blobs in JSON requests. The base-64 decoding code has a lookup table, which was wrongly sized 255 bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF) was included in an invalid base-64 string, instead of detecting that this is an invalid byte (since the only valid bytes in a base-64 string are A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a nonsense 6-bit part, or even crash on an out-of-bounds read. Besides the trivial fix, this patch also includes a reproducing test, which tries to write a blob as a supposedly base-64 encoded string with a 0xFF byte in it. The test fails before this patch (the write succeeds, unexpectedly), and passes after this patch (the write fails as expected). The test also passes on DynamoDB. Fixes #25701 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25705	2025-08-31 15:38:01 +03:00
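The table-size issue described above can be illustrated with a small sketch. It is not the code in utils/base64.cc; the names `DECODE_TABLE`, `INVALID` and `is_valid_base64_byte` are hypothetical. The point is that a lookup table indexed by a raw byte needs 256 entries, so that 0xFF maps to an explicit "invalid" marker.

```python
import string

ALPHABET = (string.ascii_uppercase + string.ascii_lowercase
            + string.digits + "+/").encode()
INVALID = 0xFF  # sentinel stored for bytes outside the alphabet

def make_decode_table():
    # One slot per possible byte value, 0..255 inclusive.
    table = [INVALID] * 256
    for value, byte in enumerate(ALPHABET):
        table[byte] = value  # 6-bit value of this base-64 character
    return table

DECODE_TABLE = make_decode_table()

def is_valid_base64_byte(byte):
    """True for A-Z, a-z, 0-9, '+', '/' and the '=' padding byte."""
    return byte == ord("=") or DECODE_TABLE[byte] != INVALID
```

With only 255 slots, looking up byte 0xFF would raise `IndexError` in Python; in C++ the same lookup is an out-of-bounds read.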
Avi Kivity	1f4c9b1528	Merge 'system_keyspace: add peers cache to get_ip_from_peers_table' from Petr Gusev The gossiper can call `storage_service::on_change` frequently (see scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues. This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL. This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620. Fixes scylladb/scylladb#25660 backport: this patch needs to be backported to all supported versions (2025.1/2/3). Closes scylladb/scylladb#25658 * github.com:scylladb/scylladb: storage_service: move get_host_id_to_ip_map to system_keyspace system_keyspace: use peers cache in get_ip_from_peers_table storage_service: move get_ip_from_peers_table to system_keyspace	2025-08-31 15:34:35 +03:00
Piotr Wieczorek	5add43e15c	alternator: streams: Address minor incompatibilities with DynamoDB in GetRecords response. This commit adds missing fields to GetRecords responses: `awsRegion` and `eventVersion`. We also considered changing `eventSource` from `scylladb:alternator` to `aws:dynamodb` and setting `SizeBytes` subfield inside the `dynamodb` field. We set `awsRegion` to the datacenter's name of the node that received the request. This is in line with the AWS documentation, except that Scylla has no direct equivalent of a region, so we use the datacenter's name, which is analogous to DynamoDB's concept of region. The field `eventVersion` determines the structure of a Record. It is updated whenever the structure changes. We think that adding a field `userIdentity` bumped the version from `1.0` to `1.1`. Currently, Scylla doesn't support this field (#11523), hence we use the older 1.0 version. We have decided to leave `eventSource` as is, since it's easy to modify it in case of problems to `aws:dynamodb` used by DynamoDB. Not setting `SizeBytes` subfield inside the `dynamodb` field was dictated by the lack of apparent use cases. The documentation is unclear about how `SizeBytes` is calculated and after experimenting a little bit, I haven't found an obvious pattern. Fixes: #6931 Closes scylladb/scylladb#24903	2025-08-31 14:55:47 +03:00
Avi Kivity	bf9a963582	utils: mark crc barrett tables const They're marked constinit, but constinit does not imply const. Since they're not supposed to be modified, mark them const too. Closes scylladb/scylladb#25539	2025-08-31 11:37:39 +03:00
Avi Kivity	bc5773f777	Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational and recoverable even under extreme conditions. To achieve this, the following proactive measures are implemented: - reject writes - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes - applicable to: user tables, views, CDC log, audit, cql tracing - stop running compactions/repairs and prevent new ones from starting - reject incoming tablet migrations The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches the critical level (default: 98%) and disabled when the utilization drops below the threshold. Apart from that, the series adds tests that require mounted volumes to simulate out of space. The paths to the volumes can be provided using a pytest argument, i.e. `--space-limited-dirs`. When not provided, tests are skipped. Test scenarios: 1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization 2. Perform an operation that would take the nodes over 100% 3. The nodes should not exceed the critical disk utilization (98% by default) 4. Scale out the cluster by adding one node per rack 5. Retry or wait for the operation from step 2 The operation is: writing data, running compactions, building materialized views, running repair, migrating tablets (caused by RF change, decommission). The test is successful, if no nodes run out of space, the operation from step 2 is aborted/paused/timed out and the operation from step 5 is successful. 
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency: Read path (before) ``` instructions_per_op: mean= 39661.51 standard-deviation=34.53 median= 39655.39 median-absolute-deviation=23.33 maximum=39708.71 minimum=39622.61 ``` Read path (after) ``` instructions_per_op: mean= 39691.68 standard-deviation=34.54 median= 39683.14 median-absolute-deviation=11.94 maximum=39749.32 minimum=39656.63 ``` Write path (before): ``` instructions_per_op: mean= 50942.86 standard-deviation=97.69 median= 50974.11 median-absolute-deviation=34.25 maximum=51019.23 minimum=50771.60 ``` Write path (after): ``` instructions_per_op: mean= 51000.15 standard-deviation=115.04 median= 51043.93 median-absolute-deviation=52.19 maximum=51065.81 minimum=50795.00 ``` Fixes: https://github.com/scylladb/scylladb/issues/14067 Refs: https://github.com/scylladb/scylladb/issues/2871 No backport, as it is a new feature. Closes scylladb/scylladb#23917 * github.com:scylladb/scylladb: tests/cluster: Add new storage tests test/scylla_cluster: Override workdir when passed via cmdline streaming: Reject incoming migrations storage_service: extend locator::load_stats to collect per-node critical disk utilization flag repair_service: Add a facility to disable the service compaction_manager: Subscribe to out of space controller compaction_manager: Replace enabled/disabled states with running state database: Add critical_disk_utilization mode database can be moved to disk_space_monitor: add subscription API for threshold-based disk space monitoring docs: Add feature documentation config: Add critical_disk_utilization_level option replica/exceptions: Add a new custom replica exception	2025-08-30 18:47:57 +03:00
Petr Gusev	898531fe7c	client_state: decoroutinize check_internal_table_permissions This function is on a hot path, better avoid allocating coroutine frames. Fixes scylladb/scylladb#25501 Closes scylladb/scylladb#25689	2025-08-30 18:46:54 +03:00
Avi Kivity	5c4a8ee134	Update seastar submodule * seastar 0a90f7945...c2d989333 (7): > Add missing `#pragma once` to response_parser.rl > simple-stream: avoid memcpy calls in fragmented streams for constant sizes > reactor: Move stopping activity out of main loop > Add sequential buffer size options to IOTune > disable exception interception when ASAN enabled > file, io_queue: Drop maybe_priority_class_ref{} from internal calls > reactor: Equip make_task() and lambda_task with concepts Closes scylladb/scylladb#25737	2025-08-30 14:53:34 +03:00
Calle Wilund	cc9eb321a1	commitlog: Ensure segment deletion is re-entrant Fixes #25709 If we have large allocations, spanning more than one segment, and the internal segment references from lead to secondary are the only thing keeping a segment alive, the implicit drop in discard_unused_segments and orphan_all can cause a recursive call to discard_unused_segments, which in turn can lead to vector corruption/crash, or even double free of segment (iterator confusion). Need to separate the modification of the vector (_segments) from actual releasing of objects. Using temporaries is the easiest solution. To further reduce recursion, we can also do an early clear of segment dependencies in callbacks from segment release (cf release). Closes scylladb/scylladb#25719	2025-08-30 08:24:57 +02:00
Piotr Dulikowski	7ccb50514d	Merge 'Introduce view building coordinator' from Michał Jadwiszczak This patch introduces `view_building_coordinator`, a single entity within the whole cluster responsible for building tablet-based views. The view building coordinator takes a slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table. The coordinator builds one base table at a time and it can choose another when all views of the currently processed base table are built. The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results. This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch. The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation. If the operation fails at any moment, aborted tasks are rolled back. The view building coordinator can also handle staging sstables using process_staging view building tasks. 
We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149). For detailed description check: `docs/dev/view-building-coordinator.md` Fixes https://github.com/scylladb/scylladb/issues/22288 Fixes https://github.com/scylladb/scylladb/issues/19149 Fixes https://github.com/scylladb/scylladb/issues/21564 Fixes https://github.com/scylladb/scylladb/issues/17603 Fixes https://github.com/scylladb/scylladb/issues/22586 Fixes https://github.com/scylladb/scylladb/issues/18826 Fixes https://github.com/scylladb/scylladb/issues/23930 --- This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942 Closes scylladb/scylladb#23760 * github.com:scylladb/scylladb: test/cluster: add view build status tests test/cluster: add view building coordinator tests utils/error_injection: allow to abort `injection_handler::wait_for_message()` test: adjust existing tests utils/error_injection: add injection with `sleep_abortable()` db/view/view_builder: ignore `no_such_keyspace` exception docs/dev: add view building coordinator documentation db/view/view_building_worker: work on `process_staging` tasks db/view/view_building_worker: register staging sstable to view building coordinator when needed db/view/view_building_worker: discover staging sstables db/view/view_building_worker: add method to register staging sstable db/view/view_update_generator: add method to process staging sstables instantly db/view/view_update_generator: extract generating updates from staging sstables to a method db/view/view_update_generator: ignore tablet-based sstables db/view/view_building_coordinator: update view build status on node join/left db/view/view_building_coordinator: handle tablet operations db/view: add view building task mutation builder service/topology_coordinator: run view building coordinator db/view: introduce `view_building_coordinator` 
db/view/view_building_worker: update built views locally db/view: introduce `view_building_worker` db/view: extract common view building functionalities db/view: prepare to create abstract `view_consumer` message/messaging_service: add `work_on_view_building_tasks` RPC service/topology_coordinator: make `term_changed_error` public db/schema_tables: create/cleanup tasks when an index is created/dropped service/migration_manager: cleanup view building state on drop keyspace service/migration_manager: cleanup view building state on drop view service/migration_manager: create view building tasks on create view test/boost: enable proxy remote in some tests service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` service/migration_manager: coroutinize `prepare_new_view_announcement()` service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine` service: reload `view_building_state_machine` on group0 apply() service/vb_coordinator: add currently processing base db/system_keyspace: move `get_scylla_local_mutation()` up db/system_keyspace: add `view_building_tasks` table db/view: add view_building_state and views_state db/system_keyspace: add method to get view build status map db/view: extract `system.view_build_status_v2` cql statements to system_keyspace db/system_keyspace: move `internal_system_query_state()` function earlier db/view: ignore tablet-based views in `view_builder` gms/feature_service: add VIEW_BUILDING_COORDINATOR feature	2025-08-29 17:28:44 +02:00
Aleksandra Martyniuk	7fe1ad1f63	tasks: return task::impl from make_and_start_task Currently, make_and_start_task returns a pointer to task_manager::task that hides the implementation details. If we need to access the implementation (e.g. because we want a task to "return" a value), we need to make and start task step by step openly. Return task_manager::task::impl from make_and_start_task. Use it where possible. Fixes: https://github.com/scylladb/scylladb/issues/22146.	2025-08-29 17:12:07 +02:00
Aleksandra Martyniuk	0844a057d1	compaction: use current_task_type	2025-08-29 17:08:00 +02:00
Aleksandra Martyniuk	33a547e740	test: check that repair with outdated session_id fails	2025-08-29 17:00:48 +02:00
Aleksandra Martyniuk	8f967cde5c	service: pass current session_id to repair rpc Currently, in repair_tablet we retrieve session_id from tablet map (and throw if it isn't specified). In case of topology coordinator failover, we may end up in a situation where a node runs outdated repair, treating session of a different operation as the repair's session: - topology coordinator starts repair transition (A); - topology coordinator sends tablet repair rpc to node1; - topology coordinator is separated from the cluster; - new topology coordinator is elected; - new topology coordinator sees waiting repair request (A_2) and executes it; - new repair of the same tablet is requested (B); - new topology coordinator starts repair transition (B); - new topology coordinator sends tablet repair rpc to node2; - node2 starts repair (B) as repair master; - node1 starts repair (A), checks the current session (B), proceeds with repair (B) as repair master. Send current session_id in repair_tablet rpc. If this session_id and session id got from tablet map don't match, an exception is thrown.	2025-08-29 16:46:52 +02:00
Łukasz Paszkowski	e34deea50e	tests/cluster: Add new storage tests The storage submodule contains tests that require mounted volumes to be executed. The volumes are created automatically with the `volumes_factory` fixture. The tests in this suite are executed with the custom launcher `unshare -mr pytest` Test scenarios (when one node reaches critical disk utilization): 1. Reject user table writes 2. Disable/Enable compaction 3. Reject split compactions 4. New split compactions not triggered 5. Abort tablet repair 6. Disable/Enable incoming tablet migrations 7. Restart a node while a tablet split is triggered	2025-08-29 14:56:13 +02:00
Łukasz Paszkowski	4bb5696a5d	test/scylla_cluster: Override workdir when passed via cmdline Currently, workdir is set in ScyllaCluster constructor and it does not take into account that the value could be overridden via cmdline arguments. When this happens, some data (logs, configs) is stored under one path and other data is stored under a different one. The patch allows overriding the value when passed via cmdline arguments, leading to all files being stored under the same path.	2025-08-29 14:56:13 +02:00
Łukasz Paszkowski	7cfedb1214	streaming: Reject incoming migrations When a replica operates in the critical disk utilization mode, all incoming migrations are being rejected by rejecting an incoming sstable file. In the topology_coordinator, the rejected tablet is moved into the cleanup_target state in order to revert migration. Otherwise, retry happens and a cluster stays in the tablet_migration transition state preventing any other topology changes to happen, e.g. scaling out. Once the tablet migration is rejected, the load balancer will schedule a new migration.	2025-08-29 14:56:13 +02:00
Łukasz Paszkowski	54201960e6	storage_service: extend locator::load_stats to collect per-node critical disk utilization flag This commit extends the TABLE_LOAD_STATS RPC with information whether a node operates in the critical disk utilization mode. This information will be needed to distinguish between the causes why a table migration/repair was interrupted.	2025-08-29 14:56:13 +02:00
Łukasz Paszkowski	9809800aa8	repair_service: Add a facility to disable the service Repair service currently has two functions: stop() and shutdown() that stop all ongoing repairs and prevent any further repairs from being started. It is possible to stop the repair_service once. Once stopped, it cannot be restarted. We would like, however, to enable / disable the repair service many times. Similarly to compaction_manager, the repair service provides two new functions: - drain() - abort all ongoing local repair tasks and disable the service, i.e. no new local task will be scheduled and data received from the repair master is rejected. It's, though, still possible to schedule a global repair request - enable() - enable the service By default, the repair service is enabled immediately once started. For tablet-based keyspaces, the new facility prevents tablets from being repaired. Whenever the repair_service is disabled and the request to repair a tablet arrives, an exception is returned. Once the exception is thrown, the tablet is moved into the end_repair state and the operation will be retried later. Hence, disabling the service does not fail the global tablet repair request.	2025-08-29 14:56:13 +02:00
Łukasz Paszkowski	9539e80e54	compaction_manager: Subscribe to out of space controller	2025-08-29 14:56:07 +02:00
Aleksandra Martyniuk	f3b43b6384	repair: add new param to tablet_repair_task_impl Currently, sched_info is set immediately after tablet_repair_task_impl is created. Pass this param to constructor instead. It's a preparation for the following changes.	2025-08-29 14:37:00 +02:00
Aleksandra Martyniuk	57b47e282e	repair: add new params to shard_repair_task_impl Currently, neighbors and small_table_optimization_ranges_reduced_factor are set immediately after shard_repair_task_impl is created. Pass these params to constructor instead. It's a preparation for following changes.	2025-08-29 14:27:00 +02:00
Aleksandra Martyniuk	6a0d8728de	repair: pass argument by value shard_repair_task_impl constructor gets some of its arguments by const reference. Due to that those arguments are copied when they could be moved. Get shard_repair_task_impl constructor arguments by value. Use std::move where possible.	2025-08-29 14:24:47 +02:00
Dawid Mędrek	90a2e0d1cc	cdc/generation: Delete copy constructors of topology_description The object might be quite big and lead to reactor stalls. Using its copy constructor is asking for trouble, so let's explicitly delete it.	2025-08-29 13:54:16 +02:00
Dawid Mędrek	508e00319b	cdc/generation: Clone topology_description asynchronously An instance of `cdc::topology_description` can be quite big. The vector it consists of stores as many `token_range_description`s as there are vnodes, and the size of each `token_range_description` is O(#shards). Because of that, copying an instance of the type can lead to reactor stalls. To prevent that, we introduce an asynchronous function copying the contents of the object. Reactor stalls were detected in the call to `map_reduce` in `generation_service::legacy_do_handle_cdc_generation`, so let's start using the new function there. A similar scenario occurs in `generation_service::handle_cdc_generation`, so we modify it too. Unfortunately, it doesn't seem viable to provide a reproducer of said problem. Fixes scylladb/scylladb#24522	2025-08-29 13:54:00 +02:00
Łukasz Paszkowski	40c40be8a6	compaction_manager: Replace enabled/disabled states with running state Using a single state variable to keep track whether compaction manager is enabled/disabled is insufficient, as multiple services may independently request compactions to be disabled. To address this, a counter is introduced to record how many times the compaction manager has been drained. The manager is considered enabled only when this counter reaches zero. Introducing a counter, enabled and disabled states become obsolete. So they are replaced with a single running state.	2025-08-29 13:47:01 +02:00
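The counter-based scheme described above can be sketched as follows. This is a minimal illustration, not the compaction_manager implementation; the class name `DrainCounter` and its methods are hypothetical. Several independent callers may drain the manager, and it counts as enabled again only once every drain has been matched by an enable.

```python
class DrainCounter:
    """Hypothetical sketch: enabled only while no drain is outstanding."""

    def __init__(self):
        self._drain_count = 0

    def drain(self):
        # Each service that wants compactions stopped bumps the counter.
        self._drain_count += 1

    def enable(self):
        # Undo exactly one earlier drain(); refuse unbalanced calls.
        if self._drain_count == 0:
            raise RuntimeError("enable() called without a matching drain()")
        self._drain_count -= 1

    @property
    def enabled(self):
        return self._drain_count == 0


manager = DrainCounter()
manager.drain()   # e.g. requested by one service
manager.drain()   # e.g. requested independently by another
manager.enable()  # the first requester is done
still_disabled = not manager.enabled
manager.enable()  # the second requester is done
```

A single boolean would have re-enabled compactions after the first `enable()`, even though the second requester still needed them stopped.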
Łukasz Paszkowski	3d03b88719	database: Add critical_disk_utilization mode database can be moved to When database operates in the critical disk utilization mode, all mutation writes (including inserts, updates, deletes, counter updates, hints, read+repair, lwt writes) to user tables and other tables associated with them, like views, CDC log and audit, are rejected, with a clear error exception returned. The mode is meant to be used with the disk space monitor in order to prevent any user writes when node's disk utilization is too high.	2025-08-29 13:46:45 +02:00
Dawid Pawlik	a70086c781	create_index_statement: rename `validator` to `custom_index_factory` The change is motivated by the fact that indeed the result of `get_custom_class_factory` is a `custom_index_factory`. The name `validator` was a bit misleading as it does not validate anything by itself. Furthermore if we wanted to use the custom index produced by the factory in other operations than validate, the name feels really off.	2025-08-29 10:49:15 +02:00
Dawid Pawlik	873d7dba5c	custom index: rename `custom_index_option_name` Renamed `custom_index_option_name` to `custom_class_option_name` as the old name was a bit misleading since we refactored our model of custom indexes to be index class reliant.	2025-08-29 10:49:15 +02:00
Dawid Pawlik	18e4b9d989	vector_index: rename `supported_options` to `vector_index_options` There are a few types of index options abstraction in the code. One is `raw_options` which indicates the options provided by the user via CQL. Another is `options` which includes the real index options after correction checks and addition of system-set options. I believe we do not need another abstraction with an undescriptive name. This patch adds a little neatness, describing what the developer should understand by looking at the `supported_options`. These options are only provided for the vector index to set up the external index properly with parameters strongly related to Vector Search.	2025-08-29 10:47:02 +02:00
Lakshmi Narayanan Sreethar	ce0c29e024	types/comparable_bytes: add compatibility testcases for collection types This patch adds compatibility testcases for the following cql3 types : set, list, map, tuple, vector and reversed types. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	4547f6f188	types/comparable_bytes: update compatibility testcase to support collection types The `abstract_type::from_string()` method used to parse the input data doesn't support collections yet. So the collection testdata will be passed as JSON strings to the testcase. This patch updates the testcase to adapt to this workaround. Also, extended the testcase to verify that Scylla's implementation can successfully decode the byte comparable output encoded by Cassandra. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	0997b3533c	types/comparable_bytes: support empty type Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	b799101a09	types/comparable_bytes: support reversed types A reversed type is first encoded using the underlying type and then all the bits are flipped to ensure that the lexicographical sort order is reversed. During decode, the bytes are flipped first and then decoded using the underlying type. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
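The bit-flipping idea can be shown on a small example. This sketch assumes a fixed-width, big-endian unsigned 32-bit encoding as the "underlying type", which is an illustrative choice and not Scylla's actual encoding; all function names are hypothetical.

```python
import struct

def encode_u32(value):
    # Big-endian bytes of an unsigned int compare in numeric order.
    return struct.pack(">I", value)

def flip(data):
    # Invert every bit of every byte.
    return bytes(b ^ 0xFF for b in data)

def encode_reversed_u32(value):
    # Encode with the underlying type, then flip all bits.
    return flip(encode_u32(value))

def decode_reversed_u32(data):
    # Flip the bits back first, then decode with the underlying type.
    return struct.unpack(">I", flip(data))[0]

values = [7, 300, 0, 65536]
ascending = sorted(values, key=encode_u32)            # [0, 7, 300, 65536]
descending = sorted(values, key=encode_reversed_u32)  # [65536, 300, 7, 0]
```

Sorting by the flipped bytes yields the exact reverse of sorting by the plain encoding, and decoding recovers the original value.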
Lakshmi Narayanan Sreethar	6c2a3e2c51	types/comparable_bytes: support vector cql3 type The CQL vector type encoding is similar to the lists, where each element is transformed into a byte-comparable format and prefixed with a component marker. The sequence is terminated with a terminator marker to indicate the end of the collection. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	1ccfe522f1	types/comparable_bytes: support tuple and UDT cql3 type The CQL tuple and UDT types share the same internal implementation and therefore use the same byte comparable encoding. The encoding is similar to lists, where each element is transformed into a byte-comparable format and prefixed with a component marker. The sequence is terminated with a terminator marker to indicate the end of the collection. TODO: Add duplicate test items to maps, lists and sets For maps, add more entries that share keys ex map1 : key1 : value1, key2 : value2 map2 : key1 : value4 map3 : key2 : value5 etc Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	ca38c15a97	types/comparable_bytes: support map cql3 type The CQL map type is encoded as a sequence of key-value pairs. Each key and each value is individually prefixed with a component marker, and the sequence is terminated with a terminator marker to indicate the end of the collection. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	4d5e5f0c84	types/comparable_bytes: support set and list cql3 types The CQL set and list types are encoded as a sequence of elements, where each element is transformed into a byte-comparable format and prefixed with a component marker. The sequence is terminated with a terminator marker to indicate the end of the collection. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar	8e46e8be01	types/comparable_bytes: introduce encode/decode_component The components of a collection, such as an element from a list, set, or vector; a key or value from a map; or a field from a tuple, share the same encode and decode logic. During encode, the component is transformed into the byte comparable format and is prefixed with the `NEXT_COMPONENT` marker. During decode, the component is transformed back into its serialized form and is prefixed with the serialized size. A null component is encoded as a single `NEXT_COMPONENT_NULL` marker and during decode, a `-1` is written to the serialized output. This commit introduces few helper methods that implement the above mentioned encode and decode logics. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:21 +05:30
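The component framing on the encode side can be sketched like this. The marker byte values below are assumptions made for illustration (the byte-comparable specification referenced earlier in this log is the authority), and the function names only mirror the commit's wording; this is not the Scylla implementation. Elements are assumed to be already in byte-comparable form.

```python
# Assumed marker values, for illustration only.
NEXT_COMPONENT = 0x40
NEXT_COMPONENT_NULL = 0x3E
TERMINATOR = 0x38

def encode_component(encoded_element):
    """Prefix one already-encoded element with its component marker."""
    if encoded_element is None:
        # A null component is just the single null marker.
        return bytes([NEXT_COMPONENT_NULL])
    return bytes([NEXT_COMPONENT]) + encoded_element

def encode_sequence(encoded_elements):
    """Encode a list/set/vector-like sequence and terminate it."""
    body = b"".join(encode_component(e) for e in encoded_elements)
    return body + bytes([TERMINATOR])

encoded = encode_sequence([b"\x01", None, b"\x02"])
shorter = encode_sequence([b"\x01"])
longer = encode_sequence([b"\x01", b"\x02"])
```

Because the terminator is smaller than the component marker in this sketch, a sequence that is a prefix of another compares as smaller, which is the ordering one would expect for lists.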
Lakshmi Narayanan Sreethar	47e88be6e0	types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes Added helper functions to_comparable_bytes() and from_comparable_bytes() to let collection encode/decode methods invoke encode/decode of the underlying types. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2025-08-29 12:26:09 +05:30
Dario Mirovic	8120709231	transport: replace `make_frame` throw with return result `cql_transport::response::make_frame` used to throw `protocol_exception`. With this change it will return `result_with_exception_ptr<sstring>` instead. Code changes are propagated to `cql_transport::cql_server::response::make_message` and from there to `cql_transport::cql_server::connection::write_response`. `write_response` continuation calling `make_message` used to transform the exception from `make_message` to an exception future, and now the logic stays the same, just explicitly stated at this code layer, so the behavior is not changed. Refs: #24567	2025-08-28 23:33:33 +02:00
Dario Mirovic	ba178f4c85	cql3: remove throwing `protocol_exception` Remove throwing `protocol_exception` in `cql3/query_options.cc` in function `cql3::query_options::check_serial_consistency` as part of an ongoing effort to remove throwing `protocol_exception`. This change only affects code local to the `cql3` module. Refs: #24567	2025-08-28 23:33:15 +02:00
Dario Mirovic	fc123f865e	transport: replace throw in validate_utf8 with result_with_exception_ptr return As part of the effort to replace `protocol_exception` throws, `validate_utf8` from `cql_transport::request_reader` throw is replaced with returning `utils::result_with_exception_ptr`. This change affects only the three places it is called from in the same file `transport/request.hh`. Refs: #24567	2025-08-28 23:32:28 +02:00
Dario Mirovic	51995af258	transport: replace throwing protocol_exception with returns Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance. Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read*` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_exception_ptr`. This change is then propagated to the callers. The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_exception_ptr_throw_policy`. This means that the behavior of commitlog module stays the same. transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_exception_ptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved. Fixes: #24567	2025-08-28 23:31:36 +02:00
Dario Mirovic	f01efd822e	utils: add result_with_exception_ptr Add `result_with_exception_ptr` result type. Successful result has user specified type. Failed result has std::exception_ptr. This approach is simpler than `result_with_exception`. It does not require user to pass exception types as variadic template through all the callstack. Specific exception type can still be accessed without costly std::rethrow_exception(eptr) by using `try_catch`, if configured so via `USE_OPTIMIZED_EXCEPTION_HANDLING`. This means no information loss, but less verbosity when writing result types. Refs: #24567	2025-08-28 23:31:04 +02:00
Łukasz Paszkowski	3e740d25b5	disk_space_monitor: add subscription API for threshold-based disk space monitoring Introduce the `subscribe` method to disk_space_monitor, allowing clients to register callbacks triggered when disk utilization crosses a configurable threshold. The API supports flexible trigger options, including notifications on threshold crossing and direction (above/below). This enables more granular and efficient disk space monitoring for consumers.	2025-08-28 18:06:37 +02:00
Łukasz Paszkowski	c2de678a87	docs: Add feature documentation 1. Adds user-facing page in /docs/troubleshooting/error-messages	2025-08-28 18:06:37 +02:00
Łukasz Paszkowski	535c901e50	config: Add critical_disk_utilization_level option The option defines the threshold at which the defensive mechanisms preventing nodes from running out of space, e.g. rejecting user writes shall be activated. Its default value is 98% of the disk capacity.	2025-08-28 18:06:37 +02:00
Łukasz Paszkowski	132fd1e3f2	replica/exceptions: Add a new custom replica exception The new exception `critical_disk_utilization_exception` is thrown when the user table mutation writes are being blocked due to e.g. reaching a critical disk utilization level. This new exception is then correctly handled on the coordinator side, where it is transformed into `mutation_write_failure_exception` with a meaningful error message: "Write rejected due to critical disk utilization".	2025-08-28 18:06:37 +02:00
Ferenc Szili	1b8a44af75	test: reproducer and test for drop with concurrent cleanup This change adds a reproducer and test for issue #25706	2025-08-28 16:51:36 +02:00
Ferenc Szili	a0934cf80d	truncate: check for closed storage group's gate in discard_sstables Consider the following scenario: - A tablet is migrated away from a shard - The tablet cleanup stage closes the storage group's async_gate - A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate - Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash This patch makes discard_sstables check if the storage group's gate is closed when checking for disabled compaction.	2025-08-28 16:51:25 +02:00
Petr Gusev	4b907c7711	storage_service: move get_host_id_to_ip_map to system_keyspace Reimplemented the function to use the peers cache. It could be replaced with get_ip_from_peers_table, but that would create a coroutine frame for each call.	2025-08-28 12:48:46 +02:00
Petr Gusev	de5dc4c362	system_keyspace: use peers cache in get_ip_from_peers_table The storage_service::on_change method can be called quite often by the gossiper, see scylladb/scylla-enterprise#5613. In this commit we introduce a temporal cache for system.peers so that we don't have to go to the storage each time we need to resolve host_id -> ip. We keep the cache only for a small amount of time to handle the (unlikely) scenario when the user wants to update system.peers table from CQL. Fixes scylladb/scylladb#25660	2025-08-28 12:48:39 +02:00
Avi Kivity	46193f5e79	Merge 'service/qos: Modularize service level controller to avoid invalid access to auth::service' from Dawid Mędrek Move management over effective service levels from `service_level_controller` to a new dedicated type -- `auth_integration`. Before these changes, it was possible for the service level controller to try to access `auth::service` after it was deinitialized. For instance, it could happen when reloading the cache. That HAS happened as described in the following issue: scylladb/scylladb#24792. Although the problem might have been mitigated or even resolved in scylladb/scylladb@10214e13bd, it's not clear how the service will be used in the future. It's best to prevent similar bugs than trying to fix them later on. The logic responsible for preventing to access an uninitialized `auth::service` was also either non-existent, complex, or non-sufficient. To prevent accessing `auth::service` by the service level controller, we extract the relevant portion of the code to a separate entity -- `auth_integration`. It's an internal helper type whose sole purpose is to manage effective service levels. Thanks to that, we were able to nest the lifetime of `auth_integration` within the lifetime of `auth::service`. It's now impossible to attempt to dereference it while it's uninitialized. If a bug related to an invalid access is spotted again, though, it might also be easier to debug it now. There should be no visible change to the users of the interface of the service level controller. We strived to make the patch minimal, and the only affected part of the logic should be related to how `auth::service` is accessed. The relevant portion of the initialization and deinitialization flow: (a) Before the changes: 1. Initialize `service_level_controller`. Pass a reference to an uninitialized `auth::service` to it. 2. Initialize other services. 3. Initialize and start `auth::service`. 4. (work) 5. Stop and deinitialize `auth::service`. 6. 
Deinitialize other services. 7. Deinitialize `service_level_controller`. (b) After the changes: 1. Initialize `service_level_controller`. Pass a reference to an uninitialized `auth::service` to it. () 2. Initialize other services. 3. Initialize and start `auth::service`. 4. Initialize `auth_integration`. Register it in `service_level_controller`. 5. (work) 6. Unregister `auth_integration` in `service_level_controller` and deinitialize it. 7. Stop and deinitialize `auth::service`. 8. Deinitialize other services. 9. Deinitialize `service_level_controller`. (): The reference to `auth::service` in `service_level_controller` is still necessary. We need to access the service when dropping a distributed service level. Although it would be best to cut that link between the service level controller and `auth::service` too, effectively separating the entities, it would require more work, so we leave it as-is for now. It shouldn't prove problematic as far as accessing an uninitialized service goes. Trying to drop a service level at the point when we're de-initializing auth should be impossible. For more context, see the function `drop_distributed_service_level` in `service_level_controller`. A trivial test has been included in the PR. Although its value is questionable as we only try to reload the service level cache at a specific moment, it's probably the best we can deliver to provide a reproducer of the issue this patch is resolving. Fixes scylladb/scylladb#24792 Backport: The impact of the bug was minimal as it only affected the shutdown. However, since CI is failing because of it, let's backport the change to all supported versions. Closes scylladb/scylladb#25478 * github.com:scylladb/scylladb: service/qos: Move effective SL cache to auth_integration service/qos: Add auth::service to auth_integration service/qos: Reload effective SL cache conditionally service/qos: Add gate to auth_integration service/qos: Introduce auth_integration	2025-08-28 13:42:55 +03:00
Petr Gusev	91c633371e	storage_service: move get_ip_from_peers_table to system_keyspace We plan to add a cache to get_ip_from_peers_table in upcoming commits. It's more convenient to do this from system_keyspace, since the only two methods that mutate system.peers (remove_endpoint and update_peers_info) are already there.	2025-08-28 12:30:41 +02:00
Aleksandra Martyniuk	773ae73704	test: add compaction task progress test	2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk	f3ed852115	compaction: set progress unit for compaction tasks	2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk	046742bd18	compaction: find expected workload for reshard tasks Find expected workload in bytes of reshard tasks. The workload of table_resharding_compaction_task_impl is found at the beginning of its execution. Before that, expected_total_workload() returns std::nullopt, which means that the progress for this task won't be shown.	2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk	a2381380f2	compaction: find expected workload for global cleanup compaction tasks Sum bytes of all sstables of all non local vnode keyspaces.	2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk	cc9e342cb6	compaction: find expected workload for global major compaction tasks Sum bytes of all sstables of all keyspaces.	2025-08-28 12:10:13 +02:00
Aleksandra Martyniuk	b12fc50de7	compaction: find expected workload for keyspace compaction tasks Add compaction_task_impl::get_keyspace_task_workload that sums the bytes in all sstables of this keyspace. This function is used to find the expected workload of the following keyspace compaction types: - major; - cleanup; - offstrategy; - upgrade_sstables; - scrub.	2025-08-28 12:10:07 +02:00
Aleksandra Martyniuk	82e241eb00	compaction: find expected workload for shard compaction tasks Add compaction_task_impl::get_shard_task_workload that sums the bytes in all sstables of this keyspace on this shard. This function is used to find the expected workload of the following shard compaction types: - major; - cleanup; - offstrategy; - upgrade_sstables; - scrub.	2025-08-28 12:04:16 +02:00
Ran Regev	515d9f3e21	docs: backup and restore feature added backup and restore as a feature to documentation Signed-off-by: Ran Regev <ran.regev@scylladb.com> Closes scylladb/scylladb#25608	2025-08-28 13:00:19 +03:00
Calle Wilund	2eccd17e70	system_keyspace: Limit parallelism in drop_truncation_records Fixes #25682 Refs scylla-enterprise#5580 If the truncation table is large in entries, we might create a huge parallel execution, quite possibly consuming loads of resources doing something quite trivial. Limit concurrency to a small-ish number Closes scylladb/scylladb#25678	2025-08-28 12:50:00 +03:00
Nadav Har'El	aa36430ff9	build: make patchelf executable much smaller In our recent binary distributions, we have a pretty big "patchelf" binary: -rwxr-xr-x. 1 nyh nyh 2.5M Jul 30 21:16 build/2025.3.0~rc2/libexec/patchelf Although 2.5 MB isn't what it used to be, it's still surprising that this tiny tool, that doesn't need any libraries beyond standard C++ (it doesn't use Seastar, Boost, or anything) can be this big. And 2.5 MB is over 1% of our entire "relocatable package" size, just for this silly patchelf tool :-( It turns out this was all just a mistake in our configure.py build system - patchelf was built by the exact same code that built the "scylla" executable (it is listed on the "apps" list just like Scylla), so it got linked with a bazillion libraries - and in "release" build mode, some of this was against statically linked libraries. So in this patch I move patchelf from the "apps" list to a new list of "cpp_apps" - tools that need to be built with C++ but without libraries like Seastar or abseil or boost. After this patch, the 2.5 MB patchelf is down to just 30 KB. I verified that the Cmake-based build doesn't have this problem, so doesn't need fixing - it already builds patchelf with size around 30 KB. So this patch only needs to modify configure.py. Fixes #25472 Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25476	2025-08-28 12:36:53 +03:00
Aleksandra Martyniuk	926753e8bf	compaction: find expected workload for table compaction tasks Add compaction_task_impl::get_table_task_workload that sums the bytes in all sstables in the table. This function is used to find the expected workload of the following compaction types: - major; - cleanup; - offstrategy; - upgrade_sstables; - scrub.	2025-08-28 10:41:22 +02:00
Dario Mirovic	587a877718	docs/cql: update documentation for default replication factor Update create-keyspace-statement section of ddl.rst since replication factor is no longer mandatory. Add an example for keyspace creation without specifying replication factor. Add an example for keyspace creation without specifying both `class` and replication factor. Refs: #16028	2025-08-28 01:42:34 +02:00
Dario Mirovic	fd84da7a50	test/cqlpy: add keyspace creation default replication factor tests Add test cases for create keyspace default replication factor. It is expected that the default replication factor is equal to the number of racks containing at least some non-zero-token nodes in the test suite. Refs: #16028	2025-08-28 01:42:34 +02:00
Dario Mirovic	ca5adf2ac1	cql3: add default replication factor to `create_keyspace_statement` When creating a new keyspace, replication factor must be stated. For example: `CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy', 'replication_factor': 3 };` This patch changes it in the following way - if there is no replication factor specified, use default replication factor. Default replication factor is equal to the number of racks that are not comprised of only zero-token nodes, i.e. racks that have at least one non-zero-token node. The following syntax is now valid: `CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy' };` `CREATE KEYSPACE ks WITH REPLICATION { };` Fixes: #16028	2025-08-28 01:42:29 +02:00
Radosław Cybulski	01bb7b629a	build: add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros Closes #25182	2025-08-27 21:37:54 +03:00
Aleksandra Martyniuk	bd28c50d84	compaction: return empty progress when compaction_size isn't set Currently, progress of compaction task executors is reported in bytes. However, if compaction_size isn't set for compaction task executor, the executor's progress is shown as 1/1 (if it has finished) or 0/1 (otherwise). In the following patches, the progress of executors' parent task will be found based on its children. Hence, to avoid mixing different progress units, the binary progress is no longer used. Return empty progress when compaction_size isn't set. Drop task_manager::task::impl::get_binary_progress as it's no longer used.	2025-08-27 17:51:21 +02:00
Aleksandra Martyniuk	f78dbff814	compaction: update compaction_data::compaction_size at once Currently, in compaction::setup compaction_size is updated in a loop. Due to that the total progress of compaction executors grows during their execution. Add the sstables sizes to a compaction_size variable. Update compaction_data::compaction_size after the loop.	2025-08-27 17:50:36 +02:00
Aleksandra Martyniuk	836159b0c3	tasks: do not check expected workload for done task task_manager::task::impl::get_progress checks the expected total workload of a task to find its progress. If a task has finished successfully then its workload is equal to the sum of total progresses of its children. Do not call expected_total_workload for tasks that have finished successfully.	2025-08-27 17:48:25 +02:00
Dawid Mędrek	646f8bc4cd	cdc: Set tombstone_gc when creating log table Normally, when we create a table, MV, etc., we apply `cf_prop_defs` to the schema builder via the function `cf_prop_defs::apply_to_builder`. Unfortunately, that didn't happen when creating CDC log tables, and so we might have missed some of the properties that would normally be set to some value, even if the default one. One particular example of that phenomenon was `tombstone_gc`. For better or worse, it's not a "standalone property" of a table, but rather part of `extensions`. [Somewhat related issue: scylladb/scylladb#9722] That may have and did cause trouble. Consider this scenario: 1. A CDC log table is created. 2. The table does NOT have any value of `tombstone_gc` set. 3. The user edits the table via `ALTER TABLE`. That statement treats the log table just like any other one (at least as far as the relevant portion of the logic is concerned). Among other things, it uses `cf_prop_defs::apply_to_builder`, and as a result, the `tombstone_gc` property is set to some value: * the default one if the user doesn't specify it in the statement, * a custom one if they do. Why is that a problem? First of all, it's confusing. When we perform a schema backup and a table uses CDC, we include an ALTER statement for its corresponding CDC log table (for more context, see issue scylladb/scylladb#18467 or commit scylladb/scylladb@f12edbdd95). There are two consequences for the user here: 1. If the log table had NOT been altered ever since it was created, the statement will miss the `tombstone_gc` property as if it couldn't be set for it at all. That's confusing! 2. If the log table HAD in fact been altered after its creation, the statement will include the `tombstone_gc` property. That's even more confusing (why was it not present the first time, but it is now?). 
The `tombstone_gc` property should always be set to avoid confusion and problematic edge cases in tests and to simply be consistent with how other schema entities work. The solution we employ is that we always set the property to the default value. That includes the case when we reattach the log table to the base; consider the following scenario: 1. Create a table with CDC enabled. 2. Detach the log table by performing `ALTER TABLE ... WITH cdc = {'enabled': false}`. 3. Change the `tombstone_gc` property of the log table. 4. Reattach the log table to the base in the same way as in step 2. The expected result would be that the new value of `tombstone_gc` would be preserved after reattaching the log table. However, that's not what will happen. We decide to stay consistent with how other properties of a log table behave, and we reset them after every reattachment. We might change that in the future: see issue scylladb/scylladb#25523. Two reproducer tests of scylladb/scylladb#25187 are included in the changes. Fixes scylladb/scylladb#25187	2025-08-27 13:18:41 +02:00
Anna Mikhlin	03b127082d	trigger scylla-ci Jenkins job by command trigger Scylla-CI-Route job that will trigger the scylla-ci jenkins job with the relevant params by specific command: `@scylladbbot trigger-ci` Fixes: https://scylladb.atlassian.net/browse/PKG-2 Closes scylladb/scylladb#25695	2025-08-27 14:12:28 +03:00
Dawid Mędrek	2229060992	tombstone_gc: Add overload of get_default_tombstone_gc_mode We add a new overload of the function to avoid accessing the information about a keyspace via `data_dictionary::database` or `replica::database`. We motivate the change by the fact that there are situations when that piece of information might not be available: for instance, Alternator tables reside in separate keyspaces created specifically for them. When we create one, the mutations corresponding to creating the keyspace and the table must be applied together to ensure atomicity. Because of that, during the creation of the table, we will not be able to learn anything about the keyspace as it doesn't exist yet. That scenario is the actual motivation for this commit, and it is a prerequisite for upcoming changes in creation of CDC log tables. For more context on that problem, see issue: scylladb/scylladb#25187.	2025-08-27 13:00:10 +02:00
Dawid Mędrek	fd4e577db0	tombstone_gc: Rename get_default_tombstonesonte_gc_mode The previous identifier was probably a typo that was missed.	2025-08-27 13:00:10 +02:00
Nadav Har'El	401b04f9ea	vector-store: disambiguate call to format() As explained in commit `3e84d43f93` two years ago, using just format() instead of seastar::format() or fmt::format() is frowned upon, because it can cause ambiguities - resulting in compile errors - in some cases. The compile errors seem to crop up randomly depending on the exact version of fmt used, so the build can work in CI using one specific version, but fail on a developer's machine using a different version. In this patch I fix one such ambiguity that breaks compilation on my development machine's fmt-11.0.2 and clang 19.1.7, but works fine on the slightly newer frozen toolchain. The error I get before this fix is: service/vector_store_client.cc:261:39: error: call to 'format' is ambiguous 261 \| throw configuration_exception(format("Invalid Vector Store service URI: {}", uri)); \| ^~~~~~ Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25691	2025-08-27 13:54:33 +03:00
Nadav Har'El	0a990d2a48	config: split tri_mode_restriction to a separate header Today, any source file or header file that wants to use the tri_mode_restriction type needs to include db/config.hh, which is a large and frequently-changing header file. In this patch we split this type into a separate header file, db/tri_mode_restriction.hh, and avoid a few unnecessary inclusions of db/config.hh. However, a few source files now need to explicitly include db/config.hh, after its transitive inclusion is gone. Note that the overwhelmingly common inclusion of db/config.hh continues to be a problem after this patch - 128 source files include it directly. So this patch is just the first step in a long journey. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes scylladb/scylladb#25692	2025-08-27 13:47:04 +03:00
Michał Jadwiszczak	39db90a535	test/cluster: add view build status tests	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	f7ebc7b054	test/cluster: add view building coordinator tests	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	90b5b2c5f5	utils/error_injection: allow to abort `injection_handler::wait_for_message()`	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	cf138da853	test: adjust existing tests - Disable tablets in `test_migration_on_existing_raft_topology`. Because views on tablets are experimental now, we can safely assume that view building coordinator will start with view build status on raft. - Add error injection to pause view building on worker. Used to pause view building process, there is analogous error injection in view_builder. - Do a read barrier in `test_view_in_system_tables` Increases test stability by making sure that the node sees up-to-date group0 state and `system.built_views` is synced. - Wait for the view to be built in some tests Increases test stability by making sure that the view is built. - Remove xfail marker from `test_tablet_streaming_with_unbuilt_view` This series fixes https://github.com/scylladb/scylladb/issues/21564 and this test should work now.	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	6056b55309	utils/error_injection: add injection with `sleep_abortable()`	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	1e2fa069df	db/view/view_builder: ignore `no_such_keyspace` exception	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	c4288aa1f8	docs/dev: add view building coordinator documentation	2025-08-27 10:23:04 +02:00
Michał Jadwiszczak	454033a773	db/view/view_building_worker: work on `process_staging` tasks	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	233f4dcee3	db/view/view_building_worker: register staging sstable to view building coordinator when needed Change return type of `check_needs_view_update_path()`. Instead of returning a bool that tells whether to use the staging directory (and register to `view_update_generator`) or the normal directory, the function now returns an enum with possible values: - `normal_directory` - use normal directory for the sstable - `staging_directly_to_generator` - use staging directory and register to `view_update_generator` - `staging_managed_by_vbc` - use staging directory but don't register it to `view_update_generator` but create view building tasks for later The third option is new, it's used when the table has any view which is currently in the building process. In this case, registering it to `view_update_generator` prematurely may lead to base-view inconsistency (for example when a replica is in a pending state).	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	f61039cbfd	db/view/view_building_worker: discover staging sstables When starting view_building_worker, go through all staging sstables for tablet-tables and register them locally. If an sstable has no associated view building task, create one.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	651827cdab	db/view/view_building_worker: add method to register staging sstable The method will be used when a new staging sstable needs to go through the view building coordinator (the coordinator will decide when to process this staging sstable). Callers push new staging sstables to a queue and notify the async fiber to create `view_building_task`s from the sstables and commit them to group0.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	9de35ed2a2	db/view/view_update_generator: add method to process staging sstables instantly When the view building coordinator is sending `process_staging` task, we want to skip view_update_generator's staging sstables loop and process them instantly.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	2516993c70	db/view/view_update_generator: extract generating updates from staging sstables to a method	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	46507f76a6	db/view/view_update_generator: ignore tablet-based sstables Staging sstables for tablet-based tables are now handled by view_building_worker, so they need to be ignored by the generator.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	201c4fafec	db/view/view_building_coordinator: update view build status on node join/left Copy view build status for new node for tablet views and remove relevant statuses when a node is leaving the cluster.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	b855786804	db/view/view_building_coordinator: handle tablet operations If the view building coordinator is running, adjust view_building_tasks in case of tablet operations. The mutations are generated in the same batch as tablet mutations. At the start of tablet migration/resize/RF change, started view building tasks are aborted (by setting ABORTED state) if needed. Then, new adjusted tasks are created in group0 batch which ends the tablet operation and aborted tasks are removed from the table. In case the tablet operation fails or is revoked, aborted view building tasks are rolled back by creating new copies of them and aborted ones are deleted from the table. View building tasks are not aborted/changed during tablet repair, because in this case, even if vb task is started, a staging sstable will be generated.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	56df5acd77	db/view: add view building task mutation builder	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	9312cd83c5	service/topology_coordinator: run view building coordinator Run view building coordinator alongside topology coordinator once the feature is available.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	08c9e6b9bb	db/view: introduce `view_building_coordinator` The coordinator is responsible for building tablet-based views. It schedules tasks for `view_building_worker` and updates views' statuses. The tasks are scheduled in a way that one shard is processing only one tablet at most (there may be multiple tasks since a base table may have multiple views). Support for tablet operations will be added in next commits.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	2b3e1682d7	db/view/view_building_worker: update built views locally Because `system.built_views` is a node-local table, we cannot mark a view as built directly from the view building coordinator. Instead, view building worker looks at data from `system.view_build_status_v2` and updates `built_views` table accordingly.	2025-08-27 10:23:03 +02:00
Michał Jadwiszczak	c9e710dca3	db/view: introduce `view_building_worker` The worker is responsible for building tablet-based views by executing tasks scheduled by the view building coordinator. It observes view building state machine and wait on the machine's conditional variable (so the worker is woken up when group0 state is applied). The tasks are executed in batches, all tasks in one batch need to have the same: type, base_id, table_id. One shard can only execute one batch at a time (at least for now, in the future we might want to change that). That worker keeps track of finished and failed tasks in its local state. The state is cleared when `view_building_state::currently_processed_base_table` is changed.	2025-08-27 10:22:59 +02:00
Emil Maskovsky	cfc87746b6	storage: pass host_id as parameter to `maybe_reconnect_to_preferred_ip()` Previously, `maybe_reconnect_to_preferred_ip()` retrieved the host ID using `gossiper::get_host_id()`. Since the host ID is already available in the calling function, we now pass it directly as a parameter. This change simplifies the code and eliminates a potential race condition where `gossiper::get_host_id()` could fail, as described in scylladb/scylla#25621. Refs: scylladb/scylla#25621 Backport: Recommended for 2025.x release branches to avoid potential issues from unnecessary calls to `gossiper::get_host_id()` in subscribers. Closes scylladb/scylladb#25662	2025-08-27 10:35:46 +03:00
Michał Jadwiszczak	a59624c604	db/view: extract common view building functionalities Extract common methods of view builder consumer to an abstract class and `flush_base()` and `make_partition_slice()` functions, so they can be used in view builder (vnode-based views) and view building consumer (tablet-based views; introduced in the next commit).	2025-08-27 08:55:48 +02:00
Michał Jadwiszczak	f71594738e	db/view: prepare to create abstract `view_consumer` In next commit, I'm going to introduce `view_building_worker::consumer`, with very similar functionalities to `view_builder::consumer` but it'll only consume range of one tablet per execution. Since most functions are very similar, I'll create abstract `view_consumer` which will be base for both of the consumers. In order to make the transition more readable, this commit prepares the `view_builder::consumer` by making some functions virtual and next commit will extract part of functions to the abstract class.	2025-08-27 08:55:48 +02:00
Michał Jadwiszczak	e901b6fde4	message/messaging_service: add `work_on_view_building_tasks` RPC The RPC will be used by view building coordinator to attach to and wait for tasks performed by view building worker (introduced in later commit). The RPC gets vector of tasks' ids and returns vector of `view_task_result`s. i-th task result refers to i-th task id.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	7c73792194	service/topology_coordinator: make `term_changed_error` public View building coordinator may also throw `term_changed_error`.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	6e3e287a39	db/schema_tables: create/cleanup tasks when an index is created/dropped Similarly as in previous commits, create view building tasks when an index is created and cleanup view building status when it's dropped.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	76caaea3f1	service/migration_manager: cleanup view building state on drop keyspace When a keyspace is dropped, remove all unfinished building tasks for all views and remove their entries from `system.view_build_status_v2` and `system.built_views`.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	f10c5c4493	service/migration_manager: cleanup view building state on drop view When a view is dropped, remove all unfinished building tasks, remove entries from `system.view_build_status_v2` and `system.built_views`. If the view is currently being built, removing its tasks means they are also aborted. Finished tasks are already removed from the table.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	6d1fbf06ed	service/migration_manager: create view building tasks on create view Create view building tasks in the same batch as new view mutations. The tasks are created only if `VIEW_BUILDING_COORDINATOR` feature is on and the view is in tablet keyspace.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	19651b4978	test/boost: enable proxy remote in some tests After a few next patches, creating/dropping a view in tablet keyspace will require a remote proxy to obtain references to system keyspace and view building state. Because of this, remote proxy needs to be explicitly enabled in boost tests which create views.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	204f61ffe1	service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()` The reference is needed to get `view_building_state_machine`.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	76a6dd82fd	service/migration_manager: coroutinize `prepare_new_view_announcement()`	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	d2e1b6d44a	service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine` Those references are needed to manage view building tasks while a view is created/dropped.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	f2e7051a84	service: reload `view_building_state_machine` on group0 apply() The state may be also reloaded on `topology_change` or `mixed_change` because topology coordinator may change view building tasks during tablet operations.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	d5d81591db	service/vb_coordinator: add currently processing base The view building coordinator will be building all views of one base table at a time. Select first available base table as currently processing base and save this information to `system.scylla_local`.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	e0184377ca	db/system_keyspace: move `get_scylla_local_mutation()` up So it can be used by all helper functions, keeping the logical order from the header file.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	46a24d960d	db/system_keyspace: add `view_building_tasks` table The table is managed by group0 and uses schema commitlog. The commit also includes helper functions.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	44890a52a8	db/view: add view_building_state and views_state `view_building_state` holds mapping of `view_building_task`s for tablet-based views. The structure is a memory representation of data stored in group0 tables. `views_state` holds information about tablet-based views and their build status.	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	ce1890e512	db/system_keyspace: add method to get view build status map	2025-08-27 08:55:47 +02:00
Michał Jadwiszczak	f90dd522df	db/view: extract `system.view_build_status_v2` cql statements to system_keyspace Until now, all changes to `system.view_build_status_v2` were made from view.cc and the file contained all of the helper methods. This commit introduces a `build_status` enum class to avoid using hardcoded strings and extracts the helper methods to `system_keyspace` class, so they can be later used by the view building coordinator.	2025-08-27 08:55:46 +02:00
Michał Jadwiszczak	18609adfce	db/system_keyspace: move `internal_system_query_state()` function earlier So it can be used in all system keyspace proxy methods while maintaining the same order as in the header file.	2025-08-27 08:55:46 +02:00
Michał Jadwiszczak	d0826e7cb1	db/view: ignore tablet-based views in `view_builder` View building of tablet-based views will be handled by the view building coordinator later in this patch.	2025-08-27 08:55:46 +02:00
Michał Jadwiszczak	7dba3667c9	gms/feature_service: add VIEW_BUILDING_COORDINATOR feature	2025-08-27 08:55:46 +02:00
Nadav Har'El	e2c99436cf	Merge 'cdc, vector_search: enable CDC when the index is created' from Dawid Pawlik When a vector index is created in Scylla, it is initially built using a full scan of the database. After that, it stays up to date by tracking changes through CDC, which should be automatically enabled when the vector index is created. When a user attempts to enable Vector Search (VS), the system checks whether Change Data Capture (CDC) is enabled and properly configured: 1. CDC is not enabled - CDC is automatically enabled with the minimum required TTL (Time-to-Live) for VS (24 hours) and the delta mode set to 'full' or post-image is enabled. - If the user later tries to reduce the CDC TTL below 24 hours or set delta mode to 'keys' with post-image disabled, the action fails. - Error message: Clearly states that CDC TTL must be at least 24 hours and delta mode must be set to 'full' or post-image must be enabled for VS to function. 2. CDC is already enabled - If CDC TTL is ≥ 24 hours and delta mode is set to 'full' or post-image is enabled: VS is enabled successfully. - If CDC TTL is < 24 hours or delta mode is set to 'keys' with post-image disabled: The VS enabling process fails. - Error message: Informs the user that CDC TTL must be at least 24 hours, delta mode must be set to 'full' or post-image must be enabled, and provides a link to documentation on how to update the TTL, delta mode, and post-image. When a user attempts to disable CDC when VS is enabled, the action will fail and the user will be informed by error message that clearly states that VS needs to be disabled (vector indexes have to be dropped) first. Full setup requirements and steps will be detailed in the documentation of Vector Search. 
Co-authored-by: @smoczy123 Fixes: VECTOR-27 Fixes: VECTOR-25 Closes scylladb/scylladb#25179 * github.com:scylladb/scylladb: test/cqlpy: ensure Vector Search CDC options test/boost: adjust CDC boost tests for Vector Search test/cql: add Vector Search CDC enable/disable test cdc, vector_index: provide minimal option setup for Vector Search test/cqlpy: adjust describe table tests with CDC for Vector Search describe, cdc: adjust describe for cdc log tables cdc: enable CDC log when vector index is created test/cqlpy: run vector_index tests only on vnodes vector_index: check if vector index exists in schema	2025-08-26 23:01:32 +03:00
Dawid Mędrek	fc1c41536c	service/qos: Move effective SL cache to auth_integration Since `auth_integration` manages effective service levels, let's move the relevant cache from `service_level_controller` to it.	2025-08-26 18:41:48 +02:00
Dawid Mędrek	dd5a35dc67	service/qos: Add auth::service to auth_integration The new service, `auth_integration`, has taken over the responsibility over managing effective service levels from `service_level_controller`. However, before these changes, it still accessed `auth::service` via the service level controller. Let's change that. Note that we also remove a check that `auth::service` has been initialized. It's not necessary anymore because the lifetime of `auth_integration` is strictly nested within the lifetime of `auth::service`. In actuality, `service_level_controller` should lose its reference to `auth::service` completely. All of the management over effective service levels has already been moved to `auth_integration`. However, the reference is still needed when dropping a distributed service level because we need to update the corresponding attribute for relevant roles. That should not lead to invalid accesses, though. Dropping a service level should not be possible when `auth::service` is not initialized.	2025-08-26 18:41:43 +02:00
Dawid Mędrek	e929279d74	service/qos: Reload effective SL cache conditionally Since `service_level_controller` outlives `auth_integration`, it may happen that we try to access it when it has already been deinitialized. To prevent that, we only try to reload or clear the effective service level cache when the object is still alive. These changes solve an existing problem with an invalid memory access. For more context, see issue scylladb/scylladb#24792. We provide a reproducer test that consistently fails before these changes but passes after them. Fixes scylladb/scylladb#24792	2025-08-26 18:41:40 +02:00
Dawid Mędrek	34afb6cdd9	service/qos: Add gate to auth_integration We add a named gate to `auth_integration` that will aid us in synchronizing ongoing tasks with stopping the service.	2025-08-26 18:41:37 +02:00
Dawid Mędrek	7d0086b093	service/qos: Introduce auth_integration We introduce a new type, `auth_integration`, that will be used internally by `service_level_controller`. Its purpose is to take over the responsibility over managing effective service levels. The main problem of the current implementation of service level controller is its dependency on `auth::service` whose lifetime is strictly nested within the lifetime of service level controller. That may lead, and already has led, to invalid memory accesses; for an example, see issue scylladb/scylladb#24792. Our strategy is to split service level controller into smaller parts and ensure that we access `auth::service` only when it's valid to do so. This commit is the first step towards that. We don't change anything in the logic yet, just add the new type. Further adjustments will be made in following commits.	2025-08-26 18:41:34 +02:00
Pavel Emelyanov	67b63768e4	api: Capture and use db in cache_service handlers Now the sharded<database>& argument is there, so it can replace ctx one on handlers lambdas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-26 11:50:11 +03:00
Pavel Emelyanov	596d4640ff	api: Add sharded<database>& arg to set_cache_service() The reference is already available in set_server_column_family(), pass it further so that "cache" handlers are able to use it (next patch). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-26 11:49:35 +03:00
Pavel Emelyanov	4e556214ba	api: Squash (un)set_cache_service into ..._column_family The set_server_column_family() registers API handlers that work with replica::database. The set_server_cache() does the very same thing, but registers handlers with some other prefix. Squash the latter into former, later "cache" handlers will also make use of the database reference argument that's already available in ..._column_family() setter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-26 11:46:48 +03:00
Pavel Emelyanov	1b4b539706	api: Coroutinize set_server_column_family() To facilitate next patching Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-26 11:46:33 +03:00
Dario Mirovic	8b0a551177	test/cqlpy: add unknown compression algorithm test case Add `test_unknown_compression_algorithm` test case to `test_protocol_exceptions.py` test suite. This change improves test coverage for zero throws protocol exception handling. Refs: #24567	2025-08-25 13:31:40 +02:00
Avi Kivity	6dc2c42f8b	alternator: streams: refactor std::views::transform with side effect std::views::transform()s should not have side effects since they could be called several times, depending on the algorithm they're paired with. For example, std::ranges::to() can run the algorithm once to measure the resulting container size, and then a second time to copy the data (avoiding reallocations). If that happens, then the side-effect happens twice. Avoid this by refactoring the code. Make the side-effect -- appending to the `column` vector -- happen first, then use that result to generate the `regular_column` vector. In this case, the side effect did not happen twice because small_vector's std::from_range_t constructor only reserves if the input range is sized (and it is not), but better not have the weakness in the code. Closes scylladb/scylladb#25011	2025-08-25 09:40:05 +03:00
Michael Litvak	25fb3b49fa	dist/docker: add dc and rack arguments add --dc and --rack commandline arguments to the scylla docker image, to allow starting a node with a specified dc and rack names in a simple way. This is useful mostly for small examples and demonstrations of starting multiple nodes with different racks, when we prefer not to bother with editing configuration files. The ability to assign nodes to different racks is especially important with RF=Rack enforcing. The previous method to achieve this is to set the snitch to GossipingPropertyFileSnitch and provide a configuration file in /etc/scylla/cassandra-rackdc.properties with the name of the dc and rack. The new dc and rack parameters are implemented similarly by using the snitch GossipingPropertyFileSnitch and writing the dc and rack values to the rackdc properties file. We don't support passing the parameters together with a different snitch, or when mounting a properties file from the host, because we don't want to overwrite it. Example: docker run -d --name scylla1 scylladb/scylla --dc my_dc1 --rack my_rack1 Fixes scylladb/scylladb#23423 Closes scylladb/scylladb#25607	2025-08-24 17:48:07 +03:00
Nadav Har'El	87dd96f9a2	Merge ' Alternator: DynamoDB compatible WCU Calculation via Read-Before-Write Support' from Amnon Heiman This series adds support for a DynamoDB-compatible Write Capacity Unit (WCU) calculation in Alternator by introducing an optional forced read-before-write mechanism. Alternator's model differs from DynamoDB, and as a result, some write operations may report lower WCU usage compared to what DynamoDB would report. While this is acceptable in many cases, there are scenarios where users may require accurate WCU reporting that aligns more closely with DynamoDB's behavior. To address this, a new configuration option, alternator_force_read_before_write, is introduced. When enabled, Alternator will perform a read before executing PutItem, UpdateItem, and DeleteItem operations. This allows it to take the existing item size into account when computing the WCU. BatchWriteItem support is also extended to use this mechanism. Because BatchWriteItem does not support returning old items directly, several internal changes were made to support reading previous item sizes with minimal overhead. Reads are performed at consistency level LOCAL_ONE for efficiency, and the WCU calculation is now done in multiple stages to accurately account for item size differences. In addition to the implementation changes, test coverage was added to validate the new behavior. These tests confirm that WCU is calculated based on the larger of the old and new items when read-before-write is active, including for BatchWriteItem. This feature comes with performance overhead and is therefore disabled by default. It can be enabled at runtime via the system.config table and should be used only when precise WCU tracking is necessary. 
New feature, no need to backport Closes scylladb/scylladb#24436 * github.com:scylladb/scylladb: alternator/test_returnconsumedcapacity.py: Test forced read before write alternator/executor.cc: DynamoDB WCU calculation in BatchWriteItem using read-before-write executor.cc: get_previous_item with consistency level executor: Extend API of put_or_delete_item alternator/executor.cc: Accurate WCU for put, update, delete config: add alternator_force_read_before_write	2025-08-24 11:38:24 +03:00
Avi Kivity	8815491085	treewide: include boost headers as "system" headers Boost is external to the project so treat its headers as "system" headers and include them with angle brackets. Closes scylladb/scylladb#25619	2025-08-22 17:21:24 +03:00
Piotr Dulikowski	5709d94826	Merge 'cql3: Warn when creating RF-rack-invalid keyspace' from Dawid Mędrek Although RF-rack-valid keyspaces are not universally enforced yet (they're governed by the configuration option `rf_rack_valid_keyspaces`), we'd like to encourage the user to abide by the restriction. To that end, we're introducing a warning when creating or altering a keyspace. If the configuration option is disabled, but the user is trying to create an RF-rack-invalid keyspace, they'll receive a warning. If the option is turned off, we will also log all of the RF-rack-invalid keyspaces at start-up. We provide validation tests. Fixes scylladb/scylladb#23330 Backport: we'd like to encourage the user to abide by the restriction even when they don't enforce it to make it easier in the future to adjust the schema when there's no way to disable it anymore. Because of that, we'd like to backport it to all relevant versions, starting with 2025.1. Closes scylladb/scylladb#24785 * github.com:scylladb/scylladb: main: Log RF-rack-invalid keyspaces at startup cql3/statements: Fix indentation cql3: Warn when creating RF-rack-invalid keyspace	2025-08-22 11:33:32 +02:00
Evgeniy Naydanov	ab15c94a09	test.py: dtest/commitlog_test: add test_pinned_cl_segment_doesnt_resurrect_data test_pinned_cl_segment_doesnt_resurrect_data was not moved in #24946 from scylla-dtest to this repo, because it's marked as xfail (#14879), but actually the issue is fixed and there is no reason to keep the test in scylla-dtest. Also remove unused imports. Closes scylladb/scylladb#25592	2025-08-22 11:30:10 +03:00
Raphael S. Carvalho	149f9d8448	replica: Fix race between drop table and merge completion handling Consider this: 1) merge finishes, wakes up fiber to merge compaction groups 2) drop table happens, which in turn invokes truncate underneath 3) merge fiber stops old groups 4) truncate disables compaction on all groups, but the ones stopped 5) truncate performs a check that compaction has been disabled on all groups, including the ones stopped 6) the check fails because groups being stopped didn't have compaction explicitly disabled on them To fix it, the check on step 6 will ignore groups that have been stopped, since those are not eligible for having compaction explicitly disabled on them. The compaction check is there, so ongoing compaction will not propagate data being truncated, but here it happens in the context of drop table which doesn't leave anything behind. Also, a group stopped is somewhat equivalent to compaction disabled on it, since the procedure to stop a group stops all ongoing compaction and eventually removes its state from compaction manager. Fixes #25551. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#25563	2025-08-22 10:19:43 +03:00
kendrick-ren	d6e62aeb6a	Update launch-on-gcp.rst Add the missing '=' mark in --zone option. Otherwise the command complains. Closes scylladb/scylladb#25471	2025-08-22 10:13:52 +03:00
Botond Dénes	3dcb596201	Merge 'test: properly unset recovery_leader in the recovery procedure tests' from Patryk Jędrzejczak After changing the type of the `recovery_leader` config option from `sstring` to `UUID` in #25032, setting `recovery_leader` to an empty string became an incorrect way to unset it. The following error started to appear in the recovery procedure tests: ``` init - marshaling error: UUID string size mismatch: '' : recovery_leader ``` We unset `recovery_leader` properly in this PR. To do it, we introduce a simple way to remove config options in tests. Backport is unneeded. This error was harmless, and Scylla ignored `recovery_leader` after logging the error as expected by the tests. Closes scylladb/scylladb#25365 * github.com:scylladb/scylladb: test: properly unset recovery_leader in the recovery procedure tests test: manager_client: allow removing a config option test: manager_client: add docstring to server_update_config	2025-08-22 10:09:37 +03:00
Benny Halevy	45c496c276	api: storage_service: fix token_range documentation Note that the token_range type is used only by describe_ring. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25609	2025-08-22 10:06:21 +03:00
Patryk Jędrzejczak	193a74576a	test/cluster/conftest: cluster_con: provide default values for port and use_ssl Some cluster tests use `cluster_con` when they need a different load balancing policy or auth provider. However, no test uses a port other than 9042 or enables SSL, but all tests must pass `9042, False` because these parameters don't have default values. This makes the code more verbose. Also, it's quite obvious that 9042 stands for port, but it's not obvious what `False` is related to, so there is a need to check the definition of `cluster_con` while reading any test that uses it. No reason to backport, it's only a minor refactoring. Closes scylladb/scylladb#25516	2025-08-22 09:51:24 +03:00
David Garcia	07d798a59d	docs: fix sidebar on local preview Closes scylladb/scylladb#25560	2025-08-22 09:50:07 +03:00
David Garcia	c3c70ba73f	docs: expose alternator metrics Renders in the docs some metrics introduced in https://github.com/scylladb/scylladb/pull/24046/files that were not being displayed in https://docs.scylladb.com/manual/stable/reference/metrics.html Closes scylladb/scylladb#25561	2025-08-22 09:49:52 +03:00
David Garcia	461a0bad8a	docs: do not show any version warning for upgrade guide pages Closes scylladb/scylladb#25562	2025-08-22 09:49:27 +03:00
Avi Kivity	e140bd6355	Update seastar submodule * seastar 1520326e6...0a90f7945 (13): > include: Keep linux-aio.hh deeper in internal > include: Move array_map.hh to util/internal/ > io_tester: Add support for scheduling supergroups > Merge 'tls: Force buffer splitting into gnutls record block sized chunks' from Calle Wilund > Add iotune --get-best-iops-with-buffer-sizes option > noncopyable_function: use memcpy instead of bytewise copy loop > test: Make fair_queue_test validation code use BOOST_CHECK_...-s > test: Rework test_fair_queue_random_run > Merge 'Remove capacity configuration for fair_queue tests' from Pavel Emelyanov > reactor: replace boost::barrier with std::barrier<> > rpc: server::process(): reindent > test: Remove no-op dispatch from fair_queue ticker > Merge 'json: API level 8: use noncopyable_function in json_return_type' from Benny (#2921) Closes scylladb/scylladb#25624	2025-08-22 09:41:02 +03:00
Andrzej Jackowski	86fc513bd9	auth: allow dropping roles in saslauthd_authenticator Before this change, `saslauthd_authenticator` prevented dropping roles. The current documentation instructs users to `Ensure Scylla has the same users and roles as listed in the LDAP directory`. Therefore, ScyllaDB should allow dropping roles so administrators can remove obsolete roles from both LDAP and ScyllaDB. The code change is minimal — dropping a role is a no-op, similar to the existing no-op implementations for successful `create` and `alter` operations. `saslauthd_authenticator_test` is updated to verify that dropping a role doesn't throw anymore. Fixes: scylladb/scylladb#25571 Closes scylladb/scylladb#25574	2025-08-22 09:40:44 +03:00
Yaron Kaikov	c0fd3deeab	github: Enhance label sync to support P0/P1 priority labels Extend the existing label synchronization system to handle P0 and P1 priority labels in addition to backport/* labels: - Add P0/P1 label syncing between issues and PRs bidirectionally - Automatically add 'force_on_cloud' label to PRs when P0/P1 labels are present (either copied from issues or added directly) The workflow now triggers on P0 and P1 label events in addition to backport/* labels, ensuring priority labels are properly reflected across the entire PR lifecycle. Refs: https://github.com/scylladb/scylla-pkg/issues/5383 Closes scylladb/scylladb#25604	2025-08-22 06:50:13 +03:00
Dawid Mędrek	837d267cbf	main: Log RF-rack-invalid keyspaces at startup When the configuration option `rf_rack_valid_keyspaces` is enabled and there is an RF-rack-invalid keyspace, starting a node fails. However, when the configuration option is disabled, but there still is a keyspace that violates the condition, we'd like Scylla to print a warning informing the user about the fact. That's what happens in this commit. We provide a validation test.	2025-08-21 19:35:33 +02:00
Dawid Mędrek	af8a3dd17b	cql3/statements: Fix indentation	2025-08-21 19:29:36 +02:00
Dawid Mędrek	60ea22d887	cql3: Warn when creating RF-rack-invalid keyspace Although RF-rack-valid keyspaces are not universally enforced yet (they're governed by the configuration option `rf_rack_valid_keyspaces`), we'd like to encourage the user to abide by the restriction. To that end, we're introducing a warning when creating or altering a keyspace. If the configuration option is disabled, but the user is trying to create an RF-rack-invalid keyspace, they'll receive a warning. We provide a validation test.	2025-08-21 19:29:33 +02:00
Evgeniy Naydanov	3a98331731	test.py: don't fail if use multiple tests from one dir in commandline There is the stash item REPEATED_FILES for directory items which is used to cut recursion. But if multiple tests from one directory are added to the ./test.py commandline, this solution prevents handling non-first tests well because the directory was already collected for the first one. Change behavior to not store all repeated files in the stash but just files which are in the process of repetition. Rename the stash item to REPEATING_FILES to reflect this change. Closes scylladb/scylladb#25611	2025-08-21 19:43:13 +03:00
Pawel Pery	509f5ddb89	vector_store_client: set keepalive for the http client's socket Checking for dead peers or network is helpful in maintaining a lifetime of the http client. This patch sets TCP_KEEPALIVE option on the http client's socket.	2025-08-21 16:22:30 +02:00
Pawel Pery	4b459c6855	vector_store_client: disable Nagle's algorithm on the http client Nagle’s algorithm and Delayed ACK’s algorithm are enabled by default on sockets in Linux. As a result we can experience 40ms latency on simply waiting for ACK on the client side. Disabling the Nagle’s algorithm (using TCP_NODELAY) should fix the issue (client won’t wait 40ms for ACKs). This patch sets `TCP_NODELAY` on every socket created by the `http_client`.	2025-08-21 16:20:26 +02:00
Dawid Pawlik	01e7a48030	vector_store_client: fix HTTP error message formatting Content of the HTTP error was logged in Scylla as literal list of chars (default temporary buffer formatting). Changed to print the sstring made out of temporary buffer, which fixes the problem with formatting, making the output clear and readable for humans. Fixes: VECTOR-141 Closes scylladb/scylladb#25329	2025-08-21 14:33:41 +02:00
Botond Dénes	09dc285b4a	Merge 'Remove redis from scylla source tree' from Ran Regev - remove redis documentation First, remove the redis documentation. - remove ./redis and dependencies Second, remove the redis directory and its dependencies from the project. Fixes: #25144 This is a cleanup, no need to backport. Closes scylladb/scylladb#25148 * github.com:scylladb/scylladb: remove ./redis and dependencies remove redis documentation	2025-08-21 14:26:11 +03:00
Benny Halevy	d6ca393928	test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints Following up on `6129411a5e` improve test_vnode_keyspace_describe_ring by verifying that the endpoints listed by describe_ring match those returned by the `natural_endpoints` api (for random tokens). The latter are calculated using an independent code path directly from the effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-21 11:48:17 +03:00
Benny Halevy	e34980ac87	test/pylib/rest_client: add natural_endpoints function Invoke the `/storage_service/natural_endpoints/{keyspace}` api Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-21 11:48:17 +03:00
Pavel Emelyanov	47750496d2	Merge 'test.py: metrics: add host_id suffix to .db file' from Evgeniy Naydanov CI can run several test.py sessions on different machines (builders) for one build, and to avoid being overwritten, the .db file with metrics needs to have a unique name: add host_id as we already do for .xml report in `run_pytest()` Also add host_id columns to metric tables in case we will somehow aggregate .db files. Add host_id suffix to `toxiproxy_server.log` for the same reason. Fixes: https://github.com/scylladb/scylladb/issues/25462 Closes scylladb/scylladb#25542 * github.com:scylladb/scylladb: test.py: add host_id suffix to toxiproxy_server.log test.py: metrics: add host_id suffix to .db file	2025-08-21 11:34:47 +03:00
Robert Bindar	3291a5cc75	Fix dbuild boost::gregorian usage error On my dbuild runs, compiler complained about no member "gregorian" in namespace boost in the user_function_test.cc file. Was also noticed in CI. Signed-off-by: Robert Bindar <robert.bindar@scylladb.com> Closes scylladb/scylladb#25593	2025-08-21 11:32:47 +03:00
Petr Gusev	f261b4594d	ip_address_updater: call raft_topology_update_ip even if ip hasn't changed Previously, the prev_ip check caused problems for bootstrapping nodes. Suppose a bootstrapping node A appears in the system.peers table of some other node B. Its record has only ID and IP of the node A, due to the special handling of bootstrapping nodes in raft_topology_update_ip. Suppose node B gets temporarily isolated from the topology coordinator. The topology coordinator fences out node B and successfully finishes bootstrapping of the node A. Later, when the connectivity is restored, topology_state_load runs on the node B, node A is already in normal state, but the gossiper on B might not have any state for it yet. In this case, raft_topology_update_ip would not update system.peers because the gossiper state is missing. Subsequently, on_join/on_restart/on_alive events would skip updates because the IP in gossiper matches the IP for that node in system.peers. Removing the check avoids this issue, with negligible overhead: * on_join/on_restart/on_alive happen only once in a node’s lifetime * topology_state_load already updates all nodes each time it runs. This problem was found by a fencing test, which crashed a node while another node was going through the bootstrapping process. After restart the node saw that the other node is already in normal state, since the topology coordinator fenced out this node and managed to finish the bootstrapping process successfully. This test will be provided in a separate fencing-for-paxos PR. Closes scylladb/scylladb#25596	2025-08-21 10:02:06 +02:00
Ernest Zaslavsky	4bee0491ba	cmake: Add missing `incremental.cc` to `repair/CMakeLists.txt` Add `incremental.cc` to `repair/CMakeLists.txt` to fix CMake based build Closes scylladb/scylladb#25601	2025-08-21 09:40:36 +03:00
Asias He	b12404ba52	streaming: Enclose potential throws in try block and ensure sink close before logging - Move the initialization of log_done inside the try block to catch any exceptions it may throw. - Relocate the failure warning log after sink.close() cleanup to guarantee sink.close() is always called before logging errors. Refs #25497 Closes scylladb/scylladb#25591	2025-08-20 19:46:56 +02:00
Dawid Pawlik	9463ac10e2	test/cqlpy: ensure Vector Search CDC options Add a test to check that the minimal options for Vector Search are enforced and that creating CDC in violation of the minimal requirements is disallowed.	2025-08-20 17:20:38 +02:00
Dawid Pawlik	61c7b935e1	test/boost: adjust CDC boost tests for Vector Search Adjust name conflict and permissions tests when enabling CDC for Vector Search. Add a test that checks if CDC with a vector column is set up properly.	2025-08-20 17:20:37 +02:00
Dawid Pawlik	45a4714ab8	test/cql: add Vector Search CDC enable/disable test Add CQL test for the automatic enablement of CDC log when creating an index on vector column using 'vector_index' custom class. Check if the logging is disabled after index is dropped.	2025-08-20 17:20:37 +02:00
Dawid Pawlik	a27eef9f18	cdc, vector_index: provide minimal option setup for Vector Search Ensure that the CDC used by Vector Search has at least 24h TTL and delta mode is set to 'full' or postimage is enabled. This setup is required by the Vector Store to work as intended. The TTL of at least 24h is a rough estimate of the maximal time needed for the full scan conducted by Vector Store to finish. The delta mode set to 'full' or postimage enabled is needed to read the values of vectors being written to the table, so Vector Store can save them in the desired external index. As the default we set TTL = 24h, delta = 'full', postimage = false. Full delta is the preferred option to log the vector values as it is less costly and does not require an additional read on write.	2025-08-20 17:20:20 +02:00
Ran Regev	ebf1db5c5e	remove ./redis and dependencies Remove ./redis and all its usages. This is the second commit that removes ./redis from Scylla Signed-off-by: Ran Regev <ran.regev@scylladb.com>	2025-08-20 17:53:23 +03:00
Ran Regev	6eca083137	remove redis documentation As part of removing redis from the Scylla source tree, this commit removes all related documentation. The following commit removes the code itself. Signed-off-by: Ran Regev <ran.regev@scylladb.com>	2025-08-20 17:53:23 +03:00
Benny Halevy	6129411a5e	locator: utils: get_all_ranges, construct_range_to_endpoint_map: use end-bound ranges Commit `60d2cc886a` changed get_all_ranges to return start-bound ranges and pre-calculate the wrapping range, and then construct_range_to_endpoint_map to pass r.start() (that is now always engaged) as the vnode token. However, as can be seen in token_metadata_impl::first_token the token ranges (a.k.a. vnodes) end with the sorted tokens, not start with them, so an arbitrary token t belongs to a vnode in some range `sorted_tokens[i-1] < t <= sorted_tokens[i]` Fixes #25541 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#25580	2025-08-20 15:15:40 +02:00
Dawid Pawlik	38d568c7c0	test/cqlpy: adjust describe table tests with CDC for Vector Search Run tests describing CDC tables both with standard and vector index created CDC log enablement. Adjust the test message of CDC log describe statement. Mark `test_desc_restore` as failing due to the #25187 bug.	2025-08-20 12:38:52 +02:00
Dawid Pawlik	35b82e6d2f	describe, cdc: adjust describe for cdc log tables Make CDC log table describe mention that it can be created by creating the vector index on base table's vector column.	2025-08-20 12:38:52 +02:00
Dawid Pawlik	af2a544395	cdc: enable CDC log when vector index is created Enable CDC log table when creating an index on vector column using 'vector_index' custom index class.	2025-08-20 12:38:52 +02:00
Dawid Pawlik	461d820fbb	test/cqlpy: run vector_index tests only on vnodes When creating an index on a vector column using the 'vector_index' class, the CDC log is created as it is required for Vector Search. Because CDC does not yet work with tablets enabled (Refs #16317), we have to mark the tests as failing on tablets and run them on vnodes to make sure the vector index tests continue to pass.	2025-08-20 12:38:52 +02:00
Avi Kivity	eefb6a0642	Merge 'storage_proxy: node_local_only: always use my_host_id' from Petr Gusev The previous implementation did not handle topology changes well: * In `node_local_only` mode with CL=1, if the current node is pending, the CL is increased to 2, causing `unavailable_exception`. * If the current tablet is in `write_both_read_old` and we try to read with `node_local_only` on the new node, the replica list will be empty. This patch changes `node_local_only` mode to always use `my_host_id` as the replica list. An explicit check ensures the current node is a replica for the operation; otherwise `on_internal_error` is called. backport: not needed, since `node_local_only` is only used in LWT for tablets and it hasn't been released yet. Closes scylladb/scylladb#25508 * github.com:scylladb/scylladb: test_tablets_lwt: add test_lwt_during_migration storage_proxy: node_local_only: always use my_host_id	2025-08-20 12:11:44 +03:00
Avi Kivity	34f661e5aa	Merge 'Make api/column_family endpoints capture and use sharded<database>' from Pavel Emelyanov The http_context object carries a sharded<database> reference and all handlers in the api/ code can use it the way they want. This creates potential use-after-free, because the context is initialized very early and is destroyed very late. All other services are used by handlers differently -- after a service is initialized, the relevant endpoints are registered and the service reference is captured on handlers. Since endpoint deregistration is defer-scheduled at the same place, this guarantees that handlers cannot use the service after it's stopped. This PR does the same for api/ handlers -- the sharded<database> reference is captured inside set_server_column_family() and then used by handlers lambdas. Similar changes for other services: #21053, #19417, #15831, etc It's a part of the on-going cleanup of service dependencies, no need to backport Closes scylladb/scylladb#25467 * github.com:scylladb/scylladb: api/column_family: Capture sharded<database> to call get_cf_stats() api: Patch get_cf_stats to get sharded<database>& argument api: Drop CF map-reducers ability to work with http context api: Patch callers of map_reduce_cf(_raw)? to use sharded<database> api: Use captured sharded<database> reference in handlers api/column_family: Make map_reduce_cf_time_histogram() use sharded<database> api/column_famliy: Make sum_sstable() use sharded<database> api/column_family: Make get_cf_unleveled_sstables() use sharded<database> api/column_famliy: Make get_cf_stats_count() use sharded<database> api/column_family: Make get_cf_rate_and_histogram() use sharded<database> api/column_family: Make get_cf_histogram() use sharded<database> api/column_family: Make get_cf_stats_sum() use sharded<database> api/column_family: Make set_tables_tombstone_gc() use sharded<database> api/column_family: Make set_tables_autocompaction() use sharded<database> api/column_family: Make for_tables_on_all_shards() use sharded<database> api: Capture sharded<database> for set_server_column_family() api: Make CF map-reducers work on sharded<database> directly api: Make map_reduce_cf_time_histogram() file-local api: Remove unused ctx argument from run_toppartitions_query()	2025-08-20 12:09:39 +03:00
Dawid Pawlik	27ceb85508	vector_index: check if vector index exists in schema Add `has_vector_index` function to check if an index on vector column using 'vector_index' custom index class exists in the schema. Co-authored-by: Michał Hudobski <michal.hudobski@scylladb.com>	2025-08-20 10:35:55 +02:00
Avi Kivity	352cda4467	treewide: avoid including gms/feature_service.hh from headers To avoid dependency proliferation, switch to forward declarations. In one case, we introduce indirection via std::unique_ptr and deinline the constructor and destructor. Ref #1 Closes scylladb/scylladb#25584	2025-08-20 10:30:27 +03:00
Petr Gusev	894c8081e6	test_tablets_lwt: add test_lwt_during_migration	2025-08-19 16:11:56 +02:00
Petr Gusev	ed6bec2cac	storage_proxy: node_local_only: always use my_host_id The previous implementation did not handle topology changes well: * In node_local_only mode with CL=1, if the current node is pending, the CL is raised to 2, causing unavailable_exception. * If the current tablet is in write_both_read_old and we read with node_local_only on the new node, the replica list is empty. This patch changes node_local_only mode to always use my_host_id as the replica list. An explicit check ensures the current node is a replica for the operation; otherwise on_internal_error is called.	2025-08-19 16:11:49 +02:00
Evgeniy Naydanov	47e4d470af	test.py: add host_id suffix to toxiproxy_server.log	2025-08-19 11:33:47 +00:00
Evgeniy Naydanov	8ea49092b7	test.py: metrics: add host_id suffix to .db file CI can run several test.py sessions on different machines (builders) for one build, and to avoid being overwritten, the .db file with metrics needs a unique name: add host_id as we already do for the .xml report in run_pytest(). Also add host_id columns to metric tables in case we will somehow aggregate .db files.	2025-08-19 11:33:11 +00:00
Sayanta Banerjee	eae1869d3a	Update docs/features/cdc/cdc-streams.rst Co-authored-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2025-08-19 15:06:30 +05:30
Pavel Emelyanov	2510a7b488	api/column_family: Capture sharded<database> to call get_cf_stats() Update more handlers not to get the database from context, but to capture it directly on handlers' lambdas. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 12:04:34 +03:00
Pavel Emelyanov	dc31b68451	api: Patch get_cf_stats to get sharded<database>& argument Now it accepts http context and immediately gets the database from it to pass to map_reduce_cf. Callers are updated to pass the database from the context they already have. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 12:03:27 +03:00
Pavel Emelyanov	7933a68921	api: Drop CF map-reducers ability to work with http context This patch finalizes the change started by the previous patch of the similar title -- the map_reduce_cf(_raw) is switched to work only with sharded<replica::database> reference. All callers were updated by previous patches. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:45:24 +03:00
Pavel Emelyanov	ffd25f0c16	api: Patch callers of map_reduce_cf(_raw)? to use sharded<database> There are some of them left that still pass http_context. These handlers will eventually get their captured sharded database reference, but for now make them explicitly use one from context. This will allow to de-templatize map_reduce_cf... helpers making the code simpler. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:43:55 +03:00
Pavel Emelyanov	7e0726d55b	api: Use captured sharded<database> reference in handlers Not all of them can switch from ctx to database, so in few places both, the database and ctx, are captured. However, the ctx.db reference is no longer used by the column_family handlers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:55 +03:00
Pavel Emelyanov	720a8fef4b	api/column_family: Make map_reduce_cf_time_histogram() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:55 +03:00
Pavel Emelyanov	49cb81fb56	api/column_famliy: Make sum_sstable() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:55 +03:00
Pavel Emelyanov	3595ea7f49	api/column_family: Make get_cf_unleveled_sstables() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:55 +03:00
Pavel Emelyanov	d32ac35f60	api/column_famliy: Make get_cf_stats_count() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:55 +03:00
Pavel Emelyanov	cde39d3fc7	api/column_family: Make get_cf_rate_and_histogram() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	edc9e302e3	api/column_family: Make get_cf_histogram() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	422debbee2	api/column_family: Make get_cf_stats_sum() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	1c1fabc578	api/column_family: Make set_tables_tombstone_gc() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	0158743f5e	api/column_family: Make set_tables_autocompaction() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	f52eb7cae2	api/column_family: Make for_tables_on_all_shards() use sharded<database> Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	818a41ccdb	api: Capture sharded<database> for set_server_column_family() Similarly to other API handlers, instead of using a database from http context, patch the setting methods to capture the database from main code and pass it around to handlers. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:54 +03:00
Pavel Emelyanov	d3d217b3c9	api: Make CF map-reducers work on sharded<database> directly Next patches are going to change a bunch of map_reduce_cf_... callers to pass sharded<database> reference to it, not the http context. Not to patch all the api/ code at once, keep the ability to call it with ctx at hand. Eventually only the sharded<database>& overload will be kept. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:53 +03:00
Pavel Emelyanov	bdb7c2b014	api: Make map_reduce_cf_time_histogram() file-local It's not used outside of api/column_family.cc Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:53 +03:00
Pavel Emelyanov	b0db83575c	api: Remove unused ctx argument from run_toppartitions_query() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-08-18 11:00:53 +03:00
Amnon Heiman	a46671ac59	alternator/test_returnconsumedcapacity.py: Test forced read before write This patch introduces two test cases to validate the effect of `alternator_force_read_before_write` on WCU (Write Capacity Unit) tracking: - The first test verifies that when a smaller item replaces a larger one, the WCU reflects the size of the larger item, as expected when read-before-write is enabled. - The second test covers `BatchWriteItem` with `alternator_force_read_before_write` enabled, ensuring that WCU is computed using the actual size of existing items.	2025-08-13 18:04:03 +03:00
Amnon Heiman	ffc7171a5f	alternator/executor.cc: DynamoDB WCU calculation in BatchWriteItem using read-before-write Alternator's internal model differs from DynamoDB, which can result in lower WCU (Write Capacity Unit) estimates for some operations. While this is often acceptable, accurate WCU tracking is occasionally required. This patch enables a DynamoDB like WCU calculation for `BatchWriteItem` when the `alternator_force_read_before_write` configuration is enabled, by reading existing items before applying changes. WCU calculation in `BatchWriteItem` is now performed in three stages: 1. During the initial scan, no calculation is done, just the pointers are collected. 2. If read-before-write is enabled, each item is read again and its prior size is compared to the new value. WCU is based on the larger of the two. The updated size is stored in the mutation building array. 3. Regardless of read-before-write, metrics are updated and consumed capacity units are returned if requested. This is done in a loop before sending the mutations. For performance, reads in `BatchWriteItem` use consistency level LOCAL_ONE. These changes increase WCU DynamoDB compatibility in batch operations, but add overhead and should be enabled only when needed. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-08-13 18:03:55 +03:00
Patryk Jędrzejczak	161521b674	test: properly unset recovery_leader in the recovery procedure tests After changing the type of the `recovery_leader` config option from `sstring` to `UUID` in #25032, setting `recovery_leader` to an empty string became an incorrect way to unset it. The following error started to appear in the recovery procedure tests: ``` init - marshaling error: UUID string size mismatch: '' : recovery_leader ``` We fix it in this commit by removing `recovery_leader` from the config file.	2025-08-07 11:20:00 +02:00
Patryk Jędrzejczak	31372843e4	test: manager_client: allow removing a config option Currently, there is no simple way to remove an option from the server's config file in tests. One example when this is needed is removing the `recovery_leader` option on all servers during the recovery procedure. In this commit, we add a new method to `ManagerClient` that removes an option from the given server's config file.	2025-08-07 11:20:00 +02:00
Patryk Jędrzejczak	ce26896704	test: manager_client: add docstring to server_update_config	2025-08-07 11:19:54 +02:00
Amnon Heiman	2c02cb394b	executor.cc: get_previous_item with consistency level This patch extends get_previous_item so it can be used to calculate the size of a previous item. This will allow batch_get_item to obtain the size of a previous item without needing the item itself. The patch includes the following changes: * Removes the unneeded forward declaration of get_previous_item. * Extends get_previous_item to accept an explicit consistency level. * Modifies the regular get_previous_item to maintain the same functionality while calling the base implementation. * Adds a get_previous_item_size function that uses the base implementation to retrieve the size of a previous item when only the size is needed. For performance reasons, get_previous_item_size uses consistency level one. Signed-off-by: Amnon Heiman <amnon@scylladb.com>	2025-08-07 12:14:43 +03:00
Amnon Heiman	5abbfa1e52	executor: Extend API of put_or_delete_item This patch adds two methods to put_or_delete_item that will be used in WCU batch calculation: A setter to set the item length. A boolean getter that determines if this is a delete action.	2025-08-05 10:23:36 +03:00
Amnon Heiman	f75aa1b35e	alternator/executor.cc: Accurate WCU for put, update, delete To improve the accuracy of Write Capacity Unit (WCU) calculation, this patch introduces the use of the `alternator_force_read_before_write` configuration option. When enabled, it forces a read-before-write operation on `PutItem`, `UpdateItem`, and `DeleteItem` requests. This comes with performance overhead and should be used with caution, especially in high-throughput environments.	2025-08-05 10:01:32 +03:00
Amnon Heiman	94a3556be5	config: add alternator_force_read_before_write This patch introduces a new configuration parameter, `alternator_force_read_before_write`, which forces Alternator to perform a read-before-write on all write operations (`PutItem`, `UpdateItem`, and `DeleteItem`), even when not strictly required. Enabling this option ensures better DynamoDB compatibility in WCU calculation by accounting for the size of the existing item, as done in DynamoDB. This option introduces performance overhead and should be used with care. The parameter is runtime-configurable and can be toggled via CQL: UPDATE system.config SET value = 'true' WHERE name = 'alternator_force_read_before_write';	2025-08-05 08:31:00 +03:00
Nikos Dragazis	b186c48a65	encryption-at-rest.rst: add "Rotate Encryption Keys" section Add a new section for key rotation, offering separate instructions per key provider, organized in tabs. The gist: * Local Key Provider - Rotation requires creating a new key file per node. It's a manual procedure. * Replicated Key Provider - Rotation is not supported. * KMIP Key Provider - Rotation is transparent to Scylla, but it requires manually revoking the key in the server. * {KMS,GCP} Key Provider - Rotation is transparent to Scylla and can be automated in the server. * Azure Key Provider - Rotation is automatically supported by Scylla by keeping track of the key version along with the encrypted data. The rotation needs to be done at the Key Vault server, and can be automated. Explain that, even after rotation, old keys may be still in use due to caching, and that old SSTables will remain encrypted with the old key until the next compaction. Provide instructions in case they prefer not to wait. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:46 +03:00
Nikos Dragazis	3abacaa465	encryption-at-rest.rst: rewrite "Encrypt System Resources" section - Mention all types of system data that fall under system encryption. - Add "Before you Begin" section with requirements per key provider. The requirements are the same as in user encryption. - Mention explicitly that the Replicated Key Provider cannot be used for system encryption. - Provide separate instructions for each key provider. Explain all the configuration options. - Provide an extra example for the Local Key Provider with a ``system_key_directory`` and ``key_name``. - Highlight the code blocks as YAML. Make their indentation consistent with the rest of the doc (2 spaces). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:46 +03:00
Nikos Dragazis	c59f71b399	encryption-at-rest.rst: rewrite "Update Encryption Properties of Existing Tables" section - Split the various scenarios into sub-sections, not just examples. - Amend the example for changing cipher algorithm and key length. The algorithm used in the example was the same. - Point out that disabling encryption through the table schema is not possible if a node has default encryption configured. - Amend the `nodetool upgradesstables` command. The `--include-all-sstables` is necessary. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:46 +03:00
Nikos Dragazis	22f941b325	encryption-at-rest.rst: rewrite "Encrypt a Single Table" section - Add a short intro. - Add an early note about the fact that options from ``scylla_encryption_options`` cannot be mixed with options from ``user_info_encryption``. - Add a new "Allow Per-Table Encryption" subsection to document the ``allow_per_table_encryption`` option. - Move the top-level procedure into a new "Encrypt a New Table" subsection to differentiate it from the "Update Encryption Properties of Existing Tables". - Add tabs for provider-dependent steps in "Before you Begin" and "Procedure". - Amend "bytes" to "bits" (for the key length). - Add examples for the replicated, KMIP, GCP, and Azure key providers. Use consistent keyspace and table names in all examples. - Remove step for upgrading SSTables. The table is new - no SSTables exist yet. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:46 +03:00
Nikos Dragazis	bd83f3e672	encryption-at-rest.rst: rewrite "Encrypt Tables" section - Provide separate requirements and instructions for each key provider, organized in tabs. - Mention explicitly that the Replicated Key Provider cannot be used for default encryption. - Fix indentation for code blocks in examples (2 spaces). - For KMS, GCP, and Azure, add the `master_key` option in the list of options and remove the relevant example (not so common). - Add steps for rolling restart. - Amend "bytes" to "bits" (for the key length). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:46 +03:00
Nikos Dragazis	fb030b11c3	encryption-at-rest.rst: update "Set the Azure Host" section - Mark the `master_key` as required. Technically, it's not, since it can be specified in the schema encryption options, but: - It's better to keep it simple. The common case is to have a default value that occasionally needs to be overridden. - No functionality is lost. - It is mentioned as required for AWS and GCP. - Add a note about credential resolution. - Make some minor formatting changes to be consistent with the AWS and GCP sections. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	e25b283c8d	encryption-at-rest.rst: update "Set the GCP Host" section - Add list of requirements (KMS Key, credentials, permissions). - Add a reference to "Create Encryption Keys" section. - Amend description for `master_key`. - Add one example per credential type. - Explain how credentials are resolved if not explicitly specified in the configuration. - Fix indentation of "restart" command. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	d9242ba47f	encryption-at-rest.rst: update "Set the KMS Host" section - Add a list of requirements (KMS key, credentials, permissions). - Add a reference to "Create Encryption Keys" section. - Add one example per credential type. - Explain how credentials are resolved from the environment, or the AWS credentials file. - Fix indentation of "restart" command. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	cf9301c573	encryption-at-rest.rst: update "Set the KMIP Host" section - Uncomment the code block to match the other hosts. - Remove the ``certficate_revocation_list`` option; it's not supported. - Amend the default values for ``key_cache_expiry`` and ``key_cache_refresh``. - Add an example with mutual TLS authentication. - Fix indentation of "restart" command. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	b777dd267d	encryption-at-rest.rst: rewrite "Create Encryption Keys" section - Provide separate instructions for each key provider, organized in tabs. Move the existing instructions with the key generator script under the "Local Key Provider" tab. Point to the cloud provider's documentation for AWS, GCP, and Azure keys. List the required attributes for KMIP keys. List the required keys for the Replicated Key Provider. - In the example for the key generator script, use the same algorithm and key strength for both the secret key and the system key, since this is the recommended case. - Reorder the usage list of arguments for the key generator script. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	60df275197	encryption-at-rest.rst: rewrite "Key Providers" section - Use monospace font for key provider factories. - Add a sub-section for every key provider. Explain how they operate at a high level and highlight any possible limitations. - Remove version availability notes. The version 2019.1.3 is old and unsupported. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Nikos Dragazis	3c2f4ed1e7	encryption-at-rest.rst: hoist and update "Cipher Algorithm Descriptors" Turn an earlier reference to "algorithm descriptor" into a hyperlink. Use monospace font in the table header for "cipher_algorithm" and "secret_key_strength"; these are verbatim identifiers in "scylla.yaml" and "scylla_encryption_options". Same for their supported values. Restrict the Blowfish key size to 128 bits, due to <https://github.com/scylladb/scylla-enterprise/issues/4848>. Add notes on ECB vs. CBC, and on Blowfish's 64-bit block size. Emphasize our recommendation more. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	f07125cfea	encryption-at-rest.rst: rewrite/replace section "Encryption Key Types" - Referring to system info encryption vs. user info encryption as distinct "encryption key types" is confusing. The behavior of encryption is similar in both cases, only the sets of data that are subject to encryption differ. Rename the section to "Data Classes for Encryption". - Introduce the two highest-level "scylla.yaml" stanzas, "system_info_encryption" and "user_info_encryption". Subsequently, we'll expand on their (common!) contents later. - Remove the comment that, for the Local Key Provider, a keystore can be created either manually or automatically. This is stated / repeated elsewhere in the document. - Remove the unused anchor "_Replicated". - The notes on the Replicated Key Provider both lack nuance, and are ill-placed, here. Remove those notes. Add a dedicated description for Replicated later, elsewhere. Do mention "system_replicated_keys.encrypted_keys" here in passing, as a system table with sensitive contents. - The short listing of key providers is ill-placed here. We have an entire section dedicated to those. Furthermore, the various key providers apply to system info encryption, too. - Explain the two levels of configuration for SSTables of user tables. - Move the note about preserving keys for restoring backups to Key Providers | About Local Key Storage, at least temporarily. When keys are stored on a key management server (KMIP, GCP, AWS, Azure), then backing those up is its own admin task / responsibility. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	268f5b1564	encryption-at-rest.rst: About: describe high-level operation more precisely Clarify some table vs. SSTable differences. Spell out the SSTable metadata ("Scylla.db") component. Spell out commit log metadata files. Explain that encryption settings are "snapshotted" into those meta-files. Highlight that encryption config may vary per table and per node. (For example, a local file key provider under the same pathname on each node, referenced by the table's "scylla_encryption_options" in the schema, may provide different keys for different nodes.) Introduce "algorithm descriptor" and "key provider" as generic concepts. Touch up the grammar / vocabulary slightly. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	8717102ae5	encryption-at-rest.rst: improve wording / formatting in About intro - Remove the KMIP password from the list of system level data. Encrypting this would require the `configuration_encryptor`, which has been removed as part of the effort to decommission all our java tools. - Provide an exhaustive list of system tables being encrypted. - "Table level granularity" is redundant; either "table level" or "table granularity" should suffice. Pick the latter. - Distinguish "block cipher" from "mode of operation" more precisely. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	b45d7417ef	encryption-at-rest.rst: users (plural) typo fix scylladb presumably stores data for multiple users. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	68dfa41e69	encryption-at-rest.rst: rewrap Wrap long lines at 80 chars. Seastar coding style suggests 160 chars, but 80 chars is more comfortable for side-by-side PR diffs on GitHub. Exclude arg lists and code blocks. Set the limit at 160 chars for arg lists to avoid too much wrapping that would hurt readability. Do not wrap code blocks at all. Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com> Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2025-08-01 17:27:45 +03:00
Laszlo Ersek	54ad1fe35f	encryption-at-rest.rst: strip trailing whitespace Signed-off-by: Laszlo Ersek <laszlo.ersek@scylladb.com>	2025-08-01 17:27:45 +03:00
Michał Chojnowski	10e742df2d	storage_service: in publish_new_sstable_dict, use _group0_as instead of the main abort source We probably want to abort this call also when the node is being DRAINED, not just when it's being shut down. So use a more appropriate abort_source.	2025-07-28 12:42:37 +02:00
Michał Chojnowski	6ba2781dff	storage_service: hold group0 gate in `publish_new_sstable_dict` When I was writing this code, I assumed that `group0_client` is a friendly API that would take care of keeping the group0 server alive until the client call returns. But after examining the code of `group0_client`, I think I was wrong, and that group0 must be kept alive by external means. So this patch attempts to keep group0 alive until the `publish_new_sstable_dict` call returns.	2025-07-28 12:42:37 +02:00
Sayanta Banerjee	fa1eafa166	Small grammatical changes	2025-06-26 15:52:38 +05:30

3455 changed files with 135160 additions and 51144 deletions

9

.github/CODEOWNERS

@@ -1,5 +1,5 @@
 # AUTH
 auth/* @nuivall @ptrsmrn
 auth/* @nuivall
 # CACHE
 row_cache* @tgrabiec
@@ -25,11 +25,11 @@ compaction/* @raphaelsc
 transport/*
 # CQL QUERY LANGUAGE
 cql3/* @tgrabiec @nuivall @ptrsmrn
 cql3/* @tgrabiec @nuivall
 # COUNTERS
 counters* @nuivall @ptrsmrn
 tests/counter_test* @nuivall @ptrsmrn
 counters* @nuivall
 tests/counter_test* @nuivall
 # DOCS
 docs/* @annastuchlik @tzach
@@ -57,7 +57,6 @@ repair/* @tgrabiec @asias
 # SCHEMA MANAGEMENT
 db/schema_tables* @tgrabiec
 db/legacy_schema_migrator* @tgrabiec
 service/migration* @tgrabiec
 schema* @tgrabiec

									
										101

.github/copilot-instructions.md
									
										Normal file
												
				@@ -0,0 +1,101 @@

				# ScyllaDB Development Instructions

				## Project Context

				High-performance distributed NoSQL database. Core values: performance, correctness, readability.

				## Build System

				### Modern Build (configure.py + ninja)

				```bash

				# Configure (run once per mode, or when switching modes)

				./configure.py --mode=<mode>  # mode: dev, debug, release, sanitize

				# Build everything

				ninja <mode>-build  # e.g., ninja dev-build

				# Build Scylla binary only (sufficient for Python integration tests)

				ninja build/<mode>/scylla

				# Build specific test

				ninja build/<mode>/test/boost/<test_name>

				```

				## Running Tests

				### C++ Unit Tests

				```bash

				# Run all tests in a file

				./test.py --mode=<mode> test/<suite>/<test_name>.cc

				# Run a single test case from a file

				./test.py --mode=<mode> test/<suite>/<test_name>.cc::<test_case_name>

				# Examples

				./test.py --mode=dev test/boost/memtable_test.cc

				./test.py --mode=dev test/raft/raft_server_test.cc::test_check_abort_on_client_api

				```

				**Important:** 

				- Use full path with `.cc` extension (e.g., `test/boost/test_name.cc`, not `boost/test_name`)

				- To run a single test case, append `::<test_case_name>` to the file path

				- If you encounter permission issues with cgroup metric gathering, add `--no-gather-metrics` flag

				**Rebuilding Tests:**

				- test.py does NOT automatically rebuild when test source files are modified

				- Many tests are part of composite binaries (e.g., `combined_tests` in test/boost contains multiple test files)

				- To find which binary contains a test, check `configure.py` in the repository root (primary source) or `test/<suite>/CMakeLists.txt`

				- To rebuild a specific test binary: `ninja build/<mode>/test/<suite>/<binary_name>`

				- Examples: 

				  - `ninja build/dev/test/boost/combined_tests` (contains group0_voter_calculator_test.cc and others)

				  - `ninja build/dev/test/raft/replication_test` (standalone Raft test)

				### Python Integration Tests

				```bash

				# Only requires Scylla binary (full build usually not needed)

				ninja build/<mode>/scylla

				# Run all tests in a file

				./test.py --mode=<mode> test/<suite>/<test_name>.py

				# Run a single test case from a file

				./test.py --mode=<mode> test/<suite>/<test_name>.py::<test_function_name>

				# Run all tests in a directory

				./test.py --mode=<mode> test/<suite>/

				# Examples

				./test.py --mode=dev test/alternator/

				./test.py --mode=dev test/cluster/test_raft_voters.py::test_raft_limited_voters_retain_coordinator

				./test.py --mode=dev test/cqlpy/test_json.py

				# Optional flags

				./test.py --mode=dev test/cluster/test_raft_no_quorum.py -v  # Verbose output

				./test.py --mode=dev test/cluster/test_raft_no_quorum.py --repeat 5  # Repeat test 5 times

				```

				**Important:**

				- Use full path with `.py` extension (e.g., `test/cluster/test_raft_no_quorum.py`, not `cluster/test_raft_no_quorum`)

				- To run a single test case, append `::<test_function_name>` to the file path

				- Add `-v` for verbose output

				- Add `--repeat <num>` to repeat a test multiple times

				- After modifying C++ source files, only rebuild the Scylla binary for Python tests - building the entire repository is unnecessary
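The path rules above can be captured in a small helper. This is a minimal sketch and not part of the repository: `build_test_command` is a hypothetical name, and it only assembles the `--mode`, `-v`, and `--repeat` arguments quoted above without running anything.

```python
def build_test_command(mode, test_path, case=None, repeat=None, verbose=False):
    """Assemble a ./test.py argument list following the rules listed above.

    Hypothetical helper for illustration; it does not execute the command.
    """
    is_directory = test_path.endswith("/")
    # A file must be given with its full path and extension (.py or .cc).
    if not is_directory and not test_path.endswith((".py", ".cc")):
        raise ValueError(f"use the full path with extension, got {test_path!r}")
    # A single test case is selected by appending ::<name> to a file path.
    if case and is_directory:
        raise ValueError("a test case can only be appended to a file path")
    target = f"{test_path}::{case}" if case else test_path
    command = ["./test.py", f"--mode={mode}", target]
    if verbose:
        command.append("-v")
    if repeat is not None:
        command.extend(["--repeat", str(repeat)])
    return command
```

The returned list can be passed to `subprocess.run` as-is, which avoids shell quoting issues around the `::` separator.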

				## Code Philosophy

				- Performance matters in hot paths (data read/write, inner loops)

				- Self-documenting code through clear naming

				- Comments explain "why", not "what"

				- Prefer standard library over custom implementations

				- Strive for simplicity and clarity, add complexity only when clearly justified

				- Question requests: don't blindly implement requests - evaluate trade-offs, identify issues, and suggest better alternatives when appropriate

				- Consider different approaches, weigh pros and cons, and recommend the best fit for the specific context

				## Test Philosophy

				- Performance matters. Tests should run as quickly as possible. Sleeps in the code are highly discouraged and should be avoided, to reduce run time and flakiness.

				- Stability matters. Tests should be stable. Run each new test at least 100 times and make sure it passes every time (use `--repeat 100 --max-failures 1` when running it).

				- Unit tests should ideally test one thing and one thing only.

				- Tests for bug fixes should be run before the fix, to show the failure, and after the fix, to show that they now pass.

				- Tests for bug fixes should name, in their comments, the issue (GitHub or JIRA) that they cover.

				- Tests in debug are always slower, so if needed, reduce number of iterations, rows, data used, cycles, etc. in debug mode.

				- Tests should strive to be repeatable, and not use random input that will make their results unpredictable.

				- Tests should consume as few resources as possible. For example, prefer running a test on a single node when that is sufficient.
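One common way to follow the "avoid sleeps" guidance is to poll a condition against a deadline instead of sleeping for a fixed, pessimistic duration. The sketch below is an illustration only: `wait_until`, its parameters, and the default values are assumptions, not helpers taken from the ScyllaDB test suite.

```python
import time


def wait_until(predicate, timeout=10.0, period=0.05):
    """Poll `predicate` until it returns a truthy value or `timeout` expires.

    Hypothetical helper: returns as soon as the condition holds, so the test
    waits no longer than necessary, and raises instead of hanging forever.
    """
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout} seconds")
        # Short poll interval; the total wait is bounded by the deadline.
        time.sleep(period)
```

A test that waits for an asynchronous cleanup could call, for example, `wait_until(lambda: not os.listdir(upload_dir), timeout=30)`, where `upload_dir` is whatever directory the test observes.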

									
										2

.github/dependabot.yml
									
												
				@@ -1,6 +1,6 @@

				version: 2

				updates:

				- package-ecosystem: "pip"

				- package-ecosystem: "uv"

				  directory: "/docs"

				  schedule:

				    interval: "daily"

									
										115

.github/instructions/cpp.instructions.md
									
										Normal file
												
				@@ -0,0 +1,115 @@

				---

				applyTo: "**/*.{cc,hh}"

				---

				# C++ Guidelines

				**Important:** Always match the style and conventions of existing code in the file and directory.

				## Memory Management

				- Prefer stack allocation whenever possible

				- Use `std::unique_ptr` by default for dynamic allocations

				- `new`/`delete` are forbidden (use RAII)

				- Use `seastar::lw_shared_ptr` or `seastar::shared_ptr` for shared ownership within same shard

				- Use `seastar::foreign_ptr` for cross-shard sharing

				- Avoid `std::shared_ptr` except when interfacing with external C++ APIs

				- Avoid raw pointers except for non-owning references or C API interop

				## Seastar Asynchronous Programming

				- Use `seastar::future<T>` for all async operations

				- Prefer coroutines (`co_await`, `co_return`) over `.then()` chains for readability

				- Coroutines are preferred over `seastar::do_with()` for managing temporary state

				- In hot paths where futures are ready, continuations may be more efficient than coroutines

				- Chain futures with `.then()`, don't block with `.get()` (unless in `seastar::thread` context)

				- All I/O must be asynchronous (no blocking calls)

				- Use `seastar::gate` for shutdown coordination

				- Use `seastar::semaphore` for resource limiting (not `std::mutex`)

				- Break long loops with `maybe_yield()` to avoid reactor stalls

				## Coroutines

				```cpp

				seastar::future<T> func() {

				    auto result = co_await async_operation();

				    co_return result;

				}

				```

				## Error Handling

				- Throw exceptions for errors (futures propagate them automatically)

				- In data path: avoid exceptions, use `std::expected` (or `boost::outcome`) instead

				- Use standard exceptions (`std::runtime_error`, `std::invalid_argument`)

				- Database-specific: throw appropriate schema/query exceptions

				## Performance

				- Pass large objects by `const&` or `&&` (move semantics)

				- Use `std::string_view` for non-owning string references

				- Avoid copies: prefer move semantics

				- Use `utils::chunked_vector` instead of `std::vector` for large allocations (>128KB)

				- Minimize dynamic allocations in hot paths

				## Database-Specific Types

				- Use `schema_ptr` for schema references

				- Use `mutation` and `mutation_partition` for data modifications

				- Use `partition_key` and `clustering_key` for keys

				- Use `api::timestamp_type` for database timestamps

				- Use `gc_clock` for garbage collection timing

				## Style

				- C++23 standard (prefer modern features, especially coroutines)

				- Use `auto` when type is obvious from RHS

				- Avoid `auto` when it obscures the type

				- Use range-based for loops: `for (const auto& item : container)`

				- Use standard algorithms when they clearly simplify code (e.g., replacing 10-line loops)

				- Avoid chaining multiple algorithms if a straightforward loop is clearer

				- Mark functions and variables `const` whenever possible

				- Use scoped enums: `enum class` (not unscoped `enum`)

				## Headers

				- Use `#pragma once`

				- Include order: own header, C++ std, Seastar, Boost, project headers

				- Forward declare when possible

				- Never `using namespace` in headers (exception: `using namespace seastar` is globally available via `seastarx.hh`)

				## Documentation

				- Public APIs require clear documentation

				- Implementation details should be self-evident from code

				- Use `///` or Doxygen `/** */` for public documentation, `//` for implementation notes - follow the existing style

				## Naming

				- `snake_case` for most identifiers (classes, functions, variables, namespaces)

				- Template parameters: `CamelCase` (e.g., `template<typename ValueType>`)

				- Member variables: prefix with `_` (e.g., `int _count;`)

				- Structs (value-only): no `_` prefix on members

				- Constants and `constexpr`: `snake_case` (e.g., `static constexpr int max_size = 100;`)

				- Files: `.hh` for headers, `.cc` for source

				## Formatting

				- 4 spaces indentation, never tabs

				- Opening braces on same line as control structure (except namespaces)

				- Space after keywords: `if (`, `while (`, `return `

				- Whitespace around operators matches precedence: `*a + *b` not `* a+* b`

				- Line length: keep reasonable (<160 chars), use continuation lines with double indent if needed

				- Brace all nested scopes, even single statements

				- Minimal patches: only format code you modify, never reformat entire files

				## Logging

				- Use structured logging with appropriate levels: DEBUG, INFO, WARN, ERROR

				- Include context in log messages (e.g., request IDs)

				- Never log sensitive data (credentials, PII)

				## Forbidden

				- `malloc`/`free`

				- `printf` family (use logging or fmt)

				- Raw pointers for ownership

				- `using namespace` in headers

				- Blocking operations: `std::this_thread::sleep_for`, POSIX `read`, `std::mutex` (use Seastar equivalents)

				- `std::atomic` (reserved for very special circumstances only)

				- Macros (use `inline`, `constexpr`, or templates instead)

				## Testing

				When modifying existing code, follow TDD: create/update test first, then implement.

				- Examine existing tests for style and structure

				- Use Boost.Test framework

				- Use `SEASTAR_THREAD_TEST_CASE` for Seastar asynchronous tests

				- Aim for high code coverage, especially for new features and bug fixes

				- Maintain bisectability: all tests must pass in every commit. Mark failing tests with `BOOST_FAIL()` or similar, then fix in subsequent commit

									
										51

.github/instructions/python.instructions.md
									
										Normal file
												
				@@ -0,0 +1,51 @@

				---

				applyTo: "**/*.py"

				---

				# Python Guidelines

				**Important:** Match existing code style. Some directories (like `test/cqlpy` and `test/alternator`) prefer simplicity over type hints and docstrings.

				## Style

				- Follow PEP 8

				- Use type hints for function signatures (unless directory style omits them)

				- Use f-strings for formatting

				- Line length: 160 characters max

				- 4 spaces for indentation

				## Imports

				Order: standard library, third-party, local imports

				```python

				import os

				import sys

				import pytest

				from cassandra.cluster import Cluster

				from test.utils import setup_keyspace

				```

				Never use `from module import *`

				## Documentation

				All public functions/classes need docstrings (unless the current directory conventions omit them):

				```python

				def my_function(arg1: str, arg2: int) -> bool:

				    """

				    Brief summary of function purpose.

				    Args:

				        arg1: Description of first argument.

				        arg2: Description of second argument.

				    Returns:

				        Description of return value.

				    """

				    pass

				```

				## Testing Best Practices

				- Maintain bisectability: all tests must pass in every commit

				- Mark currently-failing tests with `@pytest.mark.xfail`, unmark when fixed

				- Use descriptive names that convey intent

				- Docstrings/comments should explain what the test verifies and why, and either which specific issue it reproduces or how it fits into the larger test suite

									
										55

.github/scripts/auto-backport.py
									
												
				@@ -47,13 +47,29 @@ def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr

				            draft=is_draft

				        )

				        logging.info(f"Pull request created: {backport_pr.html_url}")

				        labels_to_add = []

				        priority_labels = {"P0", "P1"}

				        parent_pr_labels = [label.name for label in pr.labels]

				        for label in priority_labels:

				            if label in parent_pr_labels:

				                labels_to_add.append(label)

				                labels_to_add.append("force_on_cloud")

				                logging.info(f"Adding {label} and force_on_cloud labels from parent PR to backport PR")

				                break  # Only apply the highest priority label

				        if is_collaborator:

				            backport_pr.add_to_assignees(pr.user)

				        if is_draft:

				            backport_pr.add_to_labels("conflicts")

				            labels_to_add.append("conflicts")

				            pr_comment = f"@{pr.user.login} - This PR was marked as draft because it has conflicts\n"

				            pr_comment += "Please resolve them and mark this PR as ready for review"

				            pr_comment += "Please resolve them and remove the 'conflicts' label. The PR will be made ready for review automatically."

				            backport_pr.create_issue_comment(pr_comment)

				        # Apply all labels at once if we have any

				        if labels_to_add:

				            backport_pr.add_to_labels(*labels_to_add)

				            logging.info(f"Added labels to backport PR: {labels_to_add}")

				        logging.info(f"Assigned PR to original author: {pr.user}")

				        return backport_pr

				    except GithubException as e:

				@@ -126,20 +142,31 @@ def backport(repo, pr, version, commits, backport_base_branch, is_collaborator):

				def with_github_keyword_prefix(repo, pr):

				    pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"

				    match = re.findall(pattern, pr.body, re.IGNORECASE)

				    if not match:

				        for commit in pr.get_commits():

				            match = re.findall(pattern, commit.commit.message, re.IGNORECASE)

				            if match:

				                print(f'{pr.number} has a valid close reference in commit message {commit.sha}')

				                break

				    if not match:

				        print(f'No valid close reference for {pr.number}')

				        return False

				    else:

				    # GitHub issue pattern: #123, scylladb/scylladb#123, or full GitHub URLs

				    github_pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"

				    # JIRA issue pattern: PKG-92 or https://scylladb.atlassian.net/browse/PKG-92

				    jira_pattern = r"(?:fix(?:|es|ed))\s*:?\s*(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"

				    # Check PR body for GitHub issues

				    github_match = re.findall(github_pattern, pr.body, re.IGNORECASE)

				    # Check PR body for JIRA issues

				    jira_match = re.findall(jira_pattern, pr.body, re.IGNORECASE)

				    match = github_match or jira_match

				    if match:

				        return True

				    for commit in pr.get_commits():

				        github_match = re.findall(github_pattern, commit.commit.message, re.IGNORECASE)

				        jira_match = re.findall(jira_pattern, commit.commit.message, re.IGNORECASE)

				        if github_match or jira_match:

				            print(f'{pr.number} has a valid close reference in commit message {commit.sha}')

				            return True

				    print(f'No valid close reference for {pr.number}')

				    return False

				def main():

				    args = parse_args()
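To see what the two patterns in `with_github_keyword_prefix` accept, the sketch below copies them from the diff into a standalone function. `REPO_FULL_NAME` stands in for `repo.full_name` and `find_close_references` is a hypothetical name; both are assumptions made so the snippet runs without the GitHub API.

```python
import re

# Assumption: stands in for repo.full_name from the PyGithub repository object.
REPO_FULL_NAME = "scylladb/scylladb"

# Patterns copied from the diff above.
GITHUB_PATTERN = (
    rf"(?:fix(?:|es|ed))\s*:?\s*"
    rf"(?:(?:(?:{REPO_FULL_NAME})?#)|https://github\.com/{REPO_FULL_NAME}/issues/)(\d+)"
)
JIRA_PATTERN = (
    r"(?:fix(?:|es|ed))\s*:?\s*"
    r"(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"
)


def find_close_references(text):
    """Return (github_issue_numbers, jira_keys) referenced by a fix keyword."""
    github_ids = re.findall(GITHUB_PATTERN, text, re.IGNORECASE)
    jira_keys = re.findall(JIRA_PATTERN, text, re.IGNORECASE)
    return github_ids, jira_keys
```

Because both searches use `re.IGNORECASE`, the `[A-Z]+` part of the JIRA pattern also accepts lowercase project keys, and a message that only says `Closes scylladb/scylladb#29215` matches neither pattern since it lacks a `fix`/`fixes`/`fixed` keyword.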

									
										20

.github/scripts/sync_labels.py
									
												
				@@ -30,8 +30,13 @@ def copy_labels_from_linked_issues(repo, pr_number):

				            try:

				                issue = repo.get_issue(int(issue_number))

				                for label in issue.labels:

				                    # Copy ALL labels from issues to PR when PR is opened

				                    pr.add_to_labels(label.name)

				                print(f"Labels from issue #{issue_number} copied to PR #{pr_number}")

				                    print(f"Copied label '{label.name}' from issue #{issue_number} to PR #{pr_number}")

				                    if label.name in ['P0', 'P1']:

				                        pr.add_to_labels('force_on_cloud')

				                        print(f"Added force_on_cloud label to PR #{pr_number} due to {label.name} label")

				                print(f"All labels from issue #{issue_number} copied to PR #{pr_number}")

				            except Exception as e:

				                print(f"Error processing issue #{issue_number}: {e}")

				@@ -74,9 +79,22 @@ def sync_labels(repo, number, label, action, is_issue=False):

				            target = repo.get_issue(int(pr_or_issue_number))

				        if action == 'labeled':

				            target.add_to_labels(label)

				            if label in ['P0', 'P1'] and is_issue:

				                # Only add force_on_cloud to PRs when P0/P1 is added to an issue

				                target.add_to_labels('force_on_cloud')

				                print(f"Added 'force_on_cloud' label to PR #{pr_or_issue_number} due to {label} label")

				            print(f"Label '{label}' successfully added.")

				        elif action == 'unlabeled':

				            target.remove_from_labels(label)

				            if label in ['P0', 'P1'] and is_issue:

				                # Check if any other P0/P1 labels remain before removing force_on_cloud

				                remaining_priority_labels = [l.name for l in target.labels if l.name in ['P0', 'P1']]

				                if not remaining_priority_labels:

				                    try:

				                        target.remove_from_labels('force_on_cloud')

				                        print(f"Removed 'force_on_cloud' label from PR #{pr_or_issue_number} as no P0/P1 labels remain")

				                    except Exception as e:

				                        print(f"Warning: Could not remove force_on_cloud label: {e}")

				            print(f"Label '{label}' successfully removed.")

				        elif action == 'opened':

				            copy_labels_from_linked_issues(repo, number)
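The `unlabeled` branch above removes `force_on_cloud` only when no `P0`/`P1` label is left. The same rule can be written as a pure function over label names, which makes it easy to check without GitHub access. This is a sketch under that assumption; `labels_after_unlabel` is a hypothetical name and does not appear in `sync_labels.py`.

```python
PRIORITY_LABELS = ("P0", "P1")


def labels_after_unlabel(current_labels, removed_label):
    """Return the label names left after `removed_label` is taken off.

    Mirrors the rule in the diff: `force_on_cloud` is dropped only when a
    priority label was removed and no other priority label remains.
    """
    remaining = [name for name in current_labels if name != removed_label]
    removed_priority = removed_label in PRIORITY_LABELS
    still_has_priority = any(name in PRIORITY_LABELS for name in remaining)
    if removed_priority and not still_has_priority:
        remaining = [name for name in remaining if name != "force_on_cloud"]
    return remaining
```

Keeping the decision separate from the API calls also avoids relying on whether the label list read from the API object is refreshed after a removal.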

									
										7

.github/workflows/add-label-when-promoted.yaml
									
												
				@@ -54,10 +54,13 @@ jobs:

				          GITHUB_TOKEN: ${{ secrets.AUTO_BACKPORT_TOKEN }}

				        run: python .github/scripts/auto-backport.py --repo ${{ github.repository }} --base-branch ${{ github.ref }} --commits ${{ github.event.before }}..${{ github.sha }}

				      - name: Check if a valid backport label exists and no backport_error

				        env:

				          LABELS_JSON: ${{ toJson(github.event.pull_request.labels) }}

				        id: check_label

				        run: |

				          labels_json='${{ toJson(github.event.pull_request.labels) }}'

				          echo "Checking labels: $(echo "$labels_json" | jq -r '.[].name')"

				          labels_json="$LABELS_JSON"

				          echo "Checking labels:"

				          echo "$labels_json" | jq -r '.[].name'

				          # Check if a valid backport label exists

				          if echo "$labels_json" | jq -e 'any(.[] | .name; test("backport/[0-9]+\\.[0-9]+$"))' > /dev/null; then
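For readers less familiar with `jq`, the label check in this step can be approximated in Python. The sketch assumes the labels arrive as a list of objects with a `name` field, as in the `LABELS_JSON` payload; `has_valid_backport_label` is a hypothetical name, and the label values in the usage note are made up for illustration.

```python
import re

# Same expression as the jq test(): an unanchored search for
# "backport/<digits>.<digits>" at the end of the label name.
BACKPORT_LABEL = re.compile(r"backport/[0-9]+\.[0-9]+$")


def has_valid_backport_label(labels):
    """Return True if any label name ends with backport/<major>.<minor>."""
    return any(BACKPORT_LABEL.search(label["name"]) is not None for label in labels)
```

With this rule, a label such as `backport/2025.1` qualifies, while `backport/none` or `backport/6.2-done` does not.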

									
										5

.github/workflows/backport-pr-fixes-validation.yaml
									
												
				@@ -8,6 +8,9 @@ on:

				jobs:

				  check-fixes-prefix:

				    runs-on: ubuntu-latest

				    permissions:

				      contents: read

				      issues: write

				    steps:

				      - name: Check PR body for "Fixes" prefix patterns

				        uses: actions/github-script@v7

				@@ -18,7 +21,7 @@ jobs:

				            // Regular expression pattern to check for "Fixes" prefix

				            // Adjusted to dynamically insert the repository full name

				            const pattern = `Fixes:? (?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)`;

				            const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|(?:https://scylladb\\.atlassian\\.net/browse/)?([A-Z]+-\\d+))`;

				            const regex = new RegExp(pattern);

				            if (!regex.test(body)) {

									
										53

.github/workflows/call_backport_with_jira.yaml
									
										Normal file
												
				@@ -0,0 +1,53 @@

				name: Backport with Jira Integration

				on:

				  push:

				    branches:

				      - master

				      - next-*.*

				      - branch-*.*

				  pull_request_target:

				    types: [labeled, closed]

				    branches: 

				      - master

				      - next

				      - next-*.*

				      - branch-*.*

				jobs:

				  backport-on-push:

				    if: github.event_name == 'push'

				    uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main

				    with:

				      event_type: 'push'

				      base_branch: ${{ github.ref }}

				      commits: ${{ github.event.before }}..${{ github.sha }}

				    secrets:

				      gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}

				      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

				  backport-on-label:

				    if: github.event_name == 'pull_request_target' && github.event.action == 'labeled'

				    uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main

				    with:

				      event_type: 'labeled'

				      base_branch: refs/heads/${{ github.event.pull_request.base.ref }}

				      pull_request_number: ${{ github.event.pull_request.number }}

				      head_commit: ${{ github.event.pull_request.base.sha }}

				      label_name: ${{ github.event.label.name }}

				      pr_state: ${{ github.event.pull_request.state }}

				    secrets:

				      gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}

				      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

				  backport-chain:

				    if: github.event_name == 'pull_request_target' && github.event.action == 'closed' && github.event.pull_request.merged == true

				    uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main

				    with:

				      event_type: 'chain'

				      base_branch: refs/heads/${{ github.event.pull_request.base.ref }}

				      pull_request_number: ${{ github.event.pull_request.number }}

				      pr_body: ${{ github.event.pull_request.body }}

				    secrets:

				      gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}

				      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

									
										11

.github/workflows/call_jira_status_in_progress.yml
									
											
				@@ -1,11 +0,0 @@

				name: Call Jira Status In Progress

				on:

				  pull_request:

				    types: [opened]

				jobs:

				  call-jira-status-in-progress:

				    uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_in_progress.yml@main

				    secrets:

				      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

									
										11

.github/workflows/call_jira_status_in_review.yml
									
											
				@@ -1,11 +0,0 @@

				name: Call Jira Status In Review

				on:

				  pull_request:

				    types: [ready_for_review, review_requested]

				jobs:

				  call-jira-status-in-review:

				    uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_in_review.yml@main

				    secrets:

				      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

									
										13

.github/workflows/call_jira_status_ready_for_merge.yml
									
											
				@@ -1,13 +0,0 @@

				name: Call Jira Status Ready For Merge

				on:

				  pull_request:

				    types: [labeled]

				jobs:

				  call-jira-status-update:

				    uses: scylladb/github-automation/.github/workflows/main_update_jira_status_to_ready_for_merge.yml@main

				    with:

				      label_name: 'status/merge_candidate'

				    secrets:

				      jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

									
										18

.github/workflows/call_jira_sync.yml
									
										Normal file
												
				@@ -0,0 +1,18 @@

				name: Sync Jira Based on PR Events

				on:

				  pull_request_target:

				    types: [opened, edited, ready_for_review, review_requested, labeled, unlabeled, closed]

				permissions:

				  contents: read

				  pull-requests: write

				  issues: write

				jobs:

				  jira-sync:

				    uses: scylladb/github-automation/.github/workflows/main_pr_events_jira_sync.yml@main

				    with:

				      caller_action: ${{ github.event.action }}

				    secrets:

				      caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

									
										22

.github/workflows/call_jira_sync_pr_milestone.yml
									
										Normal file
												
				@@ -0,0 +1,22 @@

				name: Sync Jira Based on PR Milestone Events

				on:

				  pull_request_target:

				    types: [milestoned, demilestoned]

				permissions:

				  contents: read

				  pull-requests: read

				jobs:

				  jira-sync-milestone-set:

				    if: github.event.action == 'milestoned'

				    uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_milestone_set.yml@main

				    secrets:

				      caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

				  jira-sync-milestone-removed:

				    if: github.event.action == 'demilestoned'

				    uses: scylladb/github-automation/.github/workflows/main_jira_sync_pr_milestone_removed.yml@main

				    secrets:

				      caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

									
										14

.github/workflows/call_sync_milestone_to_jira.yml
									
										Normal file
												
				@@ -0,0 +1,14 @@

				name: Call Jira release creation for new milestone

				on:

				  milestone:

				    types: [created, closed]

				jobs:

				  sync-milestone-to-jira:

				    uses: scylladb/github-automation/.github/workflows/main_sync_milestone_to_jira_release.yml@main

				    with:

				      # Comma-separated list of Jira project keys

				      jira_project_keys: "SCYLLADB,CUSTOMER,SMI,RELENG,VECTOR"

				    secrets:

				      caller_jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

									
										13

.github/workflows/call_validate_pr_author_email.yml
									
										Normal file
												
				@@ -0,0 +1,13 @@

				name: validate_pr_author_email

				on:

				  pull_request_target:

				    types:

				      - opened

				      - synchronize

				      - reopened

				jobs:

				  validate_pr_author_email:

				    uses: scylladb/github-automation/.github/workflows/validate_pr_author_email.yml@main

									
										62

.github/workflows/close_issue_for_scylla_associate.yml
									
										Normal file
												
				@@ -0,0 +1,62 @@

				name: Close issues created by Scylla associates

				on:

				  issues:

				    types: [opened, reopened]

				permissions:

				  issues: write

				jobs:

				  comment-and-close:

				    runs-on: ubuntu-latest

				    steps:

				      - name: Comment and close if author email is scylladb.com

				        uses: actions/github-script@v7

				        with:

				          github-token: ${{ secrets.GITHUB_TOKEN }}

				          script: |

				            const issue = context.payload.issue;

				            const actor = context.actor;

				            // Get user data (only public email is available)

				            const { data: user } = await github.rest.users.getByUsername({

				              username: actor,

				            });

				            const email = user.email || "";

				            console.log(`Actor: ${actor}, public email: ${email || "<none>"}`);

				            // Only continue if email exists and ends with @scylladb.com

				            if (!email || !email.toLowerCase().endsWith("@scylladb.com")) {

				              console.log("User is not a scylladb.com email (or email not public); skipping.");

				              return;

				            }

				            const owner = context.repo.owner;

				            const repo = context.repo.repo;

				            const issue_number = issue.number;

				            const body = "Issues in this repository are closed automatically. Scylla associates should use Jira to manage issues.\nPlease move this issue to Jira https://scylladb.atlassian.net/jira/software/c/projects/SCYLLADB/list";

				            // Add the comment

				            await github.rest.issues.createComment({

				              owner,

				              repo,

				              issue_number,

				              body,

				            });

				            console.log(`Comment added to #${issue_number}`);

				            // Close the issue

				            await github.rest.issues.update({

				              owner,

				              repo,

				              issue_number,

				              state: "closed",

				              state_reason: "not_planned"

				            });

				            console.log(`Issue #${issue_number} closed.`);

									
										2

.github/workflows/codespell.yaml
									
										vendored
									
				@@ -13,5 +13,5 @@ jobs:

				      - uses: codespell-project/actions-codespell@master

				        with:

				          only_warn: 1

				          ignore_words_list: "ans,datas,fo,ser,ue,crate,nd,reenable,strat,stap,te,raison"

				          ignore_words_list: "ans,datas,fo,ser,ue,crate,nd,reenable,strat,stap,te,raison,iif,tread"

				          skip: "./.git,./build,./tools,*.js,*.lock,./test,./licenses,./redis/lolwut.cc,*.svg"

									
										8

.github/workflows/docs-pages.yaml
									
										vendored
									
				@@ -18,6 +18,10 @@ on:

				jobs:

				  release:

				    permissions:

				      pages: write

				      id-token: write

				      contents: write

				    runs-on: ubuntu-latest

				    steps:

				      - name: Checkout

				@@ -29,7 +33,9 @@ jobs:

				      - name: Set up Python

				        uses: actions/setup-python@v5

				        with:

				          python-version: "3.10"

				          python-version: "3.12"

				      - name: Install uv

				        uses: astral-sh/setup-uv@v6

				      - name: Set up env

				        run: make -C docs FLAG="${{ env.FLAG }}" setupenv

				      - name: Build docs

									
										7

.github/workflows/docs-pr.yaml
									
										vendored
									
				@@ -2,6 +2,9 @@ name: "Docs / Build PR"

				# For more information,

				# see https://sphinx-theme.scylladb.com/stable/deployment/production.html#available-workflows

				permissions:

				  contents: read

				env:

				  FLAG: ${{ github.repository == 'scylladb/scylla-enterprise' && 'enterprise' || 'opensource' }}

				@@ -26,7 +29,9 @@ jobs:

				      - name: Set up Python

				        uses: actions/setup-python@v5

				        with:

				          python-version: "3.10"

				          python-version: "3.12"

				      - name: Install uv

				        uses: astral-sh/setup-uv@v6

				      - name: Set up env

				        run: make -C docs FLAG="${{ env.FLAG }}" setupenv

				      - name: Build docs

									
										37

.github/workflows/docs-validate-metrics.yml
									
										vendored
									
										Normal file
									
				@@ -0,0 +1,37 @@

				name: Docs / Validate metrics

				permissions:

				  contents: read

				on:

				  pull_request:

				    branches:

				      - master

				      - enterprise

				    paths:

				      - '**/*.cc'

				      - 'scripts/metrics-config.yml'

				      - 'scripts/get_description.py'

				      - 'docs/_ext/scylladb_metrics.py'

				jobs:

				  validate-metrics:

				    runs-on: ubuntu-latest

				    name: Check metrics documentation coverage

				    steps:

				    - name: Checkout code

				      uses: actions/checkout@v4

				      with:

				        submodules: true

				    - name: Set up Python

				      uses: actions/setup-python@v6

				      with:

				        python-version: '3.10'

				    - name: Install dependencies

				      run: pip install PyYAML

				    - name: Validate metrics

				      run: python3 scripts/get_description.py --validate -c scripts/metrics-config.yml

									
										5

.github/workflows/iwyu.yaml
									
										vendored
									
				@@ -14,7 +14,8 @@ env:

				  CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service

				  SEASTAR_BAD_INCLUDE_OUTPUT_PATH: build/seastar-bad-include.log

				permissions: {}

				permissions:

				  contents: read

				# cancel the in-progress run upon a repush

				concurrency:

				@@ -34,8 +35,6 @@ jobs:

				      - uses: actions/checkout@v4

				        with:

				          submodules: true

				      - run: |

				          sudo dnf -y install clang-tools-extra

				      - name: Generate compilation database

				        run: |

				          cmake                                         \

									
										2

.github/workflows/read-toolchain.yaml
									
										vendored
									
				@@ -10,6 +10,8 @@ on:

				jobs:

				  read-toolchain:

				    runs-on: ubuntu-latest

				    permissions:

				      contents: read

				    outputs:

				      image: ${{ steps.read.outputs.image }}

				    steps:

									
										4

.github/workflows/sync-labels.yaml
									
										vendored
									
				@@ -37,13 +37,13 @@ jobs:

				        run: python .github/scripts/sync_labels.py --repo ${{ github.repository }} --number ${{ github.event.number }} --action ${{ github.event.action }}

				      - name: Pull request labeled or unlabeled event

				        if: github.event_name == 'pull_request_target' && startsWith(github.event.label.name, 'backport/')

				        if: github.event_name == 'pull_request_target' && (startsWith(github.event.label.name, 'backport/') || github.event.label.name == 'P0' || github.event.label.name == 'P1')

				        env:

				          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

				        run: python .github/scripts/sync_labels.py --repo ${{ github.repository }} --number ${{ github.event.number }} --action ${{ github.event.action }} --label ${{ github.event.label.name }}

				      - name: Issue labeled or unlabeled event

				        if: github.event_name == 'issues' && startsWith(github.event.label.name, 'backport/')

				        if: github.event_name == 'issues' && (startsWith(github.event.label.name, 'backport/') || github.event.label.name == 'P0' || github.event.label.name == 'P1')

				        env:

				          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

				        run: python .github/scripts/sync_labels.py --repo ${{ github.repository }} --number ${{ github.event.issue.number }} --action ${{ github.event.action }} --is_issue --label ${{ github.event.label.name }}

									
										66

.github/workflows/trigger-scylla-ci.yaml
									
										vendored
									
										Normal file
									
				@@ -0,0 +1,66 @@

				name: Trigger Scylla CI Route

				permissions:

				  contents: read

				on:

				  issue_comment:

				    types: [created]

				  pull_request_target:

				    types:

				      - unlabeled

				jobs:

				  trigger-jenkins:

				    if: (github.event_name == 'issue_comment' && github.event.comment.user.login != 'scylladbbot') || github.event.label.name == 'conflicts'

				    runs-on: ubuntu-latest

				    steps:

				      - name: Verify Org Membership

				        id: verify_author

				        env:

				          EVENT_NAME: ${{ github.event_name }}

				          PR_AUTHOR: ${{ github.event.pull_request.user.login }}

				          PR_ASSOCIATION: ${{ github.event.pull_request.author_association }}

				          COMMENT_AUTHOR: ${{ github.event.comment.user.login }}

				          COMMENT_ASSOCIATION: ${{ github.event.comment.author_association }}

				        shell: bash

				        run: |

				          if [[ "$EVENT_NAME" == "pull_request_target" ]]; then

				            AUTHOR="$PR_AUTHOR"

				            ASSOCIATION="$PR_ASSOCIATION"

				          else

				            AUTHOR="$COMMENT_AUTHOR"

				            ASSOCIATION="$COMMENT_ASSOCIATION"

				          fi

				          if [[ "$ASSOCIATION" == "MEMBER" || "$ASSOCIATION" == "OWNER" ]]; then

				            echo "member=true" >> $GITHUB_OUTPUT

				          else

				            echo "::warning::${AUTHOR} is not a member of scylladb (association: ${ASSOCIATION}); skipping CI trigger."

				            echo "member=false" >> $GITHUB_OUTPUT

				          fi

				      - name: Validate Comment Trigger

				        if: github.event_name == 'issue_comment'

				        id: verify_comment

				        env:

				          COMMENT_BODY: ${{ github.event.comment.body }}

				        shell: bash

				        run: |

				          CLEAN_BODY=$(echo "$COMMENT_BODY" | grep -v '^[[:space:]]*>')

				          if echo "$CLEAN_BODY" | grep -qi '@scylladbbot' && echo "$CLEAN_BODY" | grep -qi 'trigger-ci'; then

				            echo "trigger=true" >> $GITHUB_OUTPUT

				          else

				            echo "trigger=false" >> $GITHUB_OUTPUT

				          fi

				      - name: Trigger Scylla-CI-Route Jenkins Job

				        if: steps.verify_author.outputs.member == 'true' && (github.event_name == 'pull_request_target' || steps.verify_comment.outputs.trigger == 'true')

				        env:

				          JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}

				          JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}

				          JENKINS_URL: "https://jenkins.scylladb.com"

				          PR_NUMBER: "${{ github.event.issue.number || github.event.pull_request.number }}"

				          PR_REPO_NAME: "${{ github.event.repository.full_name }}"

				        run: |

				          curl -X POST "$JENKINS_URL/job/releng/job/Scylla-CI-Route/buildWithParameters?PR_NUMBER=$PR_NUMBER&PR_REPO_NAME=$PR_REPO_NAME" \

				            --user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail
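The "Validate Comment Trigger" step above drops quoted lines (those starting with `>`) from the comment body and then requires both `@scylladbbot` and `trigger-ci` to appear, case-insensitively. A minimal Python sketch of the same check, for reasoning about which comments would fire the job; the function name `should_trigger` is hypothetical and not part of the workflow:

```python
import re

# Mirrors grep -v '^[[:space:]]*>': a line is "quoted" when its first
# non-whitespace character is '>'.
QUOTED_LINE = re.compile(r"^\s*>")


def should_trigger(comment_body: str) -> bool:
    """Return True when an unquoted part of the comment asks for a CI run."""
    unquoted = [
        line for line in comment_body.splitlines()
        if not QUOTED_LINE.match(line)
    ]
    # grep -qi is case-insensitive, so compare on a lowercased copy.
    clean_body = "\n".join(unquoted).lower()
    return "@scylladbbot" in clean_body and "trigger-ci" in clean_body
```

Under this reading, a reply that only quotes an earlier `@scylladbbot trigger-ci` request does not start a new run, while the two keywords may appear on different unquoted lines of the same comment.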

									
										242

.github/workflows/trigger_ci.yaml
									
										vendored
									
										Normal file
									
				@@ -0,0 +1,242 @@

				name: Trigger next gating

				on:

				  pull_request_target:

				    types: [opened, reopened, synchronize]

				  issue_comment:

				    types: [created]

				jobs:

				  trigger-ci:

				    runs-on: ubuntu-latest

				    steps:

				      - name: Dump GitHub context

				        env:

				          GITHUB_CONTEXT: ${{ toJson(github) }}

				        run: echo "$GITHUB_CONTEXT"

				      - name: Checkout PR code

				        uses: actions/checkout@v3

				        with:

				          fetch-depth: 0  # Needed to access full history

				          ref: ${{ github.event.pull_request.head.ref }}

				      - name: Fetch before commit if needed

				        run: |

				          if ! git cat-file -e ${{ github.event.before }} 2>/dev/null; then

				            echo "Fetching before commit ${{ github.event.before }}"

				            git fetch --depth=1 origin ${{ github.event.before }}

				          fi

				      - name: Compare commits for file changes

				        if: github.action == 'synchronize'

				        env:

				          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

				        run: |

				          echo "Base: ${{ github.event.before }}"

				          echo "Head: ${{ github.event.after }}"

				          TREE_BEFORE=$(git show -s --format=%T ${{ github.event.before }})

				          TREE_AFTER=$(git show -s --format=%T ${{ github.event.after }})

				          echo "TREE_BEFORE=$TREE_BEFORE" >> $GITHUB_ENV

				          echo "TREE_AFTER=$TREE_AFTER" >> $GITHUB_ENV

				      - name: Check if last push has file changes

				        run: |

				          if [[ "${{ env.TREE_BEFORE }}" == "${{ env.TREE_AFTER }}" ]]; then

				            echo "No file changes detected in the last push, only commit message edit."

				            echo "has_file_changes=false" >> $GITHUB_ENV

				          else

				            echo "File changes detected in the last push."

				            echo "has_file_changes=true" >> $GITHUB_ENV

				          fi

				      - name: Rule 1 - Check PR draft or conflict status

				        run: |

				          # Check if PR is in draft mode

				          IS_DRAFT="${{ github.event.pull_request.draft }}"

				          # Check if PR has 'conflict' label

				          HAS_CONFLICT_LABEL="false"

				          LABELS='${{ toJson(github.event.pull_request.labels) }}'

				          if echo "$LABELS" | jq -r '.[].name' | grep -q "^conflict$"; then

				            HAS_CONFLICT_LABEL="true"

				          fi

				          # Set draft_or_conflict variable

				          if [[ "$IS_DRAFT" == "true" || "$HAS_CONFLICT_LABEL" == "true" ]]; then

				            echo "draft_or_conflict=true" >> $GITHUB_ENV

				            echo "✅ Rule 1: PR is in draft mode or has conflict label - setting draft_or_conflict=true"

				          else

				            echo "draft_or_conflict=false" >> $GITHUB_ENV

				            echo "✅ Rule 1: PR is ready and has no conflict label - setting draft_or_conflict=false"

				          fi

				          echo "Draft status: $IS_DRAFT"

				          echo "Has conflict label: $HAS_CONFLICT_LABEL"

				          echo "Result: draft_or_conflict = $draft_or_conflict"

				      - name: Rule 2 - Check labels

				        run: |

				          # Check if PR has P0 or P1 labels

				          HAS_P0_P1_LABEL="false"

				          LABELS='${{ toJson(github.event.pull_request.labels) }}'

				          if echo "$LABELS" | jq -r '.[].name' | grep -E "^(P0|P1)$" > /dev/null; then

				            HAS_P0_P1_LABEL="true"

				          fi

				          # Check if PR already has force_on_cloud label

				          echo "HAS_FORCE_ON_CLOUD_LABEL=false" >> $GITHUB_ENV

				          if echo "$LABELS" | jq -r '.[].name' | grep -q "^force_on_cloud$"; then

				            HAS_FORCE_ON_CLOUD_LABEL="true"

				            echo "HAS_FORCE_ON_CLOUD_LABEL=true" >> $GITHUB_ENV

				          fi

				          echo "Has P0/P1 label: $HAS_P0_P1_LABEL"

				          echo "Has force_on_cloud label: $HAS_FORCE_ON_CLOUD_LABEL"

				          # Add force_on_cloud label if PR has P0/P1 and doesn't already have force_on_cloud

				          if [[ "$HAS_P0_P1_LABEL" == "true" && "$HAS_FORCE_ON_CLOUD_LABEL" == "false" ]]; then

				            echo "✅ Rule 2: PR has P0 or P1 label - adding force_on_cloud label"

				            curl -X POST \

				              -H "Authorization: token ${{ secrets.GITHUB_TOKEN }}" \

				              -H "Accept: application/vnd.github.v3+json" \

				              "https://api.github.com/repos/${{ github.repository }}/issues/${{ github.event.pull_request.number }}/labels" \

				              -d '{"labels":["force_on_cloud"]}'

				          elif [[ "$HAS_P0_P1_LABEL" == "true" && "$HAS_FORCE_ON_CLOUD_LABEL" == "true" ]]; then

				            echo "✅ Rule 2: PR has P0 or P1 label and already has force_on_cloud label - no action needed"

				          else

				            echo "✅ Rule 2: PR does not have P0 or P1 label - no force_on_cloud label needed"

				          fi

				          SKIP_UNIT_TEST_CUSTOM="false"

				          if echo "$LABELS" | jq -r '.[].name' | grep -q "^ci/skip_unit-tests_custom$"; then

				            SKIP_UNIT_TEST_CUSTOM="true"

				          fi

				          echo "SKIP_UNIT_TEST_CUSTOM=$SKIP_UNIT_TEST_CUSTOM" >> $GITHUB_ENV

				      - name: Rule 3 - Analyze changed files and set build requirements

				        run: |

				          # Get list of changed files

				          CHANGED_FILES=$(git diff --name-only ${{ github.event.pull_request.base.sha }} ${{ github.event.pull_request.head.sha }})

				          echo "Changed files:"

				          echo "$CHANGED_FILES"

				          echo ""

				          # Initialize all requirements to false

				          REQUIRE_BUILD="false"

				          REQUIRE_DTEST="false"

				          REQUIRE_UNITTEST="false"

				          REQUIRE_ARTIFACTS="false"

				          REQUIRE_SCYLLA_GDB="false"

				          # Check each file against patterns

				          while IFS= read -r file; do

				            if [[ -n "$file" ]]; then

				              echo "Checking file: $file"

				              # Build pattern: ^(?!scripts\/pull_github_pr.sh).*$

				              # Everything except scripts/pull_github_pr.sh

				              if [[ "$file" != "scripts/pull_github_pr.sh" ]]; then

				                REQUIRE_BUILD="true"

				                echo "  ✓ Matches build pattern"

				              fi

				              # Dtest pattern: ^(?!test(.py|\/)|dist\/docker\/|dist\/common\/scripts\/).*$

				              # Everything except test files, dist/docker/, dist/common/scripts/

				              if [[ ! "$file" =~ ^test\.(py|/).*$ ]] && [[ ! "$file" =~ ^dist/docker/.*$ ]] && [[ ! "$file" =~ ^dist/common/scripts/.*$ ]]; then

				                REQUIRE_DTEST="true"

				                echo "  ✓ Matches dtest pattern"

				              fi

				              # Unittest pattern: ^(?!dist\/docker\/|dist\/common\/scripts).*$

				              # Everything except dist/docker/, dist/common/scripts/

				              if [[ ! "$file" =~ ^dist/docker/.*$ ]] && [[ ! "$file" =~ ^dist/common/scripts.*$ ]]; then

				                REQUIRE_UNITTEST="true"

				                echo "  ✓ Matches unittest pattern"

				              fi

				              # Artifacts pattern: ^(?:dist|tools\/toolchain).*$

				              # Files starting with dist or tools/toolchain

				              if [[ "$file" =~ ^dist.*$ ]] || [[ "$file" =~ ^tools/toolchain.*$ ]]; then

				                REQUIRE_ARTIFACTS="true"

				                echo "  ✓ Matches artifacts pattern"

				              fi

				              # Scylla GDB pattern: ^(scylla-gdb.py).*$

				              # Files starting with scylla-gdb.py

				              if [[ "$file" =~ ^scylla-gdb\.py.*$ ]]; then

				                REQUIRE_SCYLLA_GDB="true"

				                echo "  ✓ Matches scylla_gdb pattern"

				              fi

				            fi

				          done <<< "$CHANGED_FILES"

				          # Set environment variables

				          echo "requireBuild=$REQUIRE_BUILD" >> $GITHUB_ENV

				          echo "requireDtest=$REQUIRE_DTEST" >> $GITHUB_ENV

				          echo "requireUnittest=$REQUIRE_UNITTEST" >> $GITHUB_ENV

				          echo "requireArtifacts=$REQUIRE_ARTIFACTS" >> $GITHUB_ENV

				          echo "requireScyllaGdb=$REQUIRE_SCYLLA_GDB" >> $GITHUB_ENV

				          echo ""

				          echo "✅ Rule 3: File analysis complete"

				          echo "Build required: $REQUIRE_BUILD"

				          echo "Dtest required: $REQUIRE_DTEST"

				          echo "Unittest required: $REQUIRE_UNITTEST"

				          echo "Artifacts required: $REQUIRE_ARTIFACTS"

				          echo "Scylla GDB required: $REQUIRE_SCYLLA_GDB"

				      - name: Determine Jenkins Job Name

				        run: |

				          if [[ "${{ github.ref_name }}" == "next" ]]; then

				            FOLDER_NAME="scylla-master"

				          elif [[ "${{ github.ref_name }}" == "next-enterprise" ]]; then

				            FOLDER_NAME="scylla-enterprise"

				          else

				            VERSION=$(echo "${{ github.ref_name }}" | awk -F'-' '{print $2}')

				            if [[ "$VERSION" =~ ^202[0-4]\.[0-9]+$ ]]; then

				              FOLDER_NAME="enterprise-$VERSION"

				            elif [[ "$VERSION" =~ ^[0-9]+\.[0-9]+$ ]]; then

				              FOLDER_NAME="scylla-$VERSION"

				            fi

				          fi

				          echo "JOB_NAME=${FOLDER_NAME}/job/scylla-ci" >> $GITHUB_ENV

				      - name: Trigger Jenkins Job

				        if: env.draft_or_conflict == 'false' && env.has_file_changes == 'true' && github.action == 'opened' || github.action == 'reopened'

				        env:

				          JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}

				          JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}

				          JENKINS_URL: "https://jenkins.scylladb.com"

				          SLACK_BOT_TOKEN: ${{ secrets.SLACK_BOT_TOKEN }}

				        run: |

				          PR_NUMBER=${{ github.event.issue.number }}

				          PR_REPO_NAME=${{ github.event.repository.full_name }}

				          echo "Triggering Jenkins Job: $JOB_NAME"

				          curl -X POST \

				            "$JENKINS_URL/job/$JOB_NAME/buildWithParameters? \

				            PR_NUMBER=$PR_NUMBER& \

				            RUN_DTEST=$REQUIRE_DTEST& \

				            RUN_ONLY_SCYLLA_GDB=$REQUIRE_SCYLLA_GDB& \

				            RUN_UNIT_TEST=$REQUIRE_UNITTEST& \

				            FORCE_ON_CLOUD=$HAS_FORCE_ON_CLOUD_LABEL& \

				            SKIP_UNIT_TEST_CUSTOM=$SKIP_UNIT_TEST_CUSTOM& \

				            RUN_ARTIFACT_TESTS=$REQUIRE_ARTIFACTS" \

				            --fail \

				            --user "$JENKINS_USER:$JENKINS_API_TOKEN" \

				            -i -v

				  trigger-ci-via-comment:

				    if: github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')

				    runs-on: ubuntu-latest

				    steps:

				      - name: Trigger Scylla-CI Jenkins Job

				        env:

				          JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}

				          JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}

				          JENKINS_URL: "https://jenkins.scylladb.com"

				        run: |

				          PR_NUMBER=${{ github.event.issue.number }}

				          PR_REPO_NAME=${{ github.event.repository.full_name }}

				          curl -X POST "$JENKINS_URL/job/$JOB_NAME/buildWithParameters?PR_NUMBER=$PR_NUMBER" \

				          --user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail -i -v
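The "Rule 3" step in the workflow above documents each requirement as a regular expression in a comment and then re-implements it with bash `[[ =~ ]]` tests. As a rough sketch, the commented patterns can be applied directly in Python, whose `re` module supports the negative lookaheads they use. The names `PATTERNS` and `classify` are hypothetical; the patterns are copied from the workflow comments, not from the bash conditions.

```python
import re

# Patterns taken verbatim from the comments in the "Rule 3" step.
PATTERNS = {
    "requireBuild": re.compile(r"^(?!scripts\/pull_github_pr.sh).*$"),
    "requireDtest": re.compile(
        r"^(?!test(.py|\/)|dist\/docker\/|dist\/common\/scripts\/).*$"
    ),
    "requireUnittest": re.compile(r"^(?!dist\/docker\/|dist\/common\/scripts).*$"),
    "requireArtifacts": re.compile(r"^(?:dist|tools\/toolchain).*$"),
    "requireScyllaGdb": re.compile(r"^(scylla-gdb.py).*$"),
}


def classify(changed_files):
    """Return which requirements are switched on by a list of changed paths."""
    # Every requirement starts as False, as in the workflow.
    result = {name: False for name in PATTERNS}
    for path in changed_files:
        if not path:
            continue  # the workflow also skips empty lines
        for name, pattern in PATTERNS.items():
            if pattern.match(path):
                result[name] = True
    return result
```

With these patterns, a change confined to `test/` sets `requireBuild` and `requireUnittest` but leaves `requireDtest` off. The bash condition `^test\.(py|/).*$` appears to match only paths beginning with `test.py` or `test./`, so the shell step may still request dtest for such a change; comparing the two on a sample path is a quick way to check.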

									
										3

.github/workflows/trigger_jenkins.yaml
									
										vendored
									
				@@ -1,5 +1,8 @@

				name: Trigger next gating

				permissions:

				  contents: read

				on:

				  push:

				    branches:

									
										66

CMakeLists.txt
									
				@@ -49,7 +49,7 @@ include(limit_jobs)

				set(CMAKE_CXX_STANDARD "23" CACHE INTERNAL "")

				set(CMAKE_CXX_EXTENSIONS ON CACHE INTERNAL "")

				set(CMAKE_CXX_SCAN_FOR_MODULES OFF CACHE INTERNAL "")

				set(CMAKE_CXX_VISIBILITY_PRESET hidden)

				set(CMAKE_VISIBILITY_INLINES_HIDDEN ON)

				if(is_multi_config)

				    find_package(Seastar)

				@@ -90,13 +90,13 @@ if(is_multi_config)

				    add_dependencies(Seastar::seastar_testing Seastar)

				else()

				    set(Seastar_TESTING ON CACHE BOOL "" FORCE)

				    set(Seastar_API_LEVEL 7 CACHE STRING "" FORCE)

				    set(Seastar_API_LEVEL 9 CACHE STRING "" FORCE)

				    set(Seastar_DEPRECATED_OSTREAM_FORMATTERS OFF CACHE BOOL "" FORCE)

				    set(Seastar_APPS ON CACHE BOOL "" FORCE)

				    set(Seastar_EXCLUDE_APPS_FROM_ALL ON CACHE BOOL "" FORCE)

				    set(Seastar_EXCLUDE_TESTS_FROM_ALL ON CACHE BOOL "" FORCE)

				    set(Seastar_IO_URING ON CACHE BOOL "" FORCE)

				    set(Seastar_SCHEDULING_GROUPS_COUNT 19 CACHE STRING "" FORCE)

				    set(Seastar_SCHEDULING_GROUPS_COUNT 21 CACHE STRING "" FORCE)

				    set(Seastar_UNUSED_RESULT_ERROR ON CACHE BOOL "" FORCE)

				    add_subdirectory(seastar)

				    target_compile_definitions (seastar

				@@ -116,6 +116,7 @@ list(APPEND absl_cxx_flags

				if(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")

				    list(APPEND ABSL_GCC_FLAGS ${absl_cxx_flags})

				elseif(CMAKE_CXX_COMPILER_ID STREQUAL "Clang")

				    list(APPEND absl_cxx_flags "-Wno-deprecated-builtins")

				    list(APPEND ABSL_LLVM_FLAGS ${absl_cxx_flags})

				endif()

				set(ABSL_DEFAULT_LINKOPTS

				@@ -163,37 +164,66 @@ file(MAKE_DIRECTORY "${scylla_gen_build_dir}")

				include(add_version_library)

				generate_scylla_version()

				option(Scylla_USE_PRECOMPILED_HEADER "Use precompiled header for Scylla" ON)

				add_library(scylla-precompiled-header STATIC exported_templates.cc)

				target_link_libraries(scylla-precompiled-header PRIVATE

				    absl::headers

				    absl::btree

				    absl::hash

				    absl::raw_hash_set

				    Seastar::seastar

				    Snappy::snappy

				    systemd

				    ZLIB::ZLIB

				    lz4::lz4_static

				    zstd::zstd_static)

				if (Scylla_USE_PRECOMPILED_HEADER)

				  set(Scylla_USE_PRECOMPILED_HEADER_USE ON)

				  find_program(DISTCC_EXEC NAMES distcc OPTIONAL)

				  if (DISTCC_EXEC)

				    if(DEFINED ENV{DISTCC_HOSTS})

				      set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)

				      message(STATUS "Disabling precompiled header usage because distcc exists and DISTCC_HOSTS is set, assuming you're using distributed compilation.")

				    else()

				      file(REAL_PATH "~/.distcc/hosts" DIST_CC_HOSTS_PATH EXPAND_TILDE)

				      if (EXISTS ${DIST_CC_HOSTS_PATH})

				        set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)

				        message(STATUS "Disabling precompiled header usage because distcc and ~/.distcc/hosts exists, assuming you're using distributed compilation.")

				      endif()

				    endif()

				  endif()

				  if (Scylla_USE_PRECOMPILED_HEADER_USE)

				    message(STATUS "Using precompiled header for Scylla - remember to add `sloppiness = pch_defines,time_macros` to ccache.conf, if you're using ccache.")

				    target_precompile_headers(scylla-precompiled-header PRIVATE "stdafx.hh")

				    target_compile_definitions(scylla-precompiled-header PRIVATE SCYLLA_USE_PRECOMPILED_HEADER)

				  endif()

				else()

				  set(Scylla_USE_PRECOMPILED_HEADER_USE OFF)

				endif()

				add_library(scylla-main STATIC)

				target_sources(scylla-main

				  PRIVATE

				    absl-flat_hash_map.cc

				    bytes.cc

				    client_data.cc

				    clocks-impl.cc

				    collection_mutation.cc

				    converting_mutation_partition_applier.cc

				    counters.cc

				    sstable_dict_autotrainer.cc

				    duration.cc

				    exceptions/exceptions.cc

				    frozen_schema.cc

				    generic_server.cc

				    debug.cc

				    init.cc

				    keys/keys.cc

				    multishard_mutation_query.cc

				    mutation_query.cc

				    node_ops/task_manager_module.cc

				    partition_slice_builder.cc

				    querier.cc

				    query.cc

				    query/query.cc

				    query_ranges_to_vnodes.cc

				    query-result-set.cc

				    query/query-result-set.cc

				    tombstone_gc_options.cc

				    tombstone_gc.cc

				    reader_concurrency_semaphore.cc

				    reader_concurrency_semaphore_group.cc

				    schema_mutations.cc

				    serializer.cc

				    service/direct_failure_detector/failure_detector.cc

				    sstables_loader.cc

				@@ -217,6 +247,7 @@ target_link_libraries(scylla-main

				    ZLIB::ZLIB

				    lz4::lz4_static

				    zstd::zstd_static

				    scylla-precompiled-header

				)

				option(Scylla_CHECK_HEADERS

				@@ -269,9 +300,7 @@ add_subdirectory(locator)

				add_subdirectory(message)

				add_subdirectory(mutation)

				add_subdirectory(mutation_writer)

				add_subdirectory(node_ops)

				add_subdirectory(readers)

				add_subdirectory(redis)

				add_subdirectory(replica)

				add_subdirectory(raft)

				add_subdirectory(repair)

				@@ -286,6 +315,7 @@ add_subdirectory(tracing)

				add_subdirectory(transport)

				add_subdirectory(types)

				add_subdirectory(utils)

				add_subdirectory(vector_search)

				add_version_library(scylla_version

				    release.cc)

				@@ -315,7 +345,6 @@ set(scylla_libs

				    mutation_writer

				    raft

				    readers

				    redis

				    repair

				    replica

				    schema

				@@ -328,7 +357,8 @@ set(scylla_libs

				    tracing

				    transport

				    types

				    utils)

				    utils

				    vector_search)

				target_link_libraries(scylla PRIVATE

				    ${scylla_libs})

									
										2

CONTRIBUTING.md
									
				@@ -12,7 +12,7 @@ Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to re

				## Contributing code to Scylla

				Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).

				Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form to cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).

				If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).

				The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.

									
										8

HACKING.md
									
				@@ -43,7 +43,7 @@ $ ./tools/toolchain/dbuild ninja build/release/scylla

				$ ./tools/toolchain/dbuild ./build/release/scylla --developer-mode 1

				```

				Note: do not mix environemtns - either perform all your work with dbuild, or natively on the host.

				Note: do not mix environments - either perform all your work with dbuild, or natively on the host.

				Note2: you can get to an interactive shell within dbuild by running it without any parameters:

				```bash

				$ ./tools/toolchain/dbuild

				@@ -91,7 +91,7 @@ You can also specify a single mode. For example

				$ ninja-build release

				```

				Will build everytihng in release mode. The valid modes are

				Will build everything in release mode. The valid modes are

				* Debug: Enables [AddressSanitizer](https://github.com/google/sanitizers/wiki/AddressSanitizer)

				  and other sanity checks. It has no optimizations, which allows for debugging with tools like

				@@ -361,7 +361,7 @@ avoid that the gold linker can be told to create an index with

				More info at https://gcc.gnu.org/wiki/DebugFission.

				Both options can be enable by passing `--split-dwarf` to configure.py.

				Both options can be enabled by passing `--split-dwarf` to configure.py.

				Note that distcc is *not* compatible with it, but icecream

				(https://github.com/icecc/icecream) is.

				@@ -370,7 +370,7 @@ Note that distcc is *not* compatible with it, but icecream

				Sometimes Scylla development is closely tied with a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.

				One way to do this it to create a local remote for the Seastar submodule in the Scylla repository:

				One way to do this is to create a local remote for the Seastar submodule in the Scylla repository:

				```bash

				$ cd $HOME/src/scylla

									
										4

README.md
									
												
				@@ -18,7 +18,7 @@ Scylla is fairly fussy about its build environment, requiring very recent

				versions of the C++23 compiler and of many libraries to build. The document

				[HACKING.md](HACKING.md) includes detailed information on building and

				developing Scylla, but to get Scylla building quickly on (almost) any build

				machine, Scylla offers a [frozen toolchain](tools/toolchain/README.md),

				machine, Scylla offers a [frozen toolchain](tools/toolchain/README.md).

				This is a pre-configured Docker image which includes recent versions of all

				the required compilers, libraries and build tools. Using the frozen toolchain

				allows you to avoid changing anything in your build machine to meet Scylla's

				@@ -43,7 +43,7 @@ For further information, please see:

				[developer documentation]: HACKING.md

				[build documentation]: docs/dev/building.md

				[docker image build documentation]: dist/docker/debian/README.md

				[docker image build documentation]: dist/docker/redhat/README.md

				## Running Scylla

2

SCYLLA-VERSION-GEN


@@ -78,7 +78,7 @@ fi
 # Default scylla product/version tags
 PRODUCT=scylla
 VERSION=2025.4.0-dev
 VERSION=2026.2.0-dev
 if test -f version
 then

									
										5

alternator/CMakeLists.txt
									
												
				@@ -17,6 +17,8 @@ target_sources(alternator

				    streams.cc

				    consumed_capacity.cc

				    ttl.cc

				    parsed_expression_cache.cc

				    http_compression.cc

				    ${cql_grammar_srcs})

				target_include_directories(alternator

				  PUBLIC

				@@ -33,5 +35,8 @@ target_link_libraries(alternator

				    idl

				    absl::headers)

				if (Scylla_USE_PRECOMPILED_HEADER_USE)

				  target_precompile_headers(alternator REUSE_FROM scylla-precompiled-header)

				endif()

				check_headers(check-headers alternator

				  GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

									
										10

alternator/auth.cc
									
												
				@@ -11,10 +11,10 @@

				#include "utils/log.hh"

				#include <string>

				#include <string_view>

				#include "bytes.hh"

				#include "alternator/auth.hh"

				#include <fmt/format.h>

				#include "auth/password_authenticator.hh"

				#include "db/consistency_level_type.hh"

				#include "db/system_keyspace.hh"

				#include "service/storage_proxy.hh"

				#include "alternator/executor.hh"

				#include "cql3/selection/selection.hh"

				@@ -26,8 +26,8 @@ namespace alternator {

				static logging::logger alogger("alternator-auth");

				future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::service& as, std::string username) {

				    schema_ptr schema = proxy.data_dictionary().find_schema(auth::get_auth_ks_name(as.query_processor()), "roles");

				future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username) {

				    schema_ptr schema = proxy.data_dictionary().find_schema(db::system_keyspace::NAME, "roles");

				    partition_key pk = partition_key::from_single_value(*schema, utf8_type->decompose(username));

				    dht::partition_range_vector partition_ranges{dht::partition_range(dht::decorate_key(*schema, pk))};

				    std::vector<query::clustering_range> bounds{query::clustering_range::make_open_ended_both_sides()};

				@@ -40,7 +40,7 @@ future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::serv

				    auto partition_slice = query::partition_slice(std::move(bounds), {}, query::column_id_vector{salted_hash_col->id, can_login_col->id}, selection->get_query_options());

				    auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice,

				            proxy.get_max_result_size(partition_slice), query::tombstone_limit(proxy.get_tombstone_limit()));

				    auto cl = auth::password_authenticator::consistency_for_user(username);

				    auto cl = db::consistency_level::LOCAL_ONE;

				    service::client_state client_state{service::client_state::internal_tag()};

				    service::storage_proxy::coordinator_query_result qr = co_await proxy.query(schema, std::move(command), std::move(partition_ranges), cl,

									
										2

alternator/auth.hh
									
												
				@@ -20,6 +20,6 @@ namespace alternator {

				using key_cache = utils::loading_cache<std::string, std::string, 1>;

				future<std::string> get_key_from_roles(service::storage_proxy& proxy, auth::service& as, std::string username);

				future<std::string> get_key_from_roles(service::storage_proxy& proxy, std::string username);

				}

									
										14

alternator/conditions.cc
									
												
				@@ -42,7 +42,7 @@ comparison_operator_type get_comparison_operator(const rjson::value& comparison_

				    if (!comparison_operator.IsString()) {

				        throw api_error::validation(fmt::format("Invalid comparison operator definition {}", rjson::print(comparison_operator)));

				    }

				    std::string op = comparison_operator.GetString();

				    std::string op = rjson::to_string(comparison_operator);

				    auto it = ops.find(op);

				    if (it == ops.end()) {

				        throw api_error::validation(fmt::format("Unsupported comparison operator {}", op));

				@@ -377,8 +377,8 @@ bool check_compare(const rjson::value* v1, const rjson::value& v2, const Compara

				        return cmp(unwrap_number(*v1, cmp.diagnostic), unwrap_number(v2, cmp.diagnostic));

				    }

				    if (kv1.name == "S") {

				        return cmp(std::string_view(kv1.value.GetString(), kv1.value.GetStringLength()),

				                   std::string_view(kv2.value.GetString(), kv2.value.GetStringLength()));

				        return cmp(rjson::to_string_view(kv1.value),

				                   rjson::to_string_view(kv2.value));

				    }

				    if (kv1.name == "B") {

				        auto d_kv1 = unwrap_bytes(kv1.value, v1_from_query);

				@@ -470,9 +470,9 @@ static bool check_BETWEEN(const rjson::value* v, const rjson::value& lb, const r

				        return check_BETWEEN(unwrap_number(*v, diag), unwrap_number(lb, diag), unwrap_number(ub, diag), bounds_from_query);

				    }

				    if (kv_v.name == "S") {

				        return check_BETWEEN(std::string_view(kv_v.value.GetString(), kv_v.value.GetStringLength()),

				                             std::string_view(kv_lb.value.GetString(), kv_lb.value.GetStringLength()),

				                             std::string_view(kv_ub.value.GetString(), kv_ub.value.GetStringLength()),

				        return check_BETWEEN(rjson::to_string_view(kv_v.value),

				                             rjson::to_string_view(kv_lb.value),

				                             rjson::to_string_view(kv_ub.value),

				                             bounds_from_query);

				    }

				    if (kv_v.name == "B") {

				@@ -618,7 +618,7 @@ conditional_operator_type get_conditional_operator(const rjson::value& req) {

				// Check if the existing values of the item (previous_item) match the

				// conditions given by the Expected and ConditionalOperator parameters

				// (if they exist) in the request (an UpdateItem, PutItem or DeleteItem).

				// This function can throw an ValidationException API error if there

				// This function can throw a ValidationException API error if there

				// are errors in the format of the condition itself.

				bool verify_expected(const rjson::value& req, const rjson::value* previous_item) {

				    const rjson::value* expected = rjson::find(req, "Expected");

									
										8

alternator/consumed_capacity.cc
									
												
				@@ -8,6 +8,8 @@

				#include "consumed_capacity.hh"

				#include "error.hh"

				#include "utils/rjson.hh"

				#include <fmt/format.h>

				namespace alternator {

				@@ -32,18 +34,18 @@ bool consumed_capacity_counter::should_add_capacity(const rjson::value& request)

				    if (!return_consumed->IsString()) {

				        throw api_error::validation("Non-string ReturnConsumedCapacity field in request");

				    }

				    std::string consumed = return_consumed->GetString();

				    std::string_view consumed = rjson::to_string_view(*return_consumed);

				    if (consumed == "INDEXES") {

				        throw api_error::validation("INDEXES consumed capacity is not supported");

				    }

				    if (consumed != "TOTAL") {

				        throw api_error::validation("Unknown consumed capacity "+ consumed);

				        throw api_error::validation(fmt::format("Unknown consumed capacity {}", consumed));

				    }

				    return true;

				}

				void consumed_capacity_counter::add_consumed_capacity_to_response_if_needed(rjson::value& response) const noexcept {

				    if (_should_add_to_reponse) {

				    if (_should_add_to_response) {

				        auto consumption = rjson::empty_object();

				        rjson::add(consumption, "CapacityUnits", get_consumed_capacity_units());

				        rjson::add(response, "ConsumedCapacity", std::move(consumption));
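The hunk above changes `consumed` from `std::string` to `std::string_view`, which is presumably why the error message is now built with `fmt::format`: a string literal and a `std::string_view` cannot be joined with `+`. A minimal standard-library sketch of the same validation follows. `validation_error` and the `std::optional<std::string_view>` parameter are hypothetical stand-ins for `api_error::validation` and the `rjson::value` lookup, and returning `false` for a missing field is an assumption, since that part of the function is not shown in the diff.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>
#include <string_view>

// Hypothetical stand-in for api_error::validation in the diff.
struct validation_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical simplification: the request field is modelled as an optional
// string_view instead of an rjson::value.
bool should_add_capacity(std::optional<std::string_view> return_consumed) {
    if (!return_consumed) {
        return false;  // assumption: a missing field means "do not report"
    }
    std::string_view consumed = *return_consumed;
    if (consumed == "INDEXES") {
        throw validation_error("INDEXES consumed capacity is not supported");
    }
    if (consumed != "TOTAL") {
        // "literal" + string_view does not compile, so append explicitly.
        std::string msg = "Unknown consumed capacity ";
        msg.append(consumed);
        throw validation_error(msg);
    }
    return true;
}
```

Comparing a `std::string_view` against a literal avoids the copy that the old `std::string` version made on every request.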

									
										6

alternator/consumed_capacity.hh
									
												
				@@ -28,9 +28,9 @@ namespace alternator {

				class consumed_capacity_counter {

				public:

				    consumed_capacity_counter() = default;

				    consumed_capacity_counter(bool should_add_to_reponse) : _should_add_to_reponse(should_add_to_reponse){}

				    consumed_capacity_counter(bool should_add_to_response) : _should_add_to_response(should_add_to_response){}

				    bool operator()() const noexcept {

				        return _should_add_to_reponse;

				        return _should_add_to_response;

				    }

				    consumed_capacity_counter& operator +=(uint64_t bytes);

				@@ -44,7 +44,7 @@ public:

				    uint64_t _total_bytes = 0;

				    static bool should_add_capacity(const rjson::value& request);

				protected:

				    bool _should_add_to_reponse = false;

				    bool _should_add_to_response = false;

				};

				class rcu_consumed_capacity_counter : public consumed_capacity_counter {

									
										51

alternator/controller.cc
									
												
				@@ -28,6 +28,7 @@ static logging::logger logger("alternator_controller");

				controller::controller(

				        sharded<gms::gossiper>& gossiper,

				        sharded<service::storage_proxy>& proxy,

				        sharded<service::storage_service>& ss,

				        sharded<service::migration_manager>& mm,

				        sharded<db::system_distributed_keyspace>& sys_dist_ks,

				        sharded<cdc::generation_service>& cdc_gen_svc,

				@@ -39,6 +40,7 @@ controller::controller(

				    : protocol_server(sg)

				    , _gossiper(gossiper)

				    , _proxy(proxy)

				    , _ss(ss)

				    , _mm(mm)

				    , _sys_dist_ks(sys_dist_ks)

				    , _cdc_gen_svc(cdc_gen_svc)

				@@ -89,7 +91,7 @@ future<> controller::start_server() {

				        auto get_timeout_in_ms = [] (const db::config& cfg) -> utils::updateable_value<uint32_t> {

				            return cfg.alternator_timeout_in_ms;

				        };

				        _executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_mm), std::ref(_sys_dist_ks),

				        _executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_ss), std::ref(_mm), std::ref(_sys_dist_ks),

				                        sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value(),

				                        sharded_parameter(get_timeout_in_ms, std::ref(_config))).get();

				        _server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller)).get();

				@@ -103,11 +105,23 @@ future<> controller::start_server() {

				            alternator_port = _config.alternator_port();

				            _listen_addresses.push_back({addr, *alternator_port});

				        }

				        std::optional<uint16_t> alternator_port_proxy_protocol;

				        if (_config.alternator_port_proxy_protocol()) {

				            alternator_port_proxy_protocol = _config.alternator_port_proxy_protocol();

				            _listen_addresses.push_back({addr, *alternator_port_proxy_protocol});

				        }

				        std::optional<uint16_t> alternator_https_port;

				        std::optional<uint16_t> alternator_https_port_proxy_protocol;

				        std::optional<tls::credentials_builder> creds;

				        if (_config.alternator_https_port()) {

				            alternator_https_port = _config.alternator_https_port();

				            _listen_addresses.push_back({addr, *alternator_https_port});

				        if (_config.alternator_https_port() || _config.alternator_https_port_proxy_protocol()) {

				            if (_config.alternator_https_port()) {

				                alternator_https_port = _config.alternator_https_port();

				                _listen_addresses.push_back({addr, *alternator_https_port});

				            }

				            if (_config.alternator_https_port_proxy_protocol()) {

				                alternator_https_port_proxy_protocol = _config.alternator_https_port_proxy_protocol();

				                _listen_addresses.push_back({addr, *alternator_https_port_proxy_protocol});

				            }

				            creds.emplace();

				            auto opts = _config.alternator_encryption_options();

				            if (opts.empty()) {

				@@ -133,18 +147,29 @@ future<> controller::start_server() {

				            }

				        }

				        _server.invoke_on_all(

				                [this, addr, alternator_port, alternator_https_port, creds = std::move(creds)] (server& server) mutable {

				            return server.init(addr, alternator_port, alternator_https_port, creds,

				                [this, addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol, creds = std::move(creds)] (server& server) mutable {

				            return server.init(addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol, creds,

				                    _config.alternator_enforce_authorization,

				                    _config.alternator_warn_authorization,

				                    _config.alternator_max_users_query_size_in_trace_output,

				                    &_memory_limiter.local().get_semaphore(),

				                    _config.max_concurrent_requests_per_shard);

				        }).handle_exception([this, addr, alternator_port, alternator_https_port] (std::exception_ptr ep) {

				            logger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}: {}",

				                    addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF", ep);

				        }).handle_exception([this, addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol] (std::exception_ptr ep) {

				            logger.error("Failed to set up Alternator HTTP server on {} port {}, TLS port {}, proxy-protocol port {}, TLS proxy-protocol port {}: {}",

				                    addr,

				                    alternator_port ? std::to_string(*alternator_port) : "OFF",

				                    alternator_https_port ? std::to_string(*alternator_https_port) : "OFF",

				                    alternator_port_proxy_protocol ? std::to_string(*alternator_port_proxy_protocol) : "OFF",

				                    alternator_https_port_proxy_protocol ? std::to_string(*alternator_https_port_proxy_protocol) : "OFF",

				                    ep);

				            return stop_server().then([ep = std::move(ep)] { return make_exception_future<>(ep); });

				        }).then([addr, alternator_port, alternator_https_port] {

				            logger.info("Alternator server listening on {}, HTTP port {}, HTTPS port {}",

				                    addr, alternator_port ? std::to_string(*alternator_port) : "OFF", alternator_https_port ? std::to_string(*alternator_https_port) : "OFF");

				        }).then([addr, alternator_port, alternator_https_port, alternator_port_proxy_protocol, alternator_https_port_proxy_protocol] {

				            logger.info("Alternator server listening on {}, HTTP port {}, HTTPS port {}, proxy-protocol port {}, TLS proxy-protocol port {}",

				                    addr,

				                    alternator_port ? std::to_string(*alternator_port) : "OFF",

				                    alternator_https_port ? std::to_string(*alternator_https_port) : "OFF",

				                    alternator_port_proxy_protocol ? std::to_string(*alternator_port_proxy_protocol) : "OFF",

				                    alternator_https_port_proxy_protocol ? std::to_string(*alternator_https_port_proxy_protocol) : "OFF");

				        }).get();

				    });

				}

				@@ -167,7 +192,7 @@ future<> controller::request_stop_server() {

				    });

				}

				future<utils::chunked_vector<client_data>> controller::get_client_data() {

				future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> controller::get_client_data() {

				    return _server.local().get_client_data();

				}
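With four optional ports, the `port ? std::to_string(*port) : "OFF"` expression now appears eight times in the log calls above. The sketch below shows one way the repetition could be factored out. `port_or_off` and `describe_ports` are hypothetical helpers and are not part of this change.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical helper, not part of the diff: renders a configured port as
// its decimal value and an unset port as "OFF".
std::string port_or_off(const std::optional<std::uint16_t>& port) {
    return port ? std::to_string(*port) : std::string("OFF");
}

// Hypothetical example of composing a log message from the helper.
std::string describe_ports(const std::optional<std::uint16_t>& http,
                           const std::optional<std::uint16_t>& https) {
    return "HTTP port " + port_or_off(http) + ", HTTPS port " + port_or_off(https);
}
```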

									
										7

alternator/controller.hh
									
												
				@@ -11,10 +11,11 @@

				#include <seastar/core/sharded.hh>

				#include <seastar/core/smp.hh>

				#include "protocol_server.hh"

				#include "transport/protocol_server.hh"

				namespace service {

				class storage_proxy;

				class storage_service;

				class migration_manager;

				class memory_limiter;

				}

				@@ -57,6 +58,7 @@ class server;

				class controller : public protocol_server {

				    sharded<gms::gossiper>& _gossiper;

				    sharded<service::storage_proxy>& _proxy;

				    sharded<service::storage_service>& _ss;

				    sharded<service::migration_manager>& _mm;

				    sharded<db::system_distributed_keyspace>& _sys_dist_ks;

				    sharded<cdc::generation_service>& _cdc_gen_svc;

				@@ -74,6 +76,7 @@ public:

				    controller(

				        sharded<gms::gossiper>& gossiper,

				        sharded<service::storage_proxy>& proxy,

				        sharded<service::storage_service>& ss,

				        sharded<service::migration_manager>& mm,

				        sharded<db::system_distributed_keyspace>& sys_dist_ks,

				        sharded<cdc::generation_service>& cdc_gen_svc,

				@@ -93,7 +96,7 @@ public:

				    // This virtual function is called (on each shard separately) when the

				    // virtual table "system.clients" is read. It is expected to generate a

				    // list of clients connected to this server (on this shard).

				    virtual future<utils::chunked_vector<client_data>> get_client_data() override;

				    virtual future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> get_client_data() override;

				};

				}

									
										3

alternator/error.hh
									
												
				@@ -94,6 +94,9 @@ public:

				    static api_error internal(std::string msg) {

				        return api_error("InternalServerError", std::move(msg), http::reply::status_type::internal_server_error);

				    }

				    static api_error payload_too_large(std::string msg) {

				        return api_error("PayloadTooLarge", std::move(msg), status_type::payload_too_large);

				    }

				    // Provide the "std::exception" interface, to make it easier to print this

				    // exception in log messages. Note that this function is *not* used to

1678

alternator/executor.cc


File diff suppressed because it is too large

									
										53

alternator/executor.hh
									
												
				@@ -17,11 +17,13 @@

				#include "service/client_state.hh"

				#include "service_permit.hh"

				#include "db/timeout_clock.hh"

				#include "db/config.hh"

				#include "alternator/error.hh"

				#include "stats.hh"

				#include "utils/rjson.hh"

				#include "utils/updateable_value.hh"

				#include "utils/simple_value_with_expiry.hh"

				#include "tracing/trace_state.hh"

				@@ -40,6 +42,8 @@ namespace cql3::selection {

				namespace service {

				    class storage_proxy;

				    class cas_shard;

				    class storage_service;

				}

				namespace cdc {

				@@ -56,11 +60,9 @@ class schema_builder;

				namespace alternator {

				enum class table_status;

				class rmw_operation;

				namespace parsed {

				class path;

				};

				class put_or_delete_item;

				schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);

				bool is_alternator_keyspace(const sstring& ks_name);

				@@ -132,18 +134,30 @@ using attrs_to_get_node = attribute_path_map_node<std::monostate>;

				// optional means we should get all attributes, not specific ones.

				using attrs_to_get = attribute_path_map<std::monostate>;

				namespace parsed {

				class expression_cache;

				}

				class executor : public peering_sharded_service<executor> {

				    gms::gossiper& _gossiper;

				    service::storage_service& _ss;

				    service::storage_proxy& _proxy;

				    service::migration_manager& _mm;

				    db::system_distributed_keyspace& _sdks;

				    cdc::metadata& _cdc_metadata;

				    utils::updateable_value<bool> _enforce_authorization;

				    utils::updateable_value<bool> _warn_authorization;

				    // An smp_service_group to be used for limiting the concurrency when

				    // forwarding Alternator request between shards - if necessary for LWT.

				    smp_service_group _ssg;

				    std::unique_ptr<parsed::expression_cache> _parsed_expression_cache;

				    struct describe_table_info_manager;

				    std::unique_ptr<describe_table_info_manager> _describe_table_info_manager;

				    future<> cache_newly_calculated_size_on_all_shards(schema_ptr schema, std::uint64_t size_in_bytes, std::chrono::nanoseconds ttl);

				    future<> fill_table_size(rjson::value &table_description, schema_ptr schema, bool deleting);

				public:

				    using client_state = service::client_state;

				    // request_return_type is the return type of the executor methods, which

				@@ -169,11 +183,13 @@ public:

				    executor(gms::gossiper& gossiper,

				             service::storage_proxy& proxy,

				             service::storage_service& ss,

				             service::migration_manager& mm,

				             db::system_distributed_keyspace& sdks,

				             cdc::metadata& cdc_metadata,

				             smp_service_group ssg,

				             utils::updateable_value<uint32_t> default_timeout_in_ms);

				    ~executor();

				    future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);

				    future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);

				@@ -201,11 +217,7 @@ public:

				    future<request_return_type> describe_continuous_backups(client_state& client_state, service_permit permit, rjson::value request);

				    future<> start();

				    future<> stop() {

				        // disconnect from the value source, but keep the value unchanged.

				        s_default_timeout_in_ms = utils::updateable_value<uint32_t>{s_default_timeout_in_ms()};

				        return make_ready_future<>();

				    }

				    future<> stop();

				    static sstring table_name(const schema&);

				    static db::timeout_clock::time_point default_timeout();

				@@ -218,10 +230,22 @@ public:

				private:

				    friend class rmw_operation;

				    static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr);

				    static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr, const std::map<sstring, sstring> *tags = nullptr);

				    future<rjson::value> fill_table_description(schema_ptr schema, table_status tbl_status, service::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit);

				    future<executor::request_return_type> create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, bool enforce_authorization, bool warn_authorization, const db::tablets_mode_t::mode tablets_mode);

				    future<> do_batch_write(

				        std::vector<std::pair<schema_ptr, put_or_delete_item>> mutation_builders,

				        service::client_state& client_state,

				        tracing::trace_state_ptr trace_state,

				        service_permit permit);

				    future<> cas_write(schema_ptr schema, service::cas_shard cas_shard, const dht::decorated_key& dk,

				        const std::vector<put_or_delete_item>& mutation_builders, service::client_state& client_state,

				        tracing::trace_state_ptr trace_state, service_permit permit);

				public:

				    static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&);

				    static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&, const std::map<sstring, sstring> *tags = nullptr);

				    static std::optional<rjson::value> describe_single_item(schema_ptr,

				        const query::partition_slice&,

				@@ -230,12 +254,15 @@ public:

				        const std::optional<attrs_to_get>&,

				        uint64_t* = nullptr);

				    // Converts a multi-row selection result to JSON compatible with DynamoDB.

				    // For each row, this method calls item_callback, which takes the size of

				    // the item as the parameter.

				    static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,

				        const query::partition_slice&& slice,

				        shared_ptr<cql3::selection::selection> selection,

				        foreign_ptr<lw_shared_ptr<query::result>> query_result,

				        shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,

				        uint64_t& rcu_half_units);

				        noncopyable_function<void(uint64_t)> item_callback = {});

				    static void describe_single_item(const cql3::selection::selection&,

				        const std::vector<managed_bytes_opt>&,

				@@ -263,7 +290,7 @@ bool is_big(const rjson::value& val, int big_size = 100'000);

				// Check CQL's Role-Based Access Control (RBAC) permission (MODIFY,

				// SELECT, DROP, etc.) on the given table. When permission is denied an

				// appropriate user-readable api_error::access_denied is thrown.

				future<> verify_permission(bool enforce_authorization, const service::client_state&, const schema_ptr&, auth::permission);

				future<> verify_permission(bool enforce_authorization, bool warn_authorization, const service::client_state&, const schema_ptr&, auth::permission, alternator::stats& stats);

				/**

				 * Make return type for serializing the object "streamed",

29

alternator/expressions.g


@@ -91,6 +91,18 @@ options {
         throw expressions_syntax_error(format("{} at char {}", err,
             ex->get_charPositionInLine()));
     }
     // ANTLR3 tries to recover missing tokens - it tries to finish parsing
     // and create valid objects, as if the missing token was there.
     // But it has a bug and leaks these tokens.
     // We override offending method and handle abandoned pointers.
     std::vector<std::unique_ptr<TokenType>> _missing_tokens;
     TokenType* getMissingSymbol(IntStreamType* istream, ExceptionBaseType* e,
                                 ANTLR_UINT32 expectedTokenType, BitsetListType* follow) {
         auto token = BaseType::getMissingSymbol(istream, e, expectedTokenType, follow);
         _missing_tokens.emplace_back(token);
         return token;
     }
 }
 @lexer::context {
     void displayRecognitionError(ANTLR_UINT8** token_names, ExceptionBaseType* ex) {
@@ -184,7 +196,13 @@ path_component: NAME | NAMEREF;
 path returns [parsed::path p]:
     root=path_component           { $p.set_root($root.text); }
     (   '.' name=path_component   { $p.add_dot($name.text); }
       | '[' INTEGER ']'           { $p.add_index(std::stoi($INTEGER.text)); }
       | '[' INTEGER ']'           {
                 try {
                     $p.add_index(std::stoi($INTEGER.text));
                 } catch(std::out_of_range&) {
                     throw expressions_syntax_error("list index out of integer range");
                 }
             }
     )*;
 /* See comment above why the "depth" counter was needed here */
@@ -230,7 +248,7 @@ update_expression_clause returns [parsed::update_expression e]:
 // Note the "EOF" token at the end of the update expression. We want to the
 //  parser to match the entire string given to it - not just its beginning!
 update_expression returns [parsed::update_expression e]:
     (update_expression_clause { e.append($update_expression_clause.e); })* EOF;
     (update_expression_clause { e.append($update_expression_clause.e); })+ EOF;
 projection_expression returns [std::vector<parsed::path> v]:
     p=path      { $v.push_back(std::move($p.p)); }
@@ -257,6 +275,13 @@ primitive_condition returns [parsed::primitive_condition c]:
          (',' v=value[0] { $c.add_value(std::move($v.v)); })*
          ')'
       )?
       {
           // Post-parse check to reject non-function single values
           if ($c._op == parsed::primitive_condition::type::VALUE &&
               !$c._values.front().is_func()) {
               throw expressions_syntax_error("Single value must be a function");
           }
       }
     ;
 // The following rules for parsing boolean expressions are verbose and
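The `path` rule change above wraps `std::stoi` so that an oversized list index surfaces as a syntax error rather than an unhandled `std::out_of_range`. A standalone sketch of that conversion follows. `syntax_error` is a hypothetical stand-in for `expressions_syntax_error`, and the input is assumed to be a digit-only token, as the grammar's `INTEGER` would be.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for expressions_syntax_error in the diff.
struct syntax_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Converts the text of an INTEGER token into a list index.
// The lexer is assumed to have accepted only digit sequences, so the only
// failure handled here is a value that does not fit in an int.
int parse_list_index(const std::string& digits) {
    try {
        return std::stoi(digits);
    } catch (const std::out_of_range&) {
        throw syntax_error("list index out of integer range");
    }
}
```

`std::stoi` also throws `std::invalid_argument` for non-numeric input. The sketch leaves that case alone, on the assumption that the lexer has already excluded it.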

									
										22

alternator/expressions.hh
									
												
				@@ -18,6 +18,8 @@

				#include "expressions_types.hh"

				#include "utils/rjson.hh"

				#include "utils/updateable_value.hh"

				#include "stats.hh"

				namespace alternator {

				@@ -26,6 +28,26 @@ public:

				    using runtime_error::runtime_error;

				};

				namespace parsed {

				class expression_cache_impl;

				class expression_cache {

				    std::unique_ptr<expression_cache_impl> _impl;

				public:

				    struct config {

				        utils::updateable_value<uint32_t> max_cache_entries;

				    };

				    expression_cache(config cfg, stats& stats);

				    ~expression_cache();

				    // stop background tasks, if any

				    future<> stop();

				    update_expression parse_update_expression(std::string_view query);

				    std::vector<path> parse_projection_expression(std::string_view query);

				    condition_expression parse_condition_expression(std::string_view query, const char* caller);

				};

				} // namespace parsed

				// Preferably use parsed::expression_cache instance instead of this free functions.

				parsed::update_expression parse_update_expression(std::string_view query);

				std::vector<parsed::path> parse_projection_expression(std::string_view query);

				parsed::condition_expression parse_condition_expression(std::string_view query, const char* caller);
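
The expression_cache declaration above keeps its implementation behind a std::unique_ptr to a forward-declared class. The following is a minimal sketch of that idiom, not the patch's code: the names cache, cache_impl and parse are hypothetical, and the "parser" simply returns the length of the query. The point it illustrates is that the destructor is declared in the header part and defined only where the impl type is complete.

```cpp
#include <cstddef>
#include <memory>
#include <string_view>

namespace sketch {

// --- what would live in the header ---
class cache_impl;  // forward declaration only; layout is hidden from users

class cache {
    std::unique_ptr<cache_impl> _impl;
public:
    cache();
    ~cache();  // declared here, defined below where cache_impl is complete
    std::size_t parse(std::string_view query);
};

// --- what would live in the .cc file ---
class cache_impl {
public:
    std::size_t calls = 0;  // stand-in for the real cached state
};

cache::cache() : _impl(std::make_unique<cache_impl>()) {}

// Defaulting the destructor here (not in the class body) lets unique_ptr
// see the complete cache_impl type when it has to delete it.
cache::~cache() = default;

std::size_t cache::parse(std::string_view query) {
    ++_impl->calls;
    return query.size();  // placeholder for real parsing
}

} // namespace sketch
```

A caller only needs the header part, for example `sketch::cache c; c.parse("a.b");`.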

									
										10

alternator/expressions_types.hh
									
												
				@@ -50,7 +50,7 @@ public:

				        _operators.emplace_back(i);

				        check_depth_limit();

				    }

				    void add_dot(std::string(name)) {

				    void add_dot(std::string name) {

				        _operators.emplace_back(std::move(name));

				        check_depth_limit();

				    }

				@@ -85,7 +85,7 @@ struct constant {

				    }

				};

				// "value" is is a value used in the right hand side of an assignment

				// "value" is a value used in the right hand side of an assignment

				// expression, "SET a = ...". It can be a constant (a reference to a value

				// included in the request, e.g., ":val"), a path to an attribute from the

				// existing item (e.g., "a.b[3].c"), or a function of other such values.

				@@ -205,13 +205,11 @@ public:

				// The supported primitive conditions are:

				// 1. Binary operators - v1 OP v2, where OP is =, <>, <, <=, >, or >= and

				//    v1 and v2 are values - from the item (an attribute path), the query

				//    (a ":val" reference), or a function of the the above (only the size()

				//    (a ":val" reference), or a function of the above (only the size()

				//    function is supported).

				// 2. Ternary operator - v1 BETWEEN v2 and v3 (means v1 >= v2 AND v1 <= v3).

				// 3. N-ary operator - v1 IN ( v2, v3, ... )

				// 4. A single function call (attribute_exists etc.). The parser actually

				//    accepts a more general "value" here but later stages reject a value

				//    which is not a function call (because DynamoDB does it too).

				// 4. A single function call (attribute_exists etc.).

				class primitive_condition {

				public:

				    enum class type {

									
										2

alternator/extract_from_attrs.hh
									
												
				@@ -13,7 +13,7 @@

				#include "utils/rjson.hh"

				#include "serialization.hh"

				#include "column_computation.hh"

				#include "schema/column_computation.hh"

				#include "db/view/regular_column_transformation.hh"

				namespace alternator {

									
										301

alternator/http_compression.cc
									
										Normal file
									
												
				@@ -0,0 +1,301 @@

				/*

				 * Copyright 2025-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#include "alternator/http_compression.hh"

				#include "alternator/server.hh"

				#include <seastar/coroutine/maybe_yield.hh>

				#include <zlib.h>

				static logging::logger slogger("alternator-http-compression");

				namespace alternator {

				static constexpr size_t compressed_buffer_size = 1024;

				class zlib_compressor {

				    z_stream _zs;

				    temporary_buffer<char> _output_buf;

				    noncopyable_function<future<>(temporary_buffer<char>&&)> _write_func;

				public:

				    zlib_compressor(bool gzip, int compression_level, noncopyable_function<future<>(temporary_buffer<char>&&)> write_func)

				     : _write_func(std::move(write_func)) {

				        memset(&_zs, 0, sizeof(_zs));

				        if (deflateInit2(&_zs, std::clamp(compression_level, Z_NO_COMPRESSION, Z_BEST_COMPRESSION), Z_DEFLATED,

				                (gzip ? 16 : 0) + MAX_WBITS, 8, Z_DEFAULT_STRATEGY) != Z_OK) {

				            // Should only happen if memory allocation fails

				            throw std::bad_alloc();

				        }

				    }

				    ~zlib_compressor() {

				        deflateEnd(&_zs);

				    }

				    future<> close() {

				        return compress(nullptr, 0, true);

				    }

				    future<> compress(const char* buf, size_t len, bool is_last_chunk = false) {

				        _zs.next_in = reinterpret_cast<unsigned char*>(const_cast<char*>(buf));

				        _zs.avail_in = (uInt) len;

				        int mode = is_last_chunk ? Z_FINISH : Z_NO_FLUSH;

				        while(_zs.avail_in > 0 || is_last_chunk) {

				            co_await coroutine::maybe_yield();

				            if (_output_buf.empty()) {

				                if (is_last_chunk) {

				                    uint32_t max_buffer_size = 0;

				                    deflatePending(&_zs, &max_buffer_size, nullptr);

				                    max_buffer_size += deflateBound(&_zs, _zs.avail_in) + 1;

				                    _output_buf = temporary_buffer<char>(std::min(compressed_buffer_size, (size_t) max_buffer_size));

				                } else {

				                    _output_buf = temporary_buffer<char>(compressed_buffer_size);

				                }

				                _zs.next_out = reinterpret_cast<unsigned char*>(_output_buf.get_write());

				                _zs.avail_out = compressed_buffer_size;

				            }

				            int e = deflate(&_zs, mode);

				            if (e < Z_OK) {

				                throw api_error::internal("Error during compression of response body");

				            }

				            if (e == Z_STREAM_END || _zs.avail_out < compressed_buffer_size / 4) {

				                _output_buf.trim(compressed_buffer_size - _zs.avail_out);

				                co_await _write_func(std::move(_output_buf));

				                if (e == Z_STREAM_END) {

				                    break;

				                }

				            }

				        }

				    }

				};

				// Helper string_view functions for parsing Accept-Encoding header

				struct case_insensitive_cmp_sv {

				    bool operator()(std::string_view s1, std::string_view s2) const {

				        return std::equal(s1.begin(), s1.end(), s2.begin(), s2.end(),

				            [](char a, char b) { return ::tolower(a) == ::tolower(b); });

				    }

				};

				static inline std::string_view trim_left(std::string_view sv) {

				    while (!sv.empty() && std::isspace(static_cast<unsigned char>(sv.front())))

				        sv.remove_prefix(1);

				    return sv;

				}

				static inline std::string_view trim_right(std::string_view sv) {

				    while (!sv.empty() && std::isspace(static_cast<unsigned char>(sv.back())))

				        sv.remove_suffix(1);

				    return sv;

				}

				static inline std::string_view trim(std::string_view sv) {

				    return trim_left(trim_right(sv));

				}

				inline std::vector<std::string_view> split(std::string_view text, char separator) {

				    std::vector<std::string_view> tokens;

				    if (text == "") {

				        return tokens;

				    }

				    while (true) {

				        auto pos = text.find_first_of(separator);

				        if (pos != std::string_view::npos) {

				            tokens.emplace_back(text.data(), pos);

				            text.remove_prefix(pos + 1);

				        } else {

				            tokens.emplace_back(text);

				            break;

				        }

				    }

				    return tokens;

				}

				constexpr response_compressor::compression_type response_compressor::get_compression_type(std::string_view encoding) {

				    for (size_t i = 0; i < static_cast<size_t>(compression_type::count); ++i) {

				        if (case_insensitive_cmp_sv{}(encoding, compression_names[i])) {

				            return static_cast<compression_type>(i);

				        }

				    }

				    return compression_type::unknown;

				}

				response_compressor::compression_type response_compressor::find_compression(std::string_view accept_encoding, size_t response_size) {

				    std::optional<float> ct_q[static_cast<size_t>(compression_type::count)];

				    ct_q[static_cast<size_t>(compression_type::none)] = std::numeric_limits<float>::min(); // enabled, but lowest priority

				    compression_type selected_ct = compression_type::none;

				    std::vector<std::string_view> entries = split(accept_encoding, ',');

				    for (auto& e : entries) {

				        std::vector<std::string_view> params = split(e, ';');

				        if (params.size() == 0) {

				            continue;

				        }

				        compression_type ct = get_compression_type(trim(params[0]));

				        if (ct == compression_type::unknown) {

				            continue; // ignore unknown encoding types

				        }

				        if (ct_q[static_cast<size_t>(ct)].has_value() && ct_q[static_cast<size_t>(ct)] != 0.0f) {

				            continue; // already processed this encoding

				        }

				        if (response_size < _threshold[static_cast<size_t>(ct)]) {

				            continue; // below threshold treat as unknown

				        }

				        for (size_t i = 1; i < params.size(); ++i) { // find "q=" parameter

				            auto pos = params[i].find("q=");

				            if (pos == std::string_view::npos) {

				                continue;

				            }

				            std::string_view param = params[i].substr(pos + 2);

				            param = trim(param);

				            // parse quality value

				            float q_value = 1.0f;

				            auto [ptr, ec] = std::from_chars(param.data(), param.data() + param.size(), q_value);

				            if (ec != std::errc() || ptr != param.data() + param.size()) {

				                continue;

				            }

				            if (q_value < 0.0) {

				                q_value = 0.0;

				            } else if (q_value > 1.0) {

				                q_value = 1.0;

				            }

				            ct_q[static_cast<size_t>(ct)] = q_value;

				            break; // we parsed quality value

				        }

				        if (!ct_q[static_cast<size_t>(ct)].has_value()) {

				            ct_q[static_cast<size_t>(ct)] = 1.0f; // default quality value

				        }

				        // keep the highest encoding (in the order, unless 'any')

				        if (selected_ct == compression_type::any) {

				            if (ct_q[static_cast<size_t>(ct)] >= ct_q[static_cast<size_t>(selected_ct)]) {

				                selected_ct = ct;

				            }

				        } else {

				            if (ct_q[static_cast<size_t>(ct)] > ct_q[static_cast<size_t>(selected_ct)]) {

				                selected_ct = ct;

				            }

				        }

				    }

				    if (selected_ct == compression_type::any) {

				        // select any not mentioned or highest quality

				        selected_ct = compression_type::none;

				        for (size_t i = 0; i < static_cast<size_t>(compression_type::compressions_count); ++i) {

				            if (!ct_q[i].has_value()) {

				                return static_cast<compression_type>(i);

				            }

				            if (ct_q[i] > ct_q[static_cast<size_t>(selected_ct)]) {

				                selected_ct = static_cast<compression_type>(i);

				            }

				        }

				    }

				    return selected_ct;

				}

				static future<chunked_content> compress(response_compressor::compression_type ct, const db::config& cfg, std::string str) {

				    chunked_content compressed;

				    auto write = [&compressed](temporary_buffer<char>&& buf) -> future<> {

				        compressed.push_back(std::move(buf));

				        return make_ready_future<>();

				    };

				    zlib_compressor compressor(ct != response_compressor::compression_type::deflate,

				        cfg.alternator_response_gzip_compression_level(), std::move(write));

				    co_await compressor.compress(str.data(), str.size(), true);

				    co_return compressed;

				}

				static sstring flatten(chunked_content&& cc) {

				    size_t total_size = 0;

				    for (const auto& chunk : cc) {

				        total_size += chunk.size();

				    }

				    sstring result = sstring{ sstring::initialized_later{}, total_size };

				    size_t offset = 0;

				    for (const auto& chunk : cc) {

				        std::copy(chunk.begin(), chunk.end(), result.begin() + offset);

				        offset += chunk.size();

				    }

				    return result;

				}

				future<std::unique_ptr<http::reply>> response_compressor::generate_reply(std::unique_ptr<http::reply> rep, sstring accept_encoding, const char* content_type, std::string&& response_body) {

				    response_compressor::compression_type ct = find_compression(accept_encoding, response_body.size());

				    if (ct != response_compressor::compression_type::none) {

				        rep->add_header("Content-Encoding", get_encoding_name(ct));

				        rep->set_content_type(content_type);

				        return compress(ct, cfg, std::move(response_body)).then([rep = std::move(rep)] (chunked_content compressed) mutable {

				            rep->_content = flatten(std::move(compressed));

				            return make_ready_future<std::unique_ptr<http::reply>>(std::move(rep));

				        });

				    } else {

				        // Note that despite the move, there is a copy here -

				        // as str is std::string and rep->_content is sstring.

				        rep->_content = std::move(response_body);

				        rep->set_content_type(content_type);

				    }

				    return make_ready_future<std::unique_ptr<http::reply>>(std::move(rep));

				}

				template<typename Compressor>

				class compressed_data_sink_impl : public data_sink_impl {

				    output_stream<char> _out;

				    Compressor _compressor;

				public:

				    template<typename... Args>

				    compressed_data_sink_impl(output_stream<char>&& out, Args&&... args)

				     : _out(std::move(out)), _compressor(std::forward<Args>(args)..., [this](temporary_buffer<char>&& buf) {

				        return _out.write(std::move(buf));

				    }) { }

				    future<> put(std::span<temporary_buffer<char>> data) override {

				        return data_sink_impl::fallback_put(data, [this] (temporary_buffer<char>&& buf) {

				            return do_put(std::move(buf));

				        });

				    }

				private:

				    future<> do_put(temporary_buffer<char> buf) {

				        co_return co_await _compressor.compress(buf.get(), buf.size());

				    }

				    future<> close() override {

				        return _compressor.close().then([this] {

				            return _out.close();

				        });

				    }

				};

				executor::body_writer compress(response_compressor::compression_type ct, const db::config& cfg, executor::body_writer&& bw) {

				    return [bw = std::move(bw), ct, level = cfg.alternator_response_gzip_compression_level()](output_stream<char>&& out) mutable -> future<> {

				        output_stream_options opts;

				        opts.trim_to_size = true;

				        std::unique_ptr<data_sink_impl> data_sink_impl;

				        switch (ct) {

				            case response_compressor::compression_type::gzip:

				                data_sink_impl = std::make_unique<compressed_data_sink_impl<zlib_compressor>>(std::move(out), true, level);

				                break;

				            case response_compressor::compression_type::deflate:

				                data_sink_impl = std::make_unique<compressed_data_sink_impl<zlib_compressor>>(std::move(out), false, level);

				                break;

				            case response_compressor::compression_type::none:

				            case response_compressor::compression_type::any:

				            case response_compressor::compression_type::unknown:

				                on_internal_error(slogger,"Compression not selected");

				            default:

				                on_internal_error(slogger, "Unsupported compression type for data sink");

				        }

				        return bw(output_stream<char>(data_sink(std::move(data_sink_impl)), compressed_buffer_size, opts));

				    };

				}

				future<std::unique_ptr<http::reply>> response_compressor::generate_reply(std::unique_ptr<http::reply> rep, sstring accept_encoding, const char* content_type, executor::body_writer&& body_writer) {

				    response_compressor::compression_type ct = find_compression(accept_encoding, std::numeric_limits<size_t>::max());

				    if (ct != response_compressor::compression_type::none) {

				        rep->add_header("Content-Encoding", get_encoding_name(ct));

				        rep->write_body(content_type, compress(ct, cfg, std::move(body_writer)));

				    } else {

				        rep->write_body(content_type, std::move(body_writer));

				    }

				    return make_ready_future<std::unique_ptr<http::reply>>(std::move(rep));

				}

				} // namespace alternator
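
To make the Accept-Encoding handling in find_compression() easier to follow, here is a simplified, standalone sketch of the same idea: split the header on ',', split each entry on ';', read an optional "q=" weight, and keep the supported encoding with the highest weight. It is an illustration under assumptions, not the patch's implementation: pick_encoding and iequals are hypothetical names, the "*" wildcard and the size thresholds are left out, and std::strtof stands in for std::from_chars.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <string_view>
#include <vector>

namespace sketch {

inline std::string_view trim(std::string_view sv) {
    while (!sv.empty() && std::isspace(static_cast<unsigned char>(sv.front()))) sv.remove_prefix(1);
    while (!sv.empty() && std::isspace(static_cast<unsigned char>(sv.back()))) sv.remove_suffix(1);
    return sv;
}

inline std::vector<std::string_view> split(std::string_view text, char sep) {
    std::vector<std::string_view> out;
    if (text.empty()) return out;
    while (true) {
        auto pos = text.find(sep);
        if (pos == std::string_view::npos) { out.push_back(text); break; }
        out.push_back(text.substr(0, pos));
        text.remove_prefix(pos + 1);
    }
    return out;
}

inline bool iequals(std::string_view a, std::string_view b) {
    if (a.size() != b.size()) return false;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (std::tolower(static_cast<unsigned char>(a[i])) !=
            std::tolower(static_cast<unsigned char>(b[i]))) return false;
    }
    return true;
}

// Returns "gzip", "deflate" or "identity" (hypothetical helper).
inline std::string pick_encoding(std::string_view accept_encoding) {
    std::string best = "identity";
    float best_q = 0.0f;  // an encoding must have q > 0 to be chosen
    for (std::string_view entry : split(accept_encoding, ',')) {
        auto params = split(entry, ';');
        if (params.empty()) continue;
        std::string_view name = trim(params[0]);
        const bool is_gzip = iequals(name, "gzip");
        if (!is_gzip && !iequals(name, "deflate")) continue;  // ignore unknown encodings
        float q = 1.0f;  // default quality value
        for (std::size_t i = 1; i < params.size(); ++i) {
            std::string_view p = trim(params[i]);
            if (p.rfind("q=", 0) != 0) continue;  // not a quality parameter
            std::string num(trim(p.substr(2)));
            char* end = nullptr;
            float parsed = std::strtof(num.c_str(), &end);
            if (end != num.c_str() && *end == '\0') {
                q = parsed < 0.0f ? 0.0f : (parsed > 1.0f ? 1.0f : parsed);
            }
            break;
        }
        if (q > best_q) {  // strict '>' keeps the first of equally weighted entries
            best_q = q;
            best = is_gzip ? "gzip" : "deflate";
        }
    }
    return best;
}

} // namespace sketch
```

With this sketch, `sketch::pick_encoding("deflate;q=0.9, gzip;q=0.5")` selects deflate, and a header that names only unsupported encodings falls back to identity.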

									
										91

alternator/http_compression.hh
									
										Normal file
									
												
				@@ -0,0 +1,91 @@

				/*

				 * Copyright 2025-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#pragma once

				#include "alternator/executor.hh"

				#include <seastar/http/httpd.hh>

				#include "db/config.hh"

				namespace alternator {

				class response_compressor {

				public:

				    enum class compression_type {

				        gzip,

				        deflate,

				        compressions_count,

				        any = compressions_count,

				        none,

				        count,

				        unknown = count

				    };

				    static constexpr std::string_view compression_names[] = {

				        "gzip",

				        "deflate",

				        "*",

				        "identity"

				    };

				    static sstring get_encoding_name(compression_type ct) {

				        return sstring(compression_names[static_cast<size_t>(ct)]);

				    }

				    static constexpr compression_type get_compression_type(std::string_view encoding);

				    sstring get_accepted_encoding(const http::request& req) {

				        if (get_threshold() == 0) {

				            return "";

				        }

				        return req.get_header("Accept-Encoding");

				    }

				    compression_type find_compression(std::string_view accept_encoding, size_t response_size);

				    response_compressor(const db::config& cfg)

				        : cfg(cfg)

				        ,_gzip_level_observer(

				            cfg.alternator_response_gzip_compression_level.observe([this](int v) {

				                    update_threshold();

				                }))

				        ,_gzip_threshold_observer(

				            cfg.alternator_response_compression_threshold_in_bytes.observe([this](uint32_t v) {

				                    update_threshold();

				                }))

				    {

				        update_threshold();

				    }

				    response_compressor(const response_compressor& rhs) : response_compressor(rhs.cfg) {}

				private:

				    const db::config& cfg;

				    utils::observable<int>::observer _gzip_level_observer;

				    utils::observable<uint32_t>::observer _gzip_threshold_observer;

				    uint32_t _threshold[static_cast<size_t>(compression_type::count)];

				    size_t get_threshold() { return _threshold[static_cast<size_t>(compression_type::any)]; }

				    void update_threshold() {

				        _threshold[static_cast<size_t>(compression_type::none)] = std::numeric_limits<uint32_t>::max();

				        _threshold[static_cast<size_t>(compression_type::any)] = std::numeric_limits<uint32_t>::max();

				        uint32_t gzip = cfg.alternator_response_gzip_compression_level() <= 0 ? std::numeric_limits<uint32_t>::max()

				            : cfg.alternator_response_compression_threshold_in_bytes();

				        _threshold[static_cast<size_t>(compression_type::gzip)] = gzip;

				        _threshold[static_cast<size_t>(compression_type::deflate)] = gzip;

				        for (size_t i = 0; i < static_cast<size_t>(compression_type::compressions_count); ++i) {

				            if (_threshold[i] < _threshold[static_cast<size_t>(compression_type::any)]) {

				                _threshold[static_cast<size_t>(compression_type::any)] = _threshold[i];

				            }

				        }

				    }

				public:

				    future<std::unique_ptr<http::reply>> generate_reply(std::unique_ptr<http::reply> rep,

				         sstring accept_encoding, const char* content_type, std::string&& response_body);

				    future<std::unique_ptr<http::reply>> generate_reply(std::unique_ptr<http::reply> rep,

				         sstring accept_encoding, const char* content_type, executor::body_writer&& body_writer);

				};

				}
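
The compression_type enum above uses aliased enumerators (compressions_count, any, count, unknown) so that the same values serve both as array sizes and as sentinel results. The sketch below restates that layout in isolation and checks the resulting indices at compile time. The helper names idx and from_name are hypothetical, and the lookup here is an exact match, whereas the patch compares case-insensitively.

```cpp
#include <cstddef>
#include <string_view>

namespace sketch {

enum class compression_type {
    gzip,
    deflate,
    compressions_count,        // number of real compression algorithms (2)
    any = compressions_count,  // "*" shares index 2
    none,                      // "identity", index 3
    count,                     // number of named entries (4)
    unknown = count            // sentinel for names not in the table
};

constexpr std::string_view names[] = {"gzip", "deflate", "*", "identity"};

constexpr std::size_t idx(compression_type ct) {
    return static_cast<std::size_t>(ct);
}

// The name table must have exactly one entry per value below 'count'.
static_assert(idx(compression_type::any) == 2);
static_assert(idx(compression_type::none) == 3);
static_assert(sizeof(names) / sizeof(names[0]) == idx(compression_type::count));

// Exact-match lookup; returns 'unknown' when the name is not in the table.
constexpr compression_type from_name(std::string_view s) {
    for (std::size_t i = 0; i < idx(compression_type::count); ++i) {
        if (names[i] == s) {
            return static_cast<compression_type>(i);
        }
    }
    return compression_type::unknown;
}

} // namespace sketch
```

Looping up to compressions_count visits only gzip and deflate, while looping up to count also covers "*" and "identity", which matches how the two bounds are used in the patch.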

									
										109

alternator/parsed_expression_cache.cc
									
										Normal file
									
												
				@@ -0,0 +1,109 @@

				/*

				 * Copyright 2025-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#include "expressions.hh"

				#include "utils/log.hh"

				#include "utils/lru_string_map.hh"

				#include <variant>

				static logging::logger logger_("parsed-expression-cache");

				namespace alternator::parsed {

				struct expression_cache_impl {

				    stats& _stats;

				    using cached_expressions_types = std::variant<

				        update_expression,

				        condition_expression,

				        std::vector<path>

				    >;

				    sized_lru_string_map<cached_expressions_types> _cached_entries;

				    utils::observable<uint32_t>::observer _max_cache_entries_observer;

				    expression_cache_impl(expression_cache::config cfg, stats& stats);

				    // to define the specialized return type of `get_or_create()`

				    template <typename Func, typename... Args>

				    using ParseResult = std::invoke_result_t<Func, std::string_view, Args...>;

				    // Caching layer for parsed expressions

				    // The expression type is determined by the type of the parsing function passed as a parameter,

				    // and the return type is exactly the same as the return type of this parsing function.

				    // StatsType is used only to update appropriate statistics - currently it is aligned with the expression type,

				    // but it could be extended in the future if needed, e.g. split per operation.

				    template <stats::expression_types StatsType, typename Func, typename... Args>

				    ParseResult<Func, Args...> get_or_create(std::string_view query, Func&& parse_func, Args&&... other_args) {

				        if (_cached_entries.disabled()) {

				            return parse_func(query, std::forward<Args>(other_args)...);

				        }

				        if (!_cached_entries.sanity_check()) {

				            _stats.expression_cache.requests[StatsType].misses++;

				            return parse_func(query, std::forward<Args>(other_args)...);

				        }

				        auto value = _cached_entries.find(query);

				        if (value) {

				            logger_.trace("Cache hit for query: {}", query);

				            _stats.expression_cache.requests[StatsType].hits++;

				            try {

				                return std::get<ParseResult<Func, Args...>>(value->get());

				            } catch (const std::bad_variant_access&) {

				                // User can reach this code, by sending the same query string as a different expression type.

				                // In practice valid queries are different enough to not collide.

				                // Entries in cache are only valid queries.

				                // This request will fail at parsing below.

				                // If, by any chance this is a valid query, it will be updated below with the new value.

				                logger_.trace("Cache hit for '{}', but type mismatch.", query);

				                _stats.expression_cache.requests[StatsType].hits--;

				            }

				        } else {

				            logger_.trace("Cache miss for query: {}", query);

				        }

				        ParseResult<Func, Args...> expr = parse_func(query, std::forward<Args>(other_args)...);

				        // Invalid query will throw here ^

				        _stats.expression_cache.requests[StatsType].misses++;

				        if (value) [[unlikely]] {

				            value->get() = cached_expressions_types{expr};

				        } else {

				            _cached_entries.insert(query, cached_expressions_types{expr});

				        }

				        return expr;

				    }

				};

				expression_cache_impl::expression_cache_impl(expression_cache::config cfg, stats& stats) : 

				    _stats(stats), _cached_entries(logger_, _stats.expression_cache.evictions),

				    _max_cache_entries_observer(cfg.max_cache_entries.observe([this] (uint32_t max_value) {

				        _cached_entries.set_max_size(max_value);

				    })) {

				    _cached_entries.set_max_size(cfg.max_cache_entries());

				}

				expression_cache::expression_cache(expression_cache::config cfg, stats& stats) : 

				    _impl(std::make_unique<expression_cache_impl>(std::move(cfg), stats)) {

				}

				expression_cache::~expression_cache() = default;

				future<> expression_cache::stop() {

				    return _impl->_cached_entries.stop();

				}

				update_expression expression_cache::parse_update_expression(std::string_view query) {

				    return _impl->get_or_create<stats::expression_types::UPDATE_EXPRESSION>(query, alternator::parse_update_expression);

				}

				std::vector<path> expression_cache::parse_projection_expression(std::string_view query) {

				    return _impl->get_or_create<stats::expression_types::PROJECTION_EXPRESSION>(query, alternator::parse_projection_expression);

				}

				condition_expression expression_cache::parse_condition_expression(std::string_view query, const char* caller) {

				    return _impl->get_or_create<stats::expression_types::CONDITION_EXPRESSION>(query, alternator::parse_condition_expression, caller);

				}

				} // namespace alternator::parsed
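
The get_or_create() logic above can be hard to read through the template machinery, so here is a reduced sketch of the same pattern: one map keyed by the query string, a std::variant holding whichever expression type was parsed, and a type mismatch treated as a miss that overwrites the entry. Everything in it is an assumption made for illustration: std::unordered_map replaces sized_lru_string_map (so there is no eviction), std::get_if replaces catching std::bad_variant_access, and update_expr, condition_expr and the parse functions are hypothetical stand-ins.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <type_traits>
#include <unordered_map>
#include <variant>
#include <vector>

namespace sketch {

struct update_expr { std::string text; };
struct condition_expr { std::string text; };
using cached = std::variant<update_expr, condition_expr, std::vector<std::string>>;

class expr_cache {
    std::unordered_map<std::string, cached> _entries;
public:
    std::size_t hits = 0;
    std::size_t misses = 0;

    // The return type follows the parse function, as in the patch.
    template <typename Func>
    auto get_or_create(std::string_view query, Func&& parse)
            -> std::invoke_result_t<Func, std::string_view> {
        using result = std::invoke_result_t<Func, std::string_view>;
        auto it = _entries.find(std::string(query));
        if (it != _entries.end()) {
            // Same string cached as a different expression type counts as a miss.
            if (auto* hit = std::get_if<result>(&it->second)) {
                ++hits;
                return *hit;
            }
        }
        result parsed = parse(query);  // an invalid query throws here; nothing is cached
        ++misses;
        _entries.insert_or_assign(std::string(query), cached{parsed});
        return parsed;
    }
};

// Stand-in "parsers" that just keep the text.
inline update_expr parse_update(std::string_view q) { return {std::string(q)}; }
inline condition_expr parse_condition(std::string_view q) { return {std::string(q)}; }

} // namespace sketch
```

Calling get_or_create twice with the same query and parser produces one miss and one hit; calling it a third time with a different parser re-parses and replaces the stored entry.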

									
										12

alternator/rmw_operation.hh
									
												
				@@ -8,6 +8,8 @@

				#pragma once

				#include "cdc/cdc_options.hh"

				#include "cdc/log.hh"

				#include "seastarx.hh"

				#include "service/paxos/cas_request.hh"

				#include "service/cas_shard.hh"

				@@ -56,7 +58,7 @@ public:

				    static write_isolation get_write_isolation_for_schema(schema_ptr schema);

				    static write_isolation default_write_isolation;

				public:

				    static void set_default_write_isolation(std::string_view mode);

				protected:

				@@ -107,10 +109,11 @@ public:

				    // violating this). We mark apply() "const" to let the compiler validate

				    // this for us. The output-only field _return_attributes is marked

				    // "mutable" above so that apply() can still write to it.

				    virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts) const = 0;

				    virtual std::optional<mutation> apply(std::unique_ptr<rjson::value> previous_item, api::timestamp_type ts, cdc::per_request_options& cdc_opts) const = 0;

				    // Convert the above apply() into the signature needed by cas_request:

				    virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts) override;

				    virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts, cdc::per_request_options& cdc_opts) override;

				    virtual ~rmw_operation() = default;

				    const wcu_consumed_capacity_counter& consumed_capacity() const noexcept { return _consumed_capacity; }

				    schema_ptr schema() const { return _schema; }

				    const rjson::value& request() const { return _request; }

				    rjson::value&& move_request() && { return std::move(_request); }

				@@ -124,6 +127,9 @@ public:

				            stats& per_table_stats,

				            uint64_t& wcu_total);

				    std::optional<service::cas_shard> shard_for_execute(bool needs_read_before_write);

				private:

				    inline bool should_fill_preimage() const { return _schema->cdc_options().enabled(); }

				};

				} // namespace alternator

									
										51

alternator/serialization.cc
									
												
				@@ -11,8 +11,8 @@

				#include "utils/log.hh"

				#include "serialization.hh"

				#include "error.hh"

				#include "concrete_types.hh"

				#include "cql3/type_json.hh"

				#include "types/concrete_types.hh"

				#include "types/json_utils.hh"

				#include "mutation/position_in_partition.hh"

				static logging::logger slogger("alternator-serialization");

				@@ -282,15 +282,23 @@ std::string type_to_string(data_type type) {

				    return it->second;

				}

				bytes get_key_column_value(const rjson::value& item, const column_definition& column) {

				std::optional<bytes> try_get_key_column_value(const rjson::value& item, const column_definition& column) {

				    std::string column_name = column.name_as_text();

				    const rjson::value* key_typed_value = rjson::find(item, column_name);

				    if (!key_typed_value) {

				        throw api_error::validation(fmt::format("Key column {} not found", column_name));

				        return std::nullopt;

				    }

				    return get_key_from_typed_value(*key_typed_value, column);

				}

				bytes get_key_column_value(const rjson::value& item, const column_definition& column) {

				    auto value = try_get_key_column_value(item, column);

				    if (!value) {

				        throw api_error::validation(fmt::format("Key column {} not found", column.name_as_text()));

				    }

				    return std::move(*value);

				}

				// Parses the JSON encoding for a key value, which is a map with a single

				// entry whose key is the type and the value is the encoded value.

				// If this type does not match the desired "type_str", an api_error::validation

				@@ -380,20 +388,38 @@ clustering_key ck_from_json(const rjson::value& item, schema_ptr schema) {

				        return clustering_key::make_empty();

				    }

				    std::vector<bytes> raw_ck;

				    // FIXME: this is a loop, but we really allow only one clustering key column.

				    // Note: it's possible to get more than one clustering column here, as

				    // Alternator can be used to read scylla internal tables.

				    for (const column_definition& cdef : schema->clustering_key_columns()) {

				        bytes raw_value = get_key_column_value(item,  cdef);

				        auto raw_value = get_key_column_value(item,  cdef);

				        raw_ck.push_back(std::move(raw_value));

				    }

				    return clustering_key::from_exploded(raw_ck);

				}

				position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {

				    auto ck = ck_from_json(item, schema);

				    if (is_alternator_keyspace(schema->ks_name())) {

				        return position_in_partition::for_key(std::move(ck));

				clustering_key_prefix ck_prefix_from_json(const rjson::value& item, schema_ptr schema) {

				    if (schema->clustering_key_size() == 0) {

				        return clustering_key_prefix::make_empty();

				    }

				    std::vector<bytes> raw_ck;

				    for (const column_definition& cdef : schema->clustering_key_columns()) {

				        auto raw_value = try_get_key_column_value(item,  cdef);

				        if (!raw_value) {

				            break;

				        }

				        raw_ck.push_back(std::move(*raw_value));

				    }

				    return clustering_key_prefix::from_exploded(raw_ck);

				}

				position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {

				    const bool is_alternator_ks = is_alternator_keyspace(schema->ks_name());

				    if (is_alternator_ks) {

				        return position_in_partition::for_key(ck_from_json(item, schema));

				    }

				    const auto region_item = rjson::find(item, scylla_paging_region);

				    const auto weight_item = rjson::find(item, scylla_paging_weight);

				    if (bool(region_item) != bool(weight_item)) {

				@@ -413,8 +439,9 @@ position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema)

				        } else {

				            throw std::runtime_error(fmt::format("Invalid value for weight: {}", weight_view));

				        }

				        return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(std::move(ck)) : std::nullopt);

				        return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(ck_prefix_from_json(item, schema)) : std::nullopt);

				    }

				    auto ck = ck_from_json(item, schema);

				    if (ck.is_empty()) {

				        return position_in_partition::for_partition_start();

				    }

				@@ -469,7 +496,7 @@ const std::pair<std::string, const rjson::value*> unwrap_set(const rjson::value&

				        return {"", nullptr};

				    }

				    auto it = v.MemberBegin();

				    const std::string it_key = it->name.GetString();

				    const std::string it_key = rjson::to_string(it->name);

				    if (it_key != "SS" && it_key != "BS" && it_key != "NS") {

				        return {std::move(it_key), nullptr};

				    }

									
										2

alternator/serialization.hh
									
				@@ -55,7 +55,7 @@ partition_key pk_from_json(const rjson::value& item, schema_ptr schema);

				clustering_key ck_from_json(const rjson::value& item, schema_ptr schema);

				position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema);

				// If v encodes a number (i.e., it is a {"N": [...]}, returns an object representing it.  Otherwise,

				// If v encodes a number (i.e., it is a {"N": [...]}), returns an object representing it.  Otherwise,

				// raises ValidationException with diagnostic.

				big_decimal unwrap_number(const rjson::value& v, std::string_view diagnostic);

									
										482

alternator/server.cc
									
				@@ -13,6 +13,7 @@

				#include <seastar/http/function_handlers.hh>

				#include <seastar/http/short_streams.hh>

				#include <seastar/core/coroutine.hh>

				#include <seastar/coroutine/maybe_yield.hh>

				#include <seastar/util/defer.hh>

				#include <seastar/util/short_streams.hh>

				#include "seastarx.hh"

				@@ -31,6 +32,9 @@

				#include "utils/overloaded_functor.hh"

				#include "utils/aws_sigv4.hh"

				#include "client_data.hh"

				#include "utils/updateable_value.hh"

				#include <zlib.h>

				#include "alternator/http_compression.hh"

				static logging::logger slogger("alternator-server");

				@@ -100,10 +104,20 @@ static void handle_CORS(const request& req, reply& rep, bool preflight) {

				// the user directly. Other exceptions are unexpected, and reported as

				// Internal Server Error.

				class api_handler : public handler_base {

				    // Although the DynamoDB API responses are JSON, additional

				    // conventions apply to these responses. For this reason, DynamoDB uses

				    // the content type "application/x-amz-json-1.0" instead of the standard

				    // "application/json". Some other AWS services use later versions instead

				    // of "1.0", but DynamoDB currently uses "1.0". Note that this content

				    // type applies to all replies, both success and error.

				    static constexpr const char* REPLY_CONTENT_TYPE = "application/x-amz-json-1.0";

				public:

				    api_handler(const std::function<future<executor::request_return_type>(std::unique_ptr<request> req)>& _handle) : _f_handle(

				    api_handler(const std::function<future<executor::request_return_type>(std::unique_ptr<request> req)>& _handle,

				                const db::config& config) : _response_compressor(config), _f_handle(

				         [this, _handle](std::unique_ptr<request> req, std::unique_ptr<reply> rep) {

				         return seastar::futurize_invoke(_handle, std::move(req)).then_wrapped([this, rep = std::move(rep)](future<executor::request_return_type> resf) mutable {

				         sstring accept_encoding = _response_compressor.get_accepted_encoding(*req);

				         return seastar::futurize_invoke(_handle, std::move(req)).then_wrapped(

				            [this, rep = std::move(rep), accept_encoding=std::move(accept_encoding)](future<executor::request_return_type> resf) mutable {

				             if (resf.failed()) {

				                 // Exceptions of type api_error are wrapped as JSON and

				                 // returned to the client as expected. Other types of

				@@ -123,25 +137,20 @@ public:

				                 return make_ready_future<std::unique_ptr<reply>>(std::move(rep));

				             }

				             auto res = resf.get();

				             std::visit(overloaded_functor {

				             return std::visit(overloaded_functor {

				                [&] (std::string&& str) {

				                    // Note that despite the move, there is a copy here -

				                    // as str is std::string and rep->_content is sstring.

				                    rep->_content = std::move(str);

				                    return _response_compressor.generate_reply(std::move(rep), std::move(accept_encoding),

				                                                               REPLY_CONTENT_TYPE, std::move(str));

				                },

				                [&] (executor::body_writer&& body_writer) {

				                    // Unfortunately, write_body() forces us to choose

				                    // from a fixed and irrelevant list of "mime-types"

				                    // at this point. But we'll override it with the

				                    // correct one (application/x-amz-json-1.0) below.

				                    rep->write_body("json", std::move(body_writer));

				                    return _response_compressor.generate_reply(std::move(rep), std::move(accept_encoding),

				                                                               REPLY_CONTENT_TYPE, std::move(body_writer));

				                },

				                [&] (const api_error& err) {

				                    generate_error_reply(*rep, err);

				                    return make_ready_future<std::unique_ptr<reply>>(std::move(rep));

				                }

				             }, std::move(res));

				             return make_ready_future<std::unique_ptr<reply>>(std::move(rep));

				         });

				    }) { }

				@@ -151,7 +160,6 @@ public:

				        handle_CORS(*req, *rep, false);

				        return _f_handle(std::move(req), std::move(rep)).then(

				                [](std::unique_ptr<reply> rep) {

				                    rep->set_mime_type("application/x-amz-json-1.0");

				                    rep->done();

				                    return make_ready_future<std::unique_ptr<reply>>(std::move(rep));

				                });

				@@ -167,9 +175,11 @@ protected:

				        rjson::add(results, "message", err._msg);

				        rep._content = rjson::print(std::move(results));

				        rep._status = err._http_code;

				        rep.set_content_type(REPLY_CONTENT_TYPE);

				        slogger.trace("api_handler error case: {}", rep._content);

				    }

				    response_compressor _response_compressor;

				    future_handler_function _f_handle;

				};

				@@ -266,24 +276,57 @@ protected:

				    }

				};

				// This function increments the authentication_failures counter, and may also

				// log a warn-level message and/or throw an exception, depending on what

				// enforce_authorization and warn_authorization are set to.

				// The username and client address are only used for logging purposes -

				// they are not included in the error message returned to the client, since

				// the client knows who it is.

				// Note that if enforce_authorization is false, this function will return

				// without throwing. So a caller that doesn't want to continue after an

				// authentication_error must explicitly return after calling this function.

				template<typename Exception>

				static void authentication_error(alternator::stats& stats, bool enforce_authorization, bool warn_authorization, Exception&& e, std::string_view user, gms::inet_address client_address) {

				    stats.authentication_failures++;

				    if (enforce_authorization) {

				        if (warn_authorization) {

				            slogger.warn("alternator_warn_authorization=true: {} for user {}, client address {}", e.what(), user, client_address);

				        }

				        throw std::move(e);

				    } else {

				        if (warn_authorization) {

				            slogger.warn("If you set alternator_enforce_authorization=true the following will be enforced: {} for user {}, client address {}", e.what(), user, client_address);

				        }

				    }

				}
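The enforce/warn combinations handled by `authentication_error` can be modelled with a small standalone sketch. The names `auth_outcome_recorder` and `report_auth_failure` are hypothetical, and a vector of strings stands in for the logger. It only illustrates the control flow described in the comment above: always count, optionally warn, and throw only when enforcing.

```cpp
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the stats counter and the warn-level logger.
struct auth_outcome_recorder {
    unsigned failures = 0;
    std::vector<std::string> warnings;
};

// Count the failure, optionally record a warning, and throw only if enforcing.
// When enforce is false this returns normally, so the caller must bail out itself.
void report_auth_failure(auth_outcome_recorder& rec, bool enforce, bool warn,
                         const std::string& what) {
    rec.failures++;
    if (warn) {
        if (enforce) {
            rec.warnings.push_back(what);
        } else {
            rec.warnings.push_back("would be enforced: " + what);
        }
    }
    if (enforce) {
        throw std::runtime_error(what);
    }
}
```

This is why each call site in `verify_signature` is followed by an explicit `return`: in warn-only mode the helper does not throw, and execution would otherwise continue with an unauthenticated request.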

				future<std::string> server::verify_signature(const request& req, const chunked_content& content) {

				    if (!_enforce_authorization) {

				    if (!_enforce_authorization.get() && !_warn_authorization.get()) {

				        slogger.debug("Skipping authorization");

				        return make_ready_future<std::string>();

				    }

				    auto host_it = req._headers.find("Host");

				    if (host_it == req._headers.end()) {

				        throw api_error::invalid_signature("Host header is mandatory for signature verification");

				        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				            api_error::invalid_signature("Host header is mandatory for signature verification"), 

				            "", req.get_client_address());

				        return make_ready_future<std::string>();

				    }

				    auto authorization_it = req._headers.find("Authorization");

				    if (authorization_it == req._headers.end()) {

				        throw api_error::missing_authentication_token("Authorization header is mandatory for signature verification");

				        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				            api_error::missing_authentication_token("Authorization header is mandatory for signature verification"),

				            "", req.get_client_address());

				        return make_ready_future<std::string>();

				    }

				    std::string host = host_it->second;

				    std::string_view authorization_header = authorization_it->second;

				    auto pos = authorization_header.find_first_of(' ');

				    if (pos == std::string_view::npos || authorization_header.substr(0, pos) != "AWS4-HMAC-SHA256") {

				        throw api_error::invalid_signature(fmt::format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));

				        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				            api_error::invalid_signature(fmt::format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header)),

				            "", req.get_client_address());

				        return make_ready_future<std::string>();

				    }

				    authorization_header.remove_prefix(pos+1);

				    std::string credential;

				@@ -318,7 +361,9 @@ future<std::string> server::verify_signature(const request& req, const chunked_c

				    std::vector<std::string_view> credential_split = split(credential, '/');

				    if (credential_split.size() != 5) {

				        throw api_error::validation(fmt::format("Incorrect credential information format: {}", credential));

				        authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				            api_error::validation(fmt::format("Incorrect credential information format: {}", credential)), "", req.get_client_address());

				        return make_ready_future<std::string>();

				    }

				    std::string user(credential_split[0]);

				    std::string datestamp(credential_split[1]);

				@@ -329,39 +374,81 @@ future<std::string> server::verify_signature(const request& req, const chunked_c

				    for (const auto& header : signed_headers) {

				        signed_headers_map.emplace(header, std::string_view());

				    }

				    std::vector<std::string> modified_values;

				    for (auto& header : req._headers) {

				        std::string header_str;

				        header_str.resize(header.first.size());

				        std::transform(header.first.begin(), header.first.end(), header_str.begin(), ::tolower);

				        auto it = signed_headers_map.find(header_str);

				        if (it != signed_headers_map.end()) {

				            it->second = std::string_view(header.second);

				            // replace multiple spaces in the header value header.second with

				            // a single space, as required by AWS SigV4 header canonization.

				            // If we modify the value, we need to save it in modified_values

				            // to keep it alive.

				            std::string value;

				            value.reserve(header.second.size());

				            bool prev_space = false;

				            bool modified = false;

				            for (char ch : header.second) {

				                if (ch == ' ') {

				                    if (!prev_space) {

				                        value += ch;

				                        prev_space = true;

				                    } else {

				                        modified = true; // skip a space

				                    }

				                } else {

				                    value += ch;

				                    prev_space = false;

				                }

				            }

				            if (modified) {

				                modified_values.emplace_back(std::move(value));

				                it->second = std::string_view(modified_values.back());

				            } else {

				                it->second = std::string_view(header.second);

				            }

				        }

				    }

				    auto cache_getter = [&proxy = _proxy, &as = _auth_service] (std::string username) {

				        return get_key_from_roles(proxy, as, std::move(username));

				    auto cache_getter = [&proxy = _proxy] (std::string username) {

				        return get_key_from_roles(proxy, std::move(username));

				    };

				    return _key_cache.get_ptr(user, cache_getter).then([this, &req, &content,

				    return _key_cache.get_ptr(user, cache_getter).then_wrapped([this, &req, &content,

				                                                    user = std::move(user),

				                                                    host = std::move(host),

				                                                    datestamp = std::move(datestamp),

				                                                    signed_headers_str = std::move(signed_headers_str),

				                                                    signed_headers_map = std::move(signed_headers_map),

				                                                    modified_values = std::move(modified_values),

				                                                    region = std::move(region),

				                                                    service = std::move(service),

				                                                    user_signature = std::move(user_signature)] (key_cache::value_ptr key_ptr) {

				                                                    user_signature = std::move(user_signature)] (future<key_cache::value_ptr> key_ptr_fut) {

				        key_cache::value_ptr key_ptr(nullptr);

				        try {

				            key_ptr = key_ptr_fut.get();

				        } catch (const api_error& e) {

				            authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				                e, user, req.get_client_address());

				            return std::string();

				        }

				        std::string signature;

				        try {

				            signature = utils::aws::get_signature(user, *key_ptr, std::string_view(host), "/", req._method,

				                datestamp, signed_headers_str, signed_headers_map, &content, region, service, "");

				        } catch (const std::exception& e) {

				            throw api_error::invalid_signature(e.what());

				            authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				                api_error::invalid_signature(fmt::format("invalid signature: {}", e.what())),

				                user, req.get_client_address());

				            return std::string();

				        }

				        if (signature != std::string_view(user_signature)) {

				            _key_cache.remove(user);

				            throw api_error::unrecognized_client("The security token included in the request is invalid.");

				            authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),

				                api_error::unrecognized_client("wrong signature"),

				                user, req.get_client_address());

				            return std::string();

				        }

				        return user;

				    });
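The header canonicalization step in this hunk (collapsing runs of spaces into a single space) can be shown on its own. This is a minimal sketch with a hypothetical helper name, `collapse_spaces`. Unlike the patch, which keeps a view into the original header when nothing changed, the sketch always returns a new string.

```cpp
#include <string>
#include <string_view>

// Replace every run of consecutive spaces with a single space.
// Only the ' ' character is treated as whitespace, matching the loop in the hunk.
std::string collapse_spaces(std::string_view in) {
    std::string out;
    out.reserve(in.size());
    bool prev_space = false;
    for (char ch : in) {
        if (ch == ' ') {
            if (prev_space) {
                continue; // skip the repeated space
            }
            prev_space = true;
        } else {
            prev_space = false;
        }
        out += ch;
    }
    return out;
}
```

The patch avoids the copy in the common case by storing a modified value in `modified_values` only when a space was actually dropped, which keeps the `std::string_view` in `signed_headers_map` valid without allocating for unmodified headers.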

				@@ -374,35 +461,82 @@ static tracing::trace_state_ptr create_tracing_session(tracing::tracing& tracing

				    return tracing_instance.create_session(tracing::trace_type::QUERY, props);

				}

				// truncated_content_view() prints a potentially long chunked_content for

				// debugging purposes. In the common case when the content is not excessively

				// long, it just returns a view into the given content, without any copying.

				// But when the content is very long, it is truncated after some arbitrary

				// max_len (or one chunk, whichever comes first), with "<truncated>" added at

				// the end. To do this modification to the string, we need to create a new

				// std::string, so the caller must pass us a reference to one, "buf", where

				// we can store the content. The returned view is only alive for as long as this

				// buf is kept alive.

				static std::string_view truncated_content_view(const chunked_content& content, std::string& buf) {

				    constexpr size_t max_len = 1024;

				    if (content.empty()) {

				        return std::string_view();

				    } else if (content.size() == 1 && content.begin()->size() <= max_len) {

				        return std::string_view(content.begin()->get(), content.begin()->size());

				    } else {

				        buf = std::string(content.begin()->get(), std::min(content.begin()->size(), max_len)) + "<truncated>";

				        return std::string_view(buf);

				// A helper class to represent a potentially truncated view of a chunked_content.

				// If the content is short enough and single chunked, it just holds a view into the content.

				// Otherwise it will be copied into an internal buffer, possibly truncated (depending on maximum allowed size passed in),

				// and the view will point into that buffer.

				// `as_view()` method will return the view.

				// `take_as_sstring()` will either move out the internal buffer (if any), or create a new sstring from the view.

				// You should consider `as_view()` valid as long as both the original chunked_content and the truncated_content object are alive.

				class truncated_content {

				    std::string_view _view;

				    sstring _content_maybe;

				    void copy_from_content(const chunked_content& content) {

				        size_t offset = 0;

				        for(auto &tmp : content) {

				            size_t to_copy = std::min(tmp.size(), _content_maybe.size() - offset);

				            std::copy(tmp.get(), tmp.get() + to_copy, _content_maybe.data() + offset);

				            offset += to_copy;

				            if (offset >= _content_maybe.size()) {

				                break;

				            }

				        }

				    }

				public:

				    truncated_content(const chunked_content& content, size_t max_len = std::numeric_limits<size_t>::max()) {

				        if (content.empty()) return;

				        if (content.size() == 1 && content.begin()->size() <= max_len) {

				            _view = std::string_view(content.begin()->get(), content.begin()->size());

				            return;

				        }

				        constexpr std::string_view truncated_text = "<truncated>";

				        size_t content_size = 0;

				        for(auto &tmp : content) {

				            content_size += tmp.size();

				        }

				        if (content_size <= max_len) {

				            _content_maybe = sstring{ sstring::initialized_later{}, content_size };

				            copy_from_content(content);

				        }

				        else {

				            _content_maybe = sstring{ sstring::initialized_later{}, max_len + truncated_text.size() };

				            copy_from_content(content);

				            std::copy(truncated_text.begin(), truncated_text.end(), _content_maybe.data() + _content_maybe.size() - truncated_text.size());

				        }

				        _view = std::string_view(_content_maybe);

				    }

				    std::string_view as_view() const { return _view; }

				    sstring take_as_sstring() && {

				        if (_content_maybe.empty() && !_view.empty()) {

				            return sstring{_view};

				        }

				        return std::move(_content_maybe);

				    }

				};

				// `truncated_content_view` will produce an object representing a view to a passed content

				// possibly truncated at some length. The value returned is used in two ways:

				// - to print it in logs (use `as_view()` method for this)

				// - to pass it to tracing object, where it will be stored and used later

				//   (use `take_as_sstring()` method as this produces a copy in form of a sstring)

				// `truncated_content` delays constructing `sstring` object until it's actually needed.

				// `truncated_content` is valid as long as passed `content` is alive.

				// if the content is truncated, `<truncated>` will be appended at the maximum size limit

				// and total size will be `max_users_query_size_in_trace_output() + strlen("<truncated>")`.

				static truncated_content truncated_content_view(const chunked_content& content, size_t max_size) {

				    return truncated_content{content, max_size};

				}
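The size rule documented above (at most `max_size` bytes of content, followed by `<truncated>` when something was cut) can be sketched without Seastar types. The function name `truncate_chunks` is hypothetical and a `std::vector<std::string>` stands in for `chunked_content`. The sketch always copies, whereas `truncated_content` keeps a view when the content is a single short chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Concatenate chunks up to max_len bytes; append a marker if anything was cut.
std::string truncate_chunks(const std::vector<std::string>& chunks, std::size_t max_len) {
    static const std::string marker = "<truncated>";
    std::size_t total = 0;
    for (const auto& c : chunks) {
        total += c.size();
    }
    std::string out;
    for (const auto& c : chunks) {
        std::size_t room = max_len - out.size(); // out.size() never exceeds max_len
        out.append(c, 0, std::min(c.size(), room));
        if (out.size() >= max_len) {
            break;
        }
    }
    if (total > max_len) {
        out += marker; // final size is max_len + marker.size()
    }
    return out;
}
```

For the chunks `"hello"` and `"world"` with a limit of 7, the sketch yields the first 7 bytes followed by the marker; with a limit of 100 it yields the full concatenation unchanged.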

				static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, std::string_view username, std::string_view op, const chunked_content& query) {

				static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_state, std::string_view username, std::string_view op, const chunked_content& query, size_t max_users_query_size_in_trace_output) {

				    tracing::trace_state_ptr trace_state;

				    tracing::tracing& tracing_instance = tracing::tracing::get_local_tracing_instance();

				    if (tracing_instance.trace_next_query() || tracing_instance.slow_query_tracing_enabled()) {

				        trace_state = create_tracing_session(tracing_instance);

				        std::string buf;

				        tracing::add_session_param(trace_state, "alternator_op", op);

				        tracing::add_query(trace_state, truncated_content_view(query, buf));

				        tracing::add_query(trace_state, truncated_content_view(query, max_users_query_size_in_trace_output).take_as_sstring());

				        tracing::begin(trace_state, seastar::format("Alternator {}", op), client_state.get_client_address());

				        if (!username.empty()) {

				            tracing::set_username(trace_state, auth::authenticated_user(username));

				@@ -411,37 +545,215 @@ static tracing::trace_state_ptr maybe_trace_query(service::client_state& client_

				    return trace_state;

				}

				// This read_entire_stream() is similar to Seastar's read_entire_stream()

				// which reads the given content_stream until its end into non-contiguous

				// memory. The difference is that this implementation takes an extra length

				// limit, and throws an error if we read more than this limit.

				// This length-limited variant would not have been needed if Seastar's HTTP

				// server's set_content_length_limit() worked in every case, but unfortunately

				// it does not - it only works if the request has a Content-Length header (see

				// issue #8196). In contrast this function can limit the request's length no

				// matter how it's encoded. We need this limit to protect Alternator from

				// oversized requests that can deplete memory.

				static future<chunked_content>

				read_entire_stream(input_stream<char>& inp, size_t length_limit) {

				    chunked_content ret;

				    // We try to read length_limit + 1 bytes, so that we can throw an

				    // exception if we managed to read more than length_limit.

				    ssize_t remain = length_limit + 1;

				    do {

				        temporary_buffer<char> buf = co_await inp.read_up_to(remain);

				        if (buf.empty()) {

				            break;

				        }

				        remain -= buf.size();

				        ret.push_back(std::move(buf));

				    } while (remain > 0);

				    // If we read the full length_limit + 1 bytes, we went over the limit:

				    if (remain <= 0) {

				        // By throwing an error here, we may send a reply (the error message)

				        // without having read the full request body. Seastar's httpd will

				        // realize that we have not read the entire content stream, and

				        // correctly mark the connection unreusable, i.e., close it.

				        // This means we are currently exposed to issue #12166 (caused by

				        // Seastar issue 1325), where the client may get an RST instead of

				        // a FIN, and may rarely get a "Connection reset by peer" before

				        // reading the error we send.

				        throw api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", length_limit));

				    }

				    co_return ret;

				}
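The idea behind the length-limited `read_entire_stream()` is to accumulate chunks and fail as soon as the running total exceeds the limit, regardless of any declared length. A synchronous sketch follows. The name `read_limited` is hypothetical, a vector of strings stands in for the input stream, and `std::length_error` stands in for `api_error::payload_too_large`. The patch itself achieves the same effect by asking the stream for at most `length_limit + 1` bytes.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Gather chunks until the simulated stream ends (an empty chunk) or runs out,
// throwing once more than length_limit bytes have been seen.
std::vector<std::string> read_limited(const std::vector<std::string>& incoming,
                                      std::size_t length_limit) {
    std::vector<std::string> ret;
    std::size_t total = 0;
    for (const auto& chunk : incoming) {
        if (chunk.empty()) {
            break; // end of stream
        }
        total += chunk.size();
        if (total > length_limit) {
            throw std::length_error("request content length limit exceeded");
        }
        ret.push_back(chunk);
    }
    return ret;
}
```

A body of exactly `length_limit` bytes is accepted; one more byte triggers the error, which matches the `length_limit + 1` read budget in the coroutine above.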

				// safe_gzip_zstream is an exception-safe wrapper for zlib's z_stream.

				// The "z_stream" struct is used by zlib to hold state while decompressing a

				// stream of data. It allocates memory which must be freed with inflateEnd(),

				// which the destructor of this class does.

				class safe_gzip_zstream {

				    z_stream _zs;

				public:

				    // If gzip is true, decode a gzip header (for "Content-Encoding: gzip").

				    // Otherwise, a zlib header (for "Content-Encoding: deflate").

				    safe_gzip_zstream(bool gzip = true) {

				        memset(&_zs, 0, sizeof(_zs));

				        if (inflateInit2(&_zs, gzip ? 16 + MAX_WBITS : MAX_WBITS) != Z_OK) {

				            // Should only happen if memory allocation fails

				            throw std::bad_alloc();

				        }

				    }

				    ~safe_gzip_zstream() {

				        inflateEnd(&_zs);

				    }

				    z_stream* operator->() {

				        return &_zs;

				    }

				    z_stream* get() {

				        return &_zs;

				    }

				    void reset() {

				        inflateReset(&_zs);

				    }

				};

				// ungzip() takes a chunked_content of a compressed request body, and returns

				// the uncompressed content as a chunked_content. If gzip is true, we expect

				// a gzip header (for "Content-Encoding: gzip"); if gzip is false, we expect a

				// zlib header (for "Content-Encoding: deflate").

				// If the uncompressed content exceeds length_limit, an error is thrown.

				static future<chunked_content>

				ungzip(chunked_content&& compressed_body, size_t length_limit, bool gzip = true) {

				    chunked_content ret;

				    // output_buf can be any size - when uncompressing input_buf, it doesn't

				    // need to fit in a single output_buf, we'll use multiple output_buf for

				    // a single input_buf if needed.

				    constexpr size_t OUTPUT_BUF_SIZE = 4096;

				    temporary_buffer<char> output_buf;

				    safe_gzip_zstream strm(gzip);

				    bool complete_stream = false; // empty input is not a valid gzip/deflate

				    size_t total_out_bytes = 0;

				    for (const temporary_buffer<char>& input_buf : compressed_body) {

				        if (input_buf.empty()) {

				            continue;

				        }

				        complete_stream = false;

				        strm->next_in = (Bytef*) input_buf.get();

				        strm->avail_in = (uInt) input_buf.size();

				        do {

				            co_await coroutine::maybe_yield();

				            if (output_buf.empty()) {

				                output_buf = temporary_buffer<char>(OUTPUT_BUF_SIZE);

				            }

				            strm->next_out = (Bytef*) output_buf.get();

				            strm->avail_out = OUTPUT_BUF_SIZE;

				            int e = inflate(strm.get(), Z_NO_FLUSH);

				            size_t out_bytes = OUTPUT_BUF_SIZE - strm->avail_out;

				            if (out_bytes > 0) {

				                // If output_buf is nearly full, we save it as-is in ret. But

				                // if it only has little data, better copy to a small buffer.

				                if (out_bytes > OUTPUT_BUF_SIZE/2) {

				                    ret.push_back(std::move(output_buf).prefix(out_bytes));

				                    // output_buf is now empty. if this loop finds more input,

				                    // we'll allocate a new output buffer.

				                } else {

				                    ret.push_back(temporary_buffer<char>(output_buf.get(), out_bytes));

				                }

				                total_out_bytes += out_bytes;

				                if (total_out_bytes > length_limit) {

				                    throw api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", length_limit));

				                }

				            }

				            if (e == Z_STREAM_END) {

				                // There may be more input after the first gzip stream - in

				                // either this input_buf or the next one. The additional input

				                // should be a second concatenated gzip. We need to allow that

				                // by resetting the gzip stream and continuing the input loop

				                // until there's no more input.

				                strm.reset();

				                if (strm->avail_in == 0) {

				                    complete_stream = true;

				                    break;

				                }

				            } else if (e != Z_OK && e != Z_BUF_ERROR) {

				                // DynamoDB returns an InternalServerError when given a bad

				                // gzip request body. See test test_broken_gzip_content

				                throw api_error::internal("Error during gzip decompression of request body");

				            }

				        } while (strm->avail_in > 0 || strm->avail_out == 0);

				    }

				    if (!complete_stream) {

				        // The gzip stream was not properly finished with Z_STREAM_END

				        throw api_error::internal("Truncated gzip in request body");

				    }

				    co_return ret;

				}

				future<executor::request_return_type> server::handle_api_request(std::unique_ptr<request> req) {

				    _executor._stats.total_operations++;

				    sstring target = req->get_header("X-Amz-Target");

				    // target is DynamoDB API version followed by a dot '.' and operation type (e.g. CreateTable)

				    auto dot = target.find('.');

				    std::string_view op = (dot == sstring::npos) ? std::string_view() : std::string_view(target).substr(dot+1);

				    if (req->content_length > request_content_length_limit) {

				        // If we have a Content-Length header and know the request will be too

				        // long, we don't need to wait for read_entire_stream() below to

				        // discover it. And we definitely mustn't try to get_units() below

				        // for such a size.

				        co_return api_error::payload_too_large(fmt::format("Request content length limit of {} bytes exceeded", request_content_length_limit));

				    }

				    // JSON parsing can allocate up to roughly 2x the size of the raw

				    // document, + a couple of bytes for maintenance.

				    // TODO: consider the case where req->content_length is missing. Maybe

				    // we need to take the content_length_limit and return some of the units

				    // when we finish read_content_and_verify_signature?

				    size_t mem_estimate = req->content_length * 2 + 8000;

				    // If the Content-Length of the request is not available, we assume

				    // the largest possible request (request_content_length_limit, i.e., 16 MB)

				    // and after reading the request we return_units() the excess.

				    size_t mem_estimate = (req->content_length ? req->content_length : request_content_length_limit) * 2 + 8000;

				    auto units_fut = get_units(*_memory_limiter, mem_estimate);

				    if (_memory_limiter->waiters()) {

				        ++_executor._stats.requests_blocked_memory;

				    }

				    auto units = co_await std::move(units_fut);

				    SCYLLA_ASSERT(req->content_stream);

				    chunked_content content = co_await util::read_entire_stream(*req->content_stream);

				    throwing_assert(req->content_stream);

				    chunked_content content = co_await read_entire_stream(*req->content_stream, request_content_length_limit);

				    // If the request had no Content-Length, we reserved too many units

				    // so need to return some

				    if (req->content_length == 0) {

				        size_t content_length = 0;

				        for (const auto& chunk : content) {

				            content_length += chunk.size();

				        }

				        size_t new_mem_estimate = content_length * 2 + 8000;

				        units.return_units(mem_estimate - new_mem_estimate);

				    }

				    auto username = co_await verify_signature(*req, content);

				    // If the request is compressed, uncompress it now, after we checked

				    // the signature (the signature is computed on the compressed content).

				    // We apply the request_content_length_limit again to the uncompressed

				    // content - we don't want to allow a tiny compressed request to

				    // expand to a huge uncompressed request.

				    sstring content_encoding = req->get_header("Content-Encoding");

				    if (content_encoding == "gzip") {

				        content = co_await ungzip(std::move(content), request_content_length_limit);

				    } else if (content_encoding == "deflate") {

				        content = co_await ungzip(std::move(content), request_content_length_limit, false);

				    } else if (!content_encoding.empty()) {

				        // DynamoDB returns a 500 error for unsupported Content-Encoding.

				        // I'm not sure if this is the best error code, but let's do it too.

				        // See the test test_garbage_content_encoding confirming this case.

				        co_return api_error::internal("Unsupported Content-Encoding");

				    }

				    // As long as the system_clients_entry object is alive, this request will

				    // be visible in the "system.clients" virtual table. When requested, this

				    // entry will be formatted by server::ongoing_request::make_client_data().

				    auto user_agent_header = co_await _connection_options_keys_and_values.get_or_load(req->get_header("User-Agent"), [] (const client_options_cache_key_type&) {

				        return make_ready_future<options_cache_value_type>(options_cache_value_type{});

				    });

				    auto system_clients_entry = _ongoing_requests.emplace(

				        req->get_client_address(), req->get_header("User-Agent"),

				        req->get_client_address(), std::move(user_agent_header),

				        username, current_scheduling_group(),

				        req->get_protocol_name() == "https");

				    if (slogger.is_enabled(log_level::trace)) {

				        std::string buf;

				        slogger.trace("Request: {} {} {}", op, truncated_content_view(content, buf), req->_headers);

				        slogger.trace("Request: {} {} {}", op, truncated_content_view(content, _max_users_query_size_in_trace_output).as_view(), req->_headers);

				    }

				    auto callback_it = _callbacks.find(op);

				    if (callback_it == _callbacks.end()) {

				@@ -459,9 +771,9 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr

				    if (!username.empty()) {

				        client_state.set_login(auth::authenticated_user(username));

				    }

				    co_await client_state.maybe_update_per_service_level_params();

				    client_state.maybe_update_per_service_level_params();

				    tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content);

				    tracing::trace_state_ptr trace_state = maybe_trace_query(client_state, username, op, content, _max_users_query_size_in_trace_output.get());

				    tracing::trace(trace_state, "{}", op);

				    auto user = client_state.user();

				@@ -481,7 +793,7 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr

				void server::set_routes(routes& r) {

				    api_handler* req_handler = new api_handler([this] (std::unique_ptr<request> req) mutable {

				        return handle_api_request(std::move(req));

				    });

				    }, _proxy.data_dictionary().get_config());

				    r.put(operation_type::POST, "/", req_handler);

				    r.put(operation_type::GET, "/", new health_handler(_pending_requests));

				@@ -512,7 +824,7 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos

				        , _auth_service(auth_service)

				        , _sl_controller(sl_controller)

				        , _key_cache(1024, 1min, slogger)

				        , _enforce_authorization(false)

				        , _max_users_query_size_in_trace_output(1024)

				        , _enabled_servers{}

				        , _pending_requests("alternator::server::pending_requests")

				        , _timeout_config(_proxy.data_dictionary().get_config())

				@@ -592,28 +904,39 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos

				    } {

				}

				future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,

				        utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {

				future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port,

				        std::optional<uint16_t> port_proxy_protocol, std::optional<uint16_t> https_port_proxy_protocol,

				        std::optional<tls::credentials_builder> creds,

				        utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization, utils::updateable_value<uint64_t> max_users_query_size_in_trace_output,

				        semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {

				    _memory_limiter = memory_limiter;

				    _enforce_authorization = std::move(enforce_authorization);

				    _warn_authorization = std::move(warn_authorization);

				    _max_concurrent_requests = std::move(max_concurrent_requests);

				    if (!port && !https_port) {

				    _max_users_query_size_in_trace_output = std::move(max_users_query_size_in_trace_output);

				    if (!port && !https_port && !port_proxy_protocol && !https_port_proxy_protocol) {

				        return make_exception_future<>(std::runtime_error("Either regular port or TLS port"

				                " must be specified in order to init an alternator HTTP server instance"));

				    }

				    return seastar::async([this, addr, port, https_port, creds] {

				    return seastar::async([this, addr, port, https_port, port_proxy_protocol, https_port_proxy_protocol, creds] {

				        _executor.start().get();

				        if (port) {

				        if (port || port_proxy_protocol) {

				            set_routes(_http_server._routes);

				            _http_server.set_content_length_limit(server::content_length_limit);

				            _http_server.set_content_streaming(true);

				            _http_server.listen(socket_address{addr, *port}).get();

				            if (port) {

				                _http_server.listen(socket_address{addr, *port}).get();

				            }

				            if (port_proxy_protocol) {

				                listen_options lo;

				                lo.reuse_address = true;

				                lo.proxy_protocol = true;

				                _http_server.listen(socket_address{addr, *port_proxy_protocol}, lo).get();

				            }

				            _enabled_servers.push_back(std::ref(_http_server));

				        }

				        if (https_port) {

				        if (https_port || https_port_proxy_protocol) {

				            set_routes(_https_server._routes);

				            _https_server.set_content_length_limit(server::content_length_limit);

				            _https_server.set_content_streaming(true);

				            if (this_shard_id() == 0) {

				@@ -632,7 +955,15 @@ future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std:

				            } else {

				                _credentials = creds->build_server_credentials();

				            }

				            _https_server.listen(socket_address{addr, *https_port}, _credentials).get();

				            if (https_port) {

				                _https_server.listen(socket_address{addr, *https_port}, _credentials).get();

				            }

				            if (https_port_proxy_protocol) {

				                listen_options lo;

				                lo.reuse_address = true;

				                lo.proxy_protocol = true;

				                _https_server.listen(socket_address{addr, *https_port_proxy_protocol}, lo, _credentials).get();

				            }

				            _enabled_servers.push_back(std::ref(_https_server));

				        }

				    });

				@@ -705,16 +1036,15 @@ client_data server::ongoing_request::make_client_data() const {

				    // and keep "driver_version" unset.

				    cd.driver_name = _user_agent;

				    // Leave "protocol_version" unset, it has no meaning in Alternator.

				    // Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset.

				    // As reported in issue #9216, we never set these fields in CQL

				    // either (see cql_server::connection::make_client_data()).

				    // Leave "hostname", "ssl_protocol" and "ssl_cipher_suite" unset for Alternator.

				    // Note: CQL sets ssl_protocol and ssl_cipher_suite via generic_server::connection base class.

				    return cd;

				}

				future<utils::chunked_vector<client_data>> server::get_client_data() {

				    utils::chunked_vector<client_data> ret;

				future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> server::get_client_data() {

				    utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>> ret;

				    co_await _ongoing_requests.for_each_gently([&ret] (const ongoing_request& r) {

				        ret.emplace_back(r.make_client_data());

				        ret.emplace_back(make_foreign(std::make_unique<client_data>(r.make_client_data())));

				    });

				    co_return ret;

				}
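The memory reservation in handle_api_request() above reserves units for the largest allowed body when Content-Length is missing, then hands back the excess once the body has been read. The arithmetic can be sketched in isolation as below. This is a minimal illustration only: `parse_overhead`, `initial_estimate` and `excess_units` are hypothetical names that do not appear in the patch, and the Seastar semaphore calls are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>

// Values taken from the diff: a 16 MB body limit and an 8000-byte allowance.
// parse_overhead is a hypothetical name for the literal 8000.
constexpr std::size_t request_content_length_limit = 16 * 1024 * 1024;
constexpr std::size_t parse_overhead = 8000;

// Memory to reserve before reading the body. A content_length of 0 is
// treated as "unknown", so the largest allowed request is assumed.
constexpr std::size_t initial_estimate(std::size_t content_length) {
    std::size_t assumed = content_length ? content_length : request_content_length_limit;
    return assumed * 2 + parse_overhead;
}

// Units that can be handed back once the real body size is known.
// Returns 0 when Content-Length was present, since nothing extra was reserved.
inline std::size_t excess_units(std::size_t content_length, std::size_t actual_body_size) {
    if (content_length != 0) {
        return 0;
    }
    // The body reader enforces the limit, so the subtraction cannot underflow.
    assert(actual_body_size <= request_content_length_limit);
    return initial_estimate(0) - (actual_body_size * 2 + parse_overhead);
}
```

For a request without Content-Length whose body turns out to be 100 bytes, almost the whole 32 MB reservation is returned, so a small chunked request only briefly holds the worst-case reservation.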

									
										20

alternator/server.hh
									
												
				@@ -28,7 +28,11 @@ namespace alternator {

				using chunked_content = rjson::chunked_content;

				class server : public peering_sharded_service<server> {

				    static constexpr size_t content_length_limit = 16*MB;

				    // The maximum size of a request body that Alternator will accept,

				    // in bytes. This is a safety measure to prevent Alternator from

				    // running out of memory when a client sends a very large request.

				    // DynamoDB also has the same limit set to 16 MB.

				    static constexpr size_t request_content_length_limit = 16*MB;

				    using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,

				            tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<http::request>)>;

				    using alternator_callbacks_map = std::unordered_map<std::string_view, alternator_callback>;

				@@ -43,12 +47,15 @@ class server : public peering_sharded_service<server> {

				    key_cache _key_cache;

				    utils::updateable_value<bool> _enforce_authorization;

				    utils::updateable_value<bool> _warn_authorization;

				    utils::updateable_value<uint64_t> _max_users_query_size_in_trace_output;

				    utils::small_vector<std::reference_wrapper<seastar::httpd::http_server>, 2> _enabled_servers;

				    named_gate _pending_requests;

				    // In some places we will need a CQL updateable_timeout_config object even

				    // though it isn't really relevant for Alternator which defines its own

				    // timeouts separately. We can create this object only once.

				    updateable_timeout_config _timeout_config;

				    client_options_cache_type _connection_options_keys_and_values;

				    alternator_callbacks_map _callbacks;

				@@ -82,7 +89,7 @@ class server : public peering_sharded_service<server> {

				    // is called when reading the "system.clients" virtual table.

				    struct ongoing_request {

				        socket_address _client_address;

				        sstring _user_agent;

				        client_options_cache_entry_type _user_agent;

				        sstring _username;

				        scheduling_group _scheduling_group;

				        bool _is_https;

				@@ -93,14 +100,17 @@ class server : public peering_sharded_service<server> {

				public:

				    server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);

				    future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,

				            utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);

				    future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port,

				            std::optional<uint16_t> port_proxy_protocol, std::optional<uint16_t> https_port_proxy_protocol,

				            std::optional<tls::credentials_builder> creds,

				            utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization, utils::updateable_value<uint64_t> max_users_query_size_in_trace_output,

				            semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);

				    future<> stop();

				    // get_client_data() is called (on each shard separately) when the virtual

				    // table "system.clients" is read. It is expected to generate a list of

				    // clients connected to this server (on this shard). This function is

				    // called by alternator::controller::get_client_data().

				    future<utils::chunked_vector<client_data>> get_client_data();

				    future<utils::chunked_vector<foreign_ptr<std::unique_ptr<client_data>>>> get_client_data();

				private:

				    void set_routes(seastar::httpd::routes& r);

				    // If verification succeeds, returns the authenticated user's username

									
										61

alternator/stats.cc
									
												
				@@ -14,20 +14,6 @@

				namespace alternator {

				const char* ALTERNATOR_METRICS = "alternator";

				static seastar::metrics::histogram estimated_histogram_to_metrics(const utils::estimated_histogram& histogram) {

				    seastar::metrics::histogram res;

				    res.buckets.resize(histogram.bucket_offsets.size());

				    uint64_t cumulative_count = 0;

				    res.sample_count = histogram._count;

				    res.sample_sum = histogram._sample_sum;

				    for (size_t i = 0; i < res.buckets.size(); i++) {

				        auto& v = res.buckets[i];

				        v.upper_bound = histogram.bucket_offsets[i];

				        cumulative_count += histogram.buckets[i];

				        v.count = cumulative_count;

				    }

				    return res;

				}

				static seastar::metrics::label column_family_label("cf");

				static seastar::metrics::label keyspace_label("ks");

				@@ -151,10 +137,53 @@ static void register_metrics_with_optional_table(seastar::metrics::metric_groups

				            seastar::metrics::make_counter("batch_item_count", seastar::metrics::description("The total number of items processed across all batches"), labels,

				                    stats.api_operations.batch_get_item_batch_total)(op("BatchGetItem")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				                    [&stats]{ return to_metrics_histogram(stats.api_operations.batch_get_item_histogram);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,

				                    [&stats]{ return estimated_histogram_to_metrics(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				                    [&stats]{ return to_metrics_histogram(stats.api_operations.batch_write_item_histogram);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return to_metrics_histogram(stats.operation_sizes.get_item_op_size_kb);})(op("GetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return to_metrics_histogram(stats.operation_sizes.put_item_op_size_kb);})(op("PutItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return to_metrics_histogram(stats.operation_sizes.delete_item_op_size_kb);})(op("DeleteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return to_metrics_histogram(stats.operation_sizes.update_item_op_size_kb);})(op("UpdateItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return to_metrics_histogram(stats.operation_sizes.batch_get_item_op_size_kb);})(op("BatchGetItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_histogram("operation_size_kb", seastar::metrics::description("Histogram of item sizes involved in a request"), labels,

				                    [&stats]{ return to_metrics_histogram(stats.operation_sizes.batch_write_item_op_size_kb);})(op("BatchWriteItem")).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				    });

				    seastar::metrics::label expression_label("expression");

				    metrics.add_group(group_name, {

				            seastar::metrics::make_total_operations("expression_cache_evictions", stats.expression_cache.evictions,

				                    seastar::metrics::description("Counts number of entries evicted from expressions cache"), labels).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_hits", stats.expression_cache.requests[stats::expression_types::UPDATE_EXPRESSION].hits,

				                    seastar::metrics::description("Counts number of hits of cached expressions"), labels)(expression_label("UpdateExpression")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_misses", stats.expression_cache.requests[stats::expression_types::UPDATE_EXPRESSION].misses,

				                    seastar::metrics::description("Counts number of misses of cached expressions"), labels)(expression_label("UpdateExpression")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_hits", stats.expression_cache.requests[stats::expression_types::CONDITION_EXPRESSION].hits,

				                    seastar::metrics::description("Counts number of hits of cached expressions"), labels)(expression_label("ConditionExpression")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_misses", stats.expression_cache.requests[stats::expression_types::CONDITION_EXPRESSION].misses,

				                    seastar::metrics::description("Counts number of misses of cached expressions"), labels)(expression_label("ConditionExpression")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_hits", stats.expression_cache.requests[stats::expression_types::PROJECTION_EXPRESSION].hits,

				                    seastar::metrics::description("Counts number of hits of cached expressions"), labels)(expression_label("ProjectionExpression")).aggregate(aggregate_labels).set_skip_when_empty(),

				            seastar::metrics::make_total_operations("expression_cache_misses", stats.expression_cache.requests[stats::expression_types::PROJECTION_EXPRESSION].misses,

				                    seastar::metrics::description("Counts number of misses of cached expressions"), labels)(expression_label("ProjectionExpression")).aggregate(aggregate_labels).set_skip_when_empty()

				    });

				    // Only register the following metrics for the global metrics, not per-table

				    if (!has_table) {

				        metrics.add_group("alternator", {

				            seastar::metrics::make_counter("authentication_failures", stats.authentication_failures,

				                seastar::metrics::description("total number of authentication failures"), labels).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				            seastar::metrics::make_counter("authorization_failures", stats.authorization_failures,

				                seastar::metrics::description("total number of authorization failures"), labels).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),

				        });

				    }

				}

				void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats) {

									
										63

alternator/stats.hh
									
												
				@@ -16,6 +16,8 @@

				#include "cql3/stats.hh"

				namespace alternator {

				using batch_histogram = utils::estimated_histogram_with_max<128>;

				using op_size_histogram = utils::estimated_histogram_with_max<512>;

				// Object holding per-shard statistics related to Alternator.

				// While this object is alive, these metrics are also registered to be

				@@ -76,9 +78,46 @@ public:

				        utils::timed_rate_moving_average_summary_and_histogram batch_get_item_latency;

				        utils::timed_rate_moving_average_summary_and_histogram get_records_latency;

				        utils::estimated_histogram batch_get_item_histogram{22}; // a histogram that covers the range 1 - 100

				        utils::estimated_histogram batch_write_item_histogram{22}; // a histogram that covers the range 1 - 100

				        batch_histogram batch_get_item_histogram;

				        batch_histogram batch_write_item_histogram;

				    } api_operations;

				    // Operation size metrics

				    struct {

				        // Item size statistics collected per table and aggregated per node.

				        // Each histogram covers the range 0 - 512. Resolves #25143.

				        // A size is the retrieved item's size.

				        op_size_histogram get_item_op_size_kb;

				        // A size is the maximum of the new item's size and the old item's size.

				        op_size_histogram put_item_op_size_kb;

				        // A size is the deleted item's size. If the deleted item's size is

				        // unknown (i.e. read-before-write wasn't necessary and it wasn't

				        // forced by a configuration option), it won't be recorded on the

				        // histogram.

				        op_size_histogram delete_item_op_size_kb;

				        // A size is the maximum of existing item's size and the estimated size

				        // of the update. This will be changed to the maximum of the existing item's

				        // size and the new item's size in a subsequent PR.

				        op_size_histogram update_item_op_size_kb;

				        // A size is the sum of the sizes of all items per table. This means

				        // that a single BatchGetItem / BatchWriteItem updates the histogram

				        // for each table that it has items in.

				        // The sizes are the retrieved items' sizes grouped per table.

				        op_size_histogram batch_get_item_op_size_kb;

				        // The sizes are the written items' sizes grouped per table.

				        op_size_histogram batch_write_item_op_size_kb;

				    } operation_sizes;

				    // Count of authentication and authorization failures, counted if either

				    // alternator_enforce_authorization or alternator_warn_authorization are

				    // set to true. If both are false, no authentication or authorization

				    // checks are performed, so failures are not recognized or counted.

				    // "authentication" failure means the request was not signed with a valid

				    // user and key combination. "authorization" failure means the request was

				    // authenticated to a valid user - but this user did not have permissions

				    // to perform the operation (considering RBAC settings and the user's

				    // superuser status).

				    uint64_t authentication_failures = 0;

				    uint64_t authorization_failures = 0;

				    // Miscellaneous event counters

				    uint64_t total_operations = 0;

				    uint64_t unsupported_operations = 0;

				@@ -101,6 +140,22 @@ public:

				    uint64_t wcu_total[NUM_TYPES] = {0};

				    // CQL-derived stats

				    cql3::cql_stats cql_stats;

				    // Enumeration of expression types only for stats

				    // if needed it can be extended e.g. per operation

				    enum expression_types {

				        UPDATE_EXPRESSION,

				        CONDITION_EXPRESSION,

				        PROJECTION_EXPRESSION,

				        NUM_EXPRESSION_TYPES

				    };

				    struct {

				        struct {

				            uint64_t hits = 0;

				            uint64_t misses = 0;

				        } requests[NUM_EXPRESSION_TYPES];

				        uint64_t evictions = 0;

				    } expression_cache;

				};

				struct table_stats {

				@@ -110,4 +165,8 @@ struct table_stats {

				};

				void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats);

				inline uint64_t bytes_to_kb_ceil(uint64_t bytes) {

				    return (bytes + 1023) / 1024;

				}

				}
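The name bytes_to_kb_ceil() in stats.hh implies rounding up to whole kilobytes. A minimal sketch of how rounding-up division differs from plain truncating division follows. The function names here are hypothetical and chosen so that they do not clash with the helper in the patch.

```cpp
#include <cstdint>

// Truncating division: a partial kilobyte is dropped.
constexpr std::uint64_t kb_floor_sketch(std::uint64_t bytes) {
    return bytes / 1024;
}

// Rounding-up division: any partial kilobyte counts as a whole one.
// Note: bytes + 1023 would wrap around for values within 1023 of UINT64_MAX,
// which is assumed not to occur for item sizes.
constexpr std::uint64_t kb_ceil_sketch(std::uint64_t bytes) {
    return (bytes + 1023) / 1024;
}

// The two agree on exact multiples of 1024 and differ by one elsewhere.
static_assert(kb_floor_sketch(2048) == 2 && kb_ceil_sketch(2048) == 2);
static_assert(kb_floor_sketch(1) == 0 && kb_ceil_sketch(1) == 1);
static_assert(kb_floor_sketch(1025) == 1 && kb_ceil_sketch(1025) == 2);
```

With truncation, every item smaller than 1 KB would be recorded as 0 in the operation-size histograms. Rounding up keeps such items in the 1 KB bucket.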

									
										546

alternator/streams.cc
									
												
				@@ -13,7 +13,6 @@

				#include <seastar/json/formatter.hh>

				#include "auth/permission.hh"

				#include "db/config.hh"

				#include "cdc/log.hh"

				@@ -32,6 +31,9 @@

				#include "executor.hh"

				#include "data_dictionary/data_dictionary.hh"

				#include "utils/rjson.hh"

				static logging::logger elogger("alternator-streams");

				/**

				 * Base template type to implement  rapidjson::internal::TypeHelper<...>:s

				@@ -126,7 +128,7 @@ public:

				    }

				};

				}

				} // namespace alternator

				template<typename ValueType>

				struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_arn>

				@@ -296,7 +298,7 @@ sequence_number::sequence_number(std::string_view v)

				    }())

				{}

				}

				} // namespace alternator

				template<typename ValueType>

				struct rapidjson::internal::TypeHelper<ValueType, alternator::shard_id>

				@@ -356,7 +358,7 @@ static stream_view_type cdc_options_to_steam_view_type(const cdc::options& opts)

				    return type;

				}

				}

				} // namespace alternator

				template<typename ValueType>

				struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_view_type>

				@@ -428,6 +430,25 @@ using namespace std::chrono_literals;

				// Dynamo docs says no data shall live longer than 24h.

				static constexpr auto dynamodb_streams_max_window = 24h;

				// find the parent shard in previous generation for the given child shard

				// takes care of wrap-around case in vnodes

				// prev_streams must be sorted by token

				const cdc::stream_id& find_parent_shard_in_previous_generation(db_clock::time_point prev_timestamp, const utils::chunked_vector<cdc::stream_id> &prev_streams, const cdc::stream_id &child) {

				    if (prev_streams.empty()) {

				        // something is really wrong - streams are empty

				        // let's try internal_error in hope it will be notified and fixed

				        on_internal_error(elogger, fmt::format("streams are empty for cdc generation at {} ({})", prev_timestamp, prev_timestamp.time_since_epoch().count()));

				    }

				    auto it = std::lower_bound(prev_streams.begin(), prev_streams.end(), child.token(), [](const cdc::stream_id& id, const dht::token& t) {

				        return id.token() < t;

				    });

				    if (it == prev_streams.end()) {

				        // wrap around case - take first

				        it = prev_streams.begin();

				    }

				    return *it;

				}
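The lookup in find_parent_shard_in_previous_generation() above is a lower-bound search over a token-sorted sequence, falling back to the first element when the child's token lies past the last one. A minimal sketch of the same pattern on plain integers, where `find_parent_token` is a hypothetical name and `std::int64_t` stands in for `dht::token`:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Returns the first element of sorted_tokens that is >= child_token,
// wrapping around to the first element when no such element exists.
// Precondition: sorted_tokens is non-empty and sorted in ascending order.
inline std::int64_t find_parent_token(const std::vector<std::int64_t>& sorted_tokens,
                                      std::int64_t child_token) {
    assert(!sorted_tokens.empty());
    auto it = std::lower_bound(sorted_tokens.begin(), sorted_tokens.end(), child_token);
    if (it == sorted_tokens.end()) {
        // Wrap-around case: the token ring continues at the smallest token.
        it = sorted_tokens.begin();
    }
    return *it;
}
```

With parent tokens {-50, 0, 70}, a child token of -10 maps to 0, and a child token of 100 wraps around to -50.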

				future<executor::request_return_type> executor::describe_stream(client_state& client_state, service_permit permit, rjson::value request) {

				    _stats.api_operations.describe_stream++;

				@@ -475,10 +496,10 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				        } else {

				            status = "ENABLED";

				        }

				    } 

				    }

				    auto ttl = std::chrono::seconds(opts.ttl());

				    rjson::add(stream_desc, "StreamStatus", rjson::from_string(status));

				    stream_view_type type = cdc_options_to_steam_view_type(opts);

				@@ -491,7 +512,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				    if (!opts.enabled()) {

				        rjson::add(ret, "StreamDescription", std::move(stream_desc));

				        return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				        co_return rjson::print(std::move(ret));

				    }

				    // TODO: label

				@@ -502,123 +523,113 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl

				    // filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)

				    auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);

				    return _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners }).then([db, shard_start, limit, ret = std::move(ret), stream_desc = std::move(stream_desc)] (std::map<db_clock::time_point, cdc::streams_version> topologies) mutable {

				    std::map<db_clock::time_point, cdc::streams_version> topologies = co_await _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners });

				    auto e = topologies.end();

				    auto prev = e;

				    auto shards = rjson::empty_array();

				        auto e = topologies.end();

				        auto prev = e;

				        auto shards = rjson::empty_array();

				    std::optional<shard_id> last;

				        std::optional<shard_id> last;

				    auto i = topologies.begin();

				    // if we're a paged query, skip to the generation where we left off.

				    if (shard_start) {

				        i = topologies.find(shard_start->time);

				    }

				        auto i = topologies.begin();

				        // if we're a paged query, skip to the generation where we left off.

				        if (shard_start) {

				            i = topologies.find(shard_start->time);

				        }

				    // for parent-child stuff we need id:s to be sorted by token

				    // (see explanation above) since we want to find closest

				    // token boundary when determining parent.

				    // #7346 - we processed and searched children/parents in

				    // stored order, which is not necessarily token order,

				    // so the finding of "closest" token boundary (using upper bound)

				    // could give somewhat weird results.

				    static auto token_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {

				        return id1.token() < id2.token();

				    };

				        // for parent-child stuff we need id:s to be sorted by token

				        // (see explanation above) since we want to find closest

				        // token boundary when determining parent.

				        // #7346 - we processed and searched children/parents in

				        // stored order, which is not necessarily token order,

				        // so the finding of "closest" token boundary (using upper bound)

				        // could give somewhat weird results.

				        static auto token_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {

				            return id1.token() < id2.token();

				        };

				    // #7409 - shards must be returned in lexicographical order,

				    // normal bytes compare is string_traits<int8_t>::compare.

				    // thus bytes 0x8000 is less than 0x0000. By doing unsigned

				    // compare instead we inadvertently will sort in string lexical.

				    static auto id_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {

				        return compare_unsigned(id1.to_bytes(), id2.to_bytes()) < 0;

				    };

				    // need a prev even if we are skipping stuff

				    if (i != topologies.begin()) {

				        prev = std::prev(i);

				    }

				    for (; limit > 0 && i != e; prev = i, ++i) {

				        auto& [ts, sv] = *i;

				        last = std::nullopt;

				        auto lo = sv.streams.begin();

				        auto end = sv.streams.end();

				        // #7409 - shards must be returned in lexicographical order,

				        // normal bytes compare is string_traits<int8_t>::compare.

				        // thus bytes 0x8000 is less than 0x0000. By doing unsigned

				        // compare instead we inadvertently will sort in string lexical.

				        static auto id_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {

				            return compare_unsigned(id1.to_bytes(), id2.to_bytes()) < 0;

				        };

				        std::sort(lo, end, id_cmp);

				        // need a prev even if we are skipping stuff

				        if (i != topologies.begin()) {

				            prev = std::prev(i);

				        if (shard_start) {

				            // find next shard position

				            lo = std::upper_bound(lo, end, shard_start->id, id_cmp);

				            shard_start = std::nullopt;

				        }

				        for (; limit > 0 && i != e; prev = i, ++i) {

				            auto& [ts, sv] = *i;

				        if (lo != end && prev != e) {

				            // We want older stuff sorted in token order so we can find matching

				            // token range when determining parent shard.

				            std::stable_sort(prev->second.streams.begin(), prev->second.streams.end(), token_cmp);

				        }

				        auto expired = [&]() -> std::optional<db_clock::time_point> {

				            auto j = std::next(i);

				            if (j == e) {

				                return std::nullopt;

				            }

				            // add this so we sort of match potential 

				            // sequence numbers in get_records result.

				            return j->first + confidence_interval(db);

				        }();

				        while (lo != end) {

				            auto& id = *lo++;

				            auto shard = rjson::empty_object();

				            if (prev != e) {

				                auto &pid = find_parent_shard_in_previous_generation(prev->first, prev->second.streams, id);

				                rjson::add(shard, "ParentShardId", shard_id(prev->first, pid));

				            }

				            last.emplace(ts, id);

				            rjson::add(shard, "ShardId", *last);

				            auto range = rjson::empty_object();

				            rjson::add(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch())));

				            if (expired) {

				                rjson::add(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch())));

				            }

				            rjson::add(shard, "SequenceNumberRange", std::move(range));

				            rjson::push_back(shards, std::move(shard));

				            if (--limit == 0) {

				                break;

				            }

				            last = std::nullopt;

				            auto lo = sv.streams.begin();

				            auto end = sv.streams.end();

				            // #7409 - shards must be returned in lexicographical order,

				            std::sort(lo, end, id_cmp);

				            if (shard_start) {

				                // find next shard position

				                lo = std::upper_bound(lo, end, shard_start->id, id_cmp);

				                shard_start = std::nullopt;

				            }

				            if (lo != end && prev != e) {

				                // We want older stuff sorted in token order so we can find matching

				                // token range when determining parent shard.

				                std::stable_sort(prev->second.streams.begin(), prev->second.streams.end(), token_cmp);

				            }

				            auto expired = [&]() -> std::optional<db_clock::time_point> {

				                auto j = std::next(i);

				                if (j == e) {

				                    return std::nullopt;

				                }

				                // add this so we sort of match potential 

				                // sequence numbers in get_records result.

				                return j->first + confidence_interval(db);

				            }();

				            while (lo != end) {

				                auto& id = *lo++;

				                auto shard = rjson::empty_object();

				                if (prev != e) {

				                    auto& pids = prev->second.streams;

				                    auto pid = std::upper_bound(pids.begin(), pids.end(), id.token(), [](const dht::token& t, const cdc::stream_id& id) {

				                        return t < id.token();

				                    });

				                    if (pid != pids.begin()) {

				                        pid = std::prev(pid);

				                    }

				                    if (pid != pids.end()) {

				                        rjson::add(shard, "ParentShardId", shard_id(prev->first, *pid));

				                    }

				                }

				                last.emplace(ts, id);

				                rjson::add(shard, "ShardId", *last);

				                auto range = rjson::empty_object();

				                rjson::add(range, "StartingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(ts.time_since_epoch())));

				                if (expired) {

				                    rjson::add(range, "EndingSequenceNumber", sequence_number(utils::UUID_gen::min_time_UUID(expired->time_since_epoch())));

				                }

				                rjson::add(shard, "SequenceNumberRange", std::move(range));

				                rjson::push_back(shards, std::move(shard));

				                if (--limit == 0) {

				                    break;

				                }

				                last = std::nullopt;

				            }

				        }

				    }

				        if (last) {

				            rjson::add(stream_desc, "LastEvaluatedShardId", *last);

				        }

				    if (last) {

				        rjson::add(stream_desc, "LastEvaluatedShardId", *last);

				    }

				        rjson::add(stream_desc, "Shards", std::move(shards));

				        rjson::add(ret, "StreamDescription", std::move(stream_desc));

				        return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				    });

				    rjson::add(stream_desc, "Shards", std::move(shards));

				    rjson::add(ret, "StreamDescription", std::move(stream_desc));

				    co_return rjson::print(std::move(ret));

				}

				enum class shard_iterator_type {

				@@ -714,7 +725,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&

				    auto type = rjson::get<shard_iterator_type>(request, "ShardIteratorType");

				    auto seq_num = rjson::get_opt<sequence_number>(request, "SequenceNumber");

				    if (type < shard_iterator_type::TRIM_HORIZON && !seq_num) {

				        throw api_error::validation("Missing required parameter \"SequenceNumber\"");

				    }

				@@ -724,7 +735,7 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&

				    auto stream_arn = rjson::get<alternator::stream_arn>(request, "StreamArn");

				    auto db = _proxy.data_dictionary();

				    schema_ptr schema = nullptr;

				    std::optional<shard_id> sid;

				@@ -789,7 +800,7 @@ struct event_id {

				        return os;

				    }

				};

				}

				} // namespace alternator

				template<typename ValueType>

				struct rapidjson::internal::TypeHelper<ValueType, alternator::event_id>

				@@ -827,7 +838,7 @@ future<executor::request_return_type> executor::get_records(client_state& client

				    tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());

				    co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::SELECT);

				    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::SELECT, _stats);

				    db::consistency_level cl = db::consistency_level::LOCAL_QUORUM;

				    partition_key pk = iter.shard.id.to_partition_key(*schema);

				@@ -871,10 +882,12 @@ future<executor::request_return_type> executor::get_records(client_state& client

				    std::transform(pks.begin(), pks.end(), std::back_inserter(columns), [](auto& c) { return &c; });

				    std::transform(cks.begin(), cks.end(), std::back_inserter(columns), [](auto& c) { return &c; });

				    auto regular_column_start_idx = columns.size();

				    auto regular_column_filter = std::views::filter([](const column_definition& cdef) { return cdef.name() == op_column_name || cdef.name() == eor_column_name || !cdc::is_cdc_metacolumn_name(cdef.name_as_text()); });

				    std::ranges::transform(schema->regular_columns() | regular_column_filter, std::back_inserter(columns), [](auto& c) { return &c; });

				    auto regular_columns = schema->regular_columns()

				        | std::views::filter([](const column_definition& cdef) { return cdef.name() == op_column_name || cdef.name() == eor_column_name || !cdc::is_cdc_metacolumn_name(cdef.name_as_text()); })

				        | std::views::transform([&] (const column_definition& cdef) { columns.emplace_back(&cdef); return cdef.id; })

				    auto regular_columns = std::ranges::subrange(columns.begin() + regular_column_start_idx, columns.end())

				        | std::views::transform(&column_definition::id)

				        | std::ranges::to<query::column_id_vector>()

				    ;

				@@ -896,160 +909,169 @@ future<executor::request_return_type> executor::get_records(client_state& client

				    auto command = ::make_lw_shared<query::read_command>(schema->id(), schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),

				            query::tombstone_limit(_proxy.get_tombstone_limit()), query::row_limit(limit * mul));

				    co_return co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state)).then(

				            [this, schema, partition_slice = std::move(partition_slice), selection = std::move(selection), start_time = std::move(start_time), limit, key_names = std::move(key_names), attr_names = std::move(attr_names), type, iter, high_ts] (service::storage_proxy::coordinator_query_result qr) mutable {       

				        cql3::selection::result_set_builder builder(*selection, gc_clock::now());

				        query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));

				    service::storage_proxy::coordinator_query_result qr = co_await _proxy.query(schema, std::move(command), std::move(partition_ranges), cl, service::storage_proxy::coordinator_query_options(default_timeout(), std::move(permit), client_state));

				    cql3::selection::result_set_builder builder(*selection, gc_clock::now());

				    query::result_view::consume(*qr.query_result, partition_slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));

				        auto result_set = builder.build();

				        auto records = rjson::empty_array();

				    auto result_set = builder.build();

				    auto records = rjson::empty_array();

				        auto& metadata = result_set->get_metadata();

				    auto& metadata = result_set->get_metadata();

				        auto op_index = std::distance(metadata.get_names().begin(), 

				            std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {

				                return cdef->name->name() == op_column_name;

				            })

				        );

				        auto ts_index = std::distance(metadata.get_names().begin(), 

				            std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {

				                return cdef->name->name() == timestamp_column_name;

				            })

				        );

				        auto eor_index = std::distance(metadata.get_names().begin(), 

				            std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {

				                return cdef->name->name() == eor_column_name;

				            })

				        );

				    auto op_index = std::distance(metadata.get_names().begin(), 

				        std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {

				            return cdef->name->name() == op_column_name;

				        })

				    );

				    auto ts_index = std::distance(metadata.get_names().begin(), 

				        std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {

				            return cdef->name->name() == timestamp_column_name;

				        })

				    );

				    auto eor_index = std::distance(metadata.get_names().begin(), 

				        std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {

				            return cdef->name->name() == eor_column_name;

				        })

				    );

				        std::optional<utils::UUID> timestamp;

				        auto dynamodb = rjson::empty_object();

				        auto record = rjson::empty_object();

				    std::optional<utils::UUID> timestamp;

				    auto dynamodb = rjson::empty_object();

				    auto record = rjson::empty_object();

				    const auto dc_name = _proxy.get_token_metadata_ptr()->get_topology().get_datacenter();

				        using op_utype = std::underlying_type_t<cdc::operation>;

				    using op_utype = std::underlying_type_t<cdc::operation>;

				        auto maybe_add_record = [&] {

				            if (!dynamodb.ObjectEmpty()) {

				                rjson::add(record, "dynamodb", std::move(dynamodb));

				                dynamodb = rjson::empty_object();

				            }

				            if (!record.ObjectEmpty()) {

				                // TODO: awsRegion?

				                rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));

				                rjson::add(record, "eventSource", "scylladb:alternator");

				                rjson::push_back(records, std::move(record));

				                record = rjson::empty_object();

				                --limit;

				            }

				        };

				    auto maybe_add_record = [&] {

				        if (!dynamodb.ObjectEmpty()) {

				            rjson::add(record, "dynamodb", std::move(dynamodb));

				            dynamodb = rjson::empty_object();

				        }

				        if (!record.ObjectEmpty()) {

				            rjson::add(record, "awsRegion", rjson::from_string(dc_name));

				            rjson::add(record, "eventID", event_id(iter.shard.id, *timestamp));

				            rjson::add(record, "eventSource", "scylladb:alternator");

				            rjson::add(record, "eventVersion", "1.1");

				            rjson::push_back(records, std::move(record));

				            record = rjson::empty_object();

				            --limit;

				        }

				    };

				        for (auto& row : result_set->rows()) {

				            auto op = static_cast<cdc::operation>(value_cast<op_utype>(data_type_for<op_utype>()->deserialize(*row[op_index])));

				            auto ts = value_cast<utils::UUID>(data_type_for<utils::UUID>()->deserialize(*row[ts_index]));

				            auto eor = row[eor_index].has_value() ? value_cast<bool>(boolean_type->deserialize(*row[eor_index])) : false;

				    for (auto& row : result_set->rows()) {

				        auto op = static_cast<cdc::operation>(value_cast<op_utype>(data_type_for<op_utype>()->deserialize(*row[op_index])));

				        auto ts = value_cast<utils::UUID>(data_type_for<utils::UUID>()->deserialize(*row[ts_index]));

				        auto eor = row[eor_index].has_value() ? value_cast<bool>(boolean_type->deserialize(*row[eor_index])) : false;

				            if (!dynamodb.HasMember("Keys")) {

				                auto keys = rjson::empty_object();

				                describe_single_item(*selection, row, key_names, keys);

				                rjson::add(dynamodb, "Keys", std::move(keys));

				                rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());

				                rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));

				                rjson::add(dynamodb, "StreamViewType", type);

				                //TODO: SizeInBytes

				            }

				            /**

				             * We merge rows with same timestamp into a single event.

				             * This is pretty much needed, because a CDC row typically

				             * encodes ~half the info of an alternator write. 

				             * 

				             * A big, big downside to how alternator records are written

				             * (i.e. CQL), is that the distinction between INSERT and UPDATE

				             * is somewhat lost/unmappable to actual eventName. 

				             * A write (currently) always looks like an insert+modify

				             * regardless whether we wrote existing record or not. 

				             * 

				             * Maybe RMW ops could be done slightly differently so 

				             * we can distinguish them here...

				             * 

				             * For now, all writes will become MODIFY.

				             * 

				             * Note: we do not check the current pre/post

				             * flags on CDC log, instead we use data to 

				             * drive what is returned. This is (afaict)

				             * consistent with dynamo streams

				             */

				            switch (op) {

				            case cdc::operation::pre_image:

				            case cdc::operation::post_image:

				            {

				                auto item = rjson::empty_object();

				                describe_single_item(*selection, row, attr_names, item, nullptr, true);

				                describe_single_item(*selection, row, key_names, item);

				                rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));

				                break;

				            }

				            case cdc::operation::update:

				                rjson::add(record, "eventName", "MODIFY");

				                break;

				            case cdc::operation::insert:

				                rjson::add(record, "eventName", "INSERT");

				                break;

				            default:

				                rjson::add(record, "eventName", "REMOVE");

				                break;

				            }

				            if (eor) {

				                maybe_add_record();

				                timestamp = ts;

				                if (limit == 0) {

				                    break;

				                }

				            }

				        if (!dynamodb.HasMember("Keys")) {

				            auto keys = rjson::empty_object();

				            describe_single_item(*selection, row, key_names, keys);

				            rjson::add(dynamodb, "Keys", std::move(keys));

				            rjson::add(dynamodb, "ApproximateCreationDateTime", utils::UUID_gen::unix_timestamp_in_sec(ts).count());

				            rjson::add(dynamodb, "SequenceNumber", sequence_number(ts));

				            rjson::add(dynamodb, "StreamViewType", type);

				            // TODO: SizeBytes

				        }

				        auto ret = rjson::empty_object();

				        auto nrecords = records.Size();

				        rjson::add(ret, "Records", std::move(records));

				        if (nrecords != 0) {

				            // #9642. Set next iterators threshold to > last

				            shard_iterator next_iter(iter.table, iter.shard, *timestamp, false);

				            // Note that here we unconditionally return NextShardIterator,

				            // without checking if maybe we reached the end-of-shard. If the

				            // shard did end, then the next read will have nrecords == 0 and

				            // will notice the end of shard and not return NextShardIterator.

				            rjson::add(ret, "NextShardIterator", next_iter);

				            _stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);

				            return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				        /**

				         * We merge rows with same timestamp into a single event.

				         * This is pretty much needed, because a CDC row typically

				         * encodes ~half the info of an alternator write. 

				         * 

				         * A big, big downside to how alternator records are written

				         * (i.e. CQL), is that the distinction between INSERT and UPDATE

				         * is somewhat lost/unmappable to actual eventName. 

				         * A write (currently) always looks like an insert+modify

				         * regardless whether we wrote existing record or not. 

				         * 

				         * Maybe RMW ops could be done slightly differently so 

				         * we can distinguish them here...

				         * 

				         * For now, all writes will become MODIFY.

				         * 

				         * Note: we do not check the current pre/post

				         * flags on CDC log, instead we use data to 

				         * drive what is returned. This is (afaict)

				         * consistent with dynamo streams

				         */

				        switch (op) {

				        case cdc::operation::pre_image:

				        case cdc::operation::post_image:

				        {

				            auto item = rjson::empty_object();

				            describe_single_item(*selection, row, attr_names, item, nullptr, true);

				            describe_single_item(*selection, row, key_names, item);

				            rjson::add(dynamodb, op == cdc::operation::pre_image ? "OldImage" : "NewImage", std::move(item));

				            break;

				        }

				        // ugh. figure out if we are at end-of-shard

				        auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();

				        return _sdks.cdc_current_generation_timestamp({ normal_token_owners }).then([this, iter, high_ts, start_time, ret = std::move(ret)](db_clock::time_point ts) mutable {

				            auto& shard = iter.shard;            

				            if (shard.time < ts && ts < high_ts) {

				                // The DynamoDB documentation states that when a shard is

				                // closed, reading it until the end has NextShardIterator

				                // "set to null". Our test test_streams_closed_read

				                // confirms that by "null" they meant not set at all.

				            } else {

				                // We could have returned the same iterator again, but we did

				                // a search from it until high_ts and found nothing, so we

				                // can also start the next search from high_ts.

				                // TODO: but why? It's simpler just to leave the iterator be.

				                shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);

				                rjson::add(ret, "NextShardIterator", iter);

				        case cdc::operation::update:

				            rjson::add(record, "eventName", "MODIFY");

				            break;

				        case cdc::operation::insert:

				            rjson::add(record, "eventName", "INSERT");

				            break;

				        case cdc::operation::service_row_delete:

				        case cdc::operation::service_partition_delete:

				        {

				            auto user_identity = rjson::empty_object();

				            rjson::add(user_identity, "Type", "Service");

				            rjson::add(user_identity, "PrincipalId", "dynamodb.amazonaws.com");

				            rjson::add(record, "userIdentity", std::move(user_identity));

				            rjson::add(record, "eventName", "REMOVE");

				            break;

				        }

				        default:

				            rjson::add(record, "eventName", "REMOVE");

				            break;

				        }

				        if (eor) {

				            maybe_add_record();

				            timestamp = ts;

				            if (limit == 0) {

				                break;

				            }

				            _stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);

				            if (is_big(ret)) {

				                return make_ready_future<executor::request_return_type>(make_streamed(std::move(ret)));

				            }

				            return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));

				        });

				    });

				        }

				    }

				    auto ret = rjson::empty_object();

				    auto nrecords = records.Size();

				    rjson::add(ret, "Records", std::move(records));

				    if (nrecords != 0) {

				        // #9642. Set next iterators threshold to > last

				        shard_iterator next_iter(iter.table, iter.shard, *timestamp, false);

				        // Note that here we unconditionally return NextShardIterator,

				        // without checking if maybe we reached the end-of-shard. If the

				        // shard did end, then the next read will have nrecords == 0 and

				        // will notice the end of shard and not return NextShardIterator.

				        rjson::add(ret, "NextShardIterator", next_iter);

				        _stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);

				        co_return rjson::print(std::move(ret));

				    }

				    // ugh. figure out if we are at end-of-shard

				    auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();

				    db_clock::time_point ts = co_await _sdks.cdc_current_generation_timestamp({ normal_token_owners });

				    auto& shard = iter.shard;

				    if (shard.time < ts && ts < high_ts) {

				        // The DynamoDB documentation states that when a shard is

				        // closed, reading it until the end has NextShardIterator

				        // "set to null". Our test test_streams_closed_read

				        // confirms that by "null" they meant not set at all.

				    } else {

				        // We could have returned the same iterator again, but we did

				        // a search from it until high_ts and found nothing, so we

				        // can also start the next search from high_ts.

				        // TODO: but why? It's simpler just to leave the iterator be.

				        shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);

				        rjson::add(ret, "NextShardIterator", iter);

				    }

				    _stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);

				    if (is_big(ret)) {

				        co_return make_streamed(std::move(ret));

				    }

				    co_return rjson::print(std::move(ret));

				}

				bool executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {

				@@ -1059,9 +1081,7 @@ bool executor::add_stream_options(const rjson::value& stream_specification, sche

				    }

				    if (stream_enabled->GetBool()) {

				        auto db = sp.data_dictionary();

				        if (!db.features().alternator_streams) {

				        if (!sp.features().alternator_streams) {

				            throw api_error::validation("StreamSpecification: alternator streams feature not enabled in cluster.");

				        }

				@@ -1120,4 +1140,4 @@ void executor::supplement_table_stream_info(rjson::value& descr, const schema& s

				    }

				}

				}

				} // namespace alternator
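
The #7409 comment in describe_stream above says that a signed byte compare places 0x8000 before 0x0000, while an unsigned compare gives the lexicographical order the API expects. The following is a minimal standalone sketch of that difference, not code from this patch. The `bytes` alias and the `signed_less` / `unsigned_less` helpers are hypothetical stand-ins for illustration; the patch itself uses `compare_unsigned` on `cdc::stream_id::to_bytes()`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a buffer of signed bytes (assumption, not the scylla type).
using bytes = std::vector<int8_t>;

// Signed element-wise compare: 0x80 is -128 as int8_t, so it sorts before 0x00.
inline bool signed_less(const bytes& a, const bytes& b) {
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}

// Unsigned element-wise compare: each byte is reinterpreted as 0..255 first,
// so 0x80 (128) sorts after 0x00, matching plain byte-string order.
inline bool unsigned_less(const bytes& a, const bytes& b) {
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end(),
        [](int8_t x, int8_t y) {
            return static_cast<uint8_t>(x) < static_cast<uint8_t>(y);
        });
}
```

With `{0x80, 0x00}` and `{0x00, 0x00}` as inputs, `signed_less` orders the first buffer before the second, and `unsigned_less` orders it after, which is the ordering the comment asks for.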

									

alternator/ttl.cc
									
												
				@@ -17,6 +17,7 @@

				#include <seastar/core/lowres_clock.hh>

				#include <seastar/coroutine/maybe_yield.hh>

				#include "cdc/log.hh"

				#include "exceptions/exceptions.hh"

				#include "gms/gossiper.hh"

				#include "gms/inet_address.hh"

				@@ -27,7 +28,7 @@

				#include "replica/database.hh"

				#include "service/client_state.hh"

				#include "service_permit.hh"

				#include "timestamp.hh"

				#include "mutation/timestamp.hh"

				#include "service/storage_proxy.hh"

				#include "service/pager/paging_state.hh"

				#include "service/pager/query_pagers.hh"

				@@ -45,6 +46,7 @@

				#include "alternator/executor.hh"

				#include "alternator/controller.hh"

				#include "alternator/serialization.hh"

				#include "alternator/ttl_tag.hh"

				#include "dht/sharder.hh"

				#include "db/config.hh"

				#include "db/tags/utils.hh"

				@@ -56,19 +58,10 @@ static logging::logger tlogger("alternator_ttl");

				namespace alternator {

				// We write the expiration-time attribute enabled on a table using a

				// tag TTL_TAG_KEY.

				// Currently, the *value* of this tag is simply the name of the attribute,

				// and the expiration scanner interprets it as an Alternator attribute name -

				// It can refer to a real column or if that doesn't exist, to a member of

				// the ":attrs" map column. Although this is designed for Alternator, it may

				// be good enough for CQL as well (there, the ":attrs" column won't exist).

				static const sstring TTL_TAG_KEY("system:ttl_attribute");

				future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {

				    _stats.api_operations.update_time_to_live++;

				    if (!_proxy.data_dictionary().features().alternator_ttl) {

				        co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");

				    if (!_proxy.features().alternator_ttl) {

				        co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Upgrade all nodes to a version that supports it.");

				    }

				    schema_ptr schema = get_table(_proxy, request);

				@@ -92,9 +85,9 @@ future<executor::request_return_type> executor::update_time_to_live(client_state

				    if (v->GetStringLength() < 1 || v->GetStringLength() > 255) {

				        co_return api_error::validation("The length of AttributeName must be between 1 and 255");

				    }

				    sstring attribute_name(v->GetString(), v->GetStringLength());

				    sstring attribute_name = rjson::to_sstring(*v);

				    co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::ALTER);

				    co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::ALTER, _stats);

				    co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {

				        if (enabled) {

				            if (tags_map.contains(TTL_TAG_KEY)) {

				@@ -140,7 +133,7 @@ future<executor::request_return_type> executor::describe_time_to_live(client_sta

				// expiration_service is a sharded service responsible for cleaning up expired

				// items in all tables with per-item expiration enabled. Currently, this means

				// Alternator tables with TTL configured via a UpdateTimeToLive request.

				// Alternator tables with TTL configured via an UpdateTimeToLive request.

				//

				// Here is a brief overview of how the expiration service works:

				//

				@@ -292,7 +285,12 @@ static future<> expire_item(service::storage_proxy& proxy,

				        db::consistency_level::LOCAL_QUORUM,

				        executor::default_timeout(), // FIXME - which timeout?

				        qs.get_trace_state(), qs.get_permit(),

				        db::allow_per_partition_rate_limit::no);

				        db::allow_per_partition_rate_limit::no,

				        false,

				        cdc::per_request_options{

				            .is_system_originated = true,

				        }

				    );

				}

				static size_t random_offset(size_t min, size_t max) {

				@@ -318,9 +316,7 @@ static future<std::vector<std::pair<dht::token_range, locator::host_id>>> get_se

				    const auto& tm = *erm->get_token_metadata_ptr();

				    const auto& sorted_tokens = tm.sorted_tokens();

				    std::vector<std::pair<dht::token_range, locator::host_id>> ret;

				    if (sorted_tokens.empty()) {

				        on_internal_error(tlogger, "Token metadata is empty");

				    }

				    throwing_assert(!sorted_tokens.empty());

				    auto prev_tok = sorted_tokens.back();

				    for (const auto& tok : sorted_tokens) {

				        co_await coroutine::maybe_yield();

				@@ -557,7 +553,7 @@ static future<> scan_table_ranges(

				        expiration_service::stats& expiration_stats)

				{

				    const schema_ptr& s = scan_ctx.s;

				    SCYLLA_ASSERT (partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.

				    throwing_assert(partition_ranges.size() == 1); // otherwise issue #9167 will cause incorrect results.

				    auto p = service::pager::query_pagers::pager(proxy, s, scan_ctx.selection, *scan_ctx.query_state_ptr,

				            *scan_ctx.query_options, scan_ctx.command, std::move(partition_ranges), nullptr);

				    while (!p->is_exhausted()) {

				@@ -587,7 +583,7 @@ static future<> scan_table_ranges(

				            if (retries >= 10) {

				                // Don't get stuck forever asking the same page, maybe there's

				                // a bug or a real problem in several replicas. Give up on

				                // this scan an retry the scan from a random position later,

				                // this scan and retry the scan from a random position later,

				                // in the next scan period.

				                throw runtime_exception("scanner thread failed after too many timeouts for the same page");

				            }

				@@ -634,13 +630,38 @@ static future<> scan_table_ranges(

				                }

				            } else {

				                // For a real column to contain an expiration time, it

				                // must be a numeric type.

				                // FIXME: Currently we only support decimal_type (which is

				                // what Alternator uses), but other numeric types can be

				                // supported as well to make this feature more useful in CQL.

				                // Note that kind::decimal is also checked above.

				                big_decimal n = value_cast<big_decimal>(v);

				                expired = is_expired(n, now);

				                // must be a numeric type. We currently support decimal

				                // (used by Alternator TTL) as well as bigint, int and

				                // timestamp (used by CQL per-row TTL).

				                switch (meta[*expiration_column]->type->get_kind()) {

				                    case abstract_type::kind::decimal:

				                        // Used by Alternator TTL for key columns not stored

				                        // in the map. The value is in seconds, fractional

				                        // part is ignored.

				                        expired = is_expired(value_cast<big_decimal>(v), now);

				                        break;

				                    case abstract_type::kind::long_kind:

				                        // Used by CQL per-row TTL. The value is in seconds.

				                        expired = is_expired(gc_clock::time_point(std::chrono::seconds(value_cast<int64_t>(v))), now);

				                        break;

				                    case abstract_type::kind::int32:

				                        // Used by CQL per-row TTL. The value is in seconds.

				                        // Using int type is not recommended because it will

				                        // overflow in 2038, but we support it to allow users

				                        // to use existing int columns for expiration.

				                        expired = is_expired(gc_clock::time_point(std::chrono::seconds(value_cast<int32_t>(v))), now);

				                        break;

				                    case abstract_type::kind::timestamp:

				                        // Used by CQL per-row TTL. The value is in milliseconds

				                        // but we truncate it to gc_clock's precision (whole seconds).

				                        expired = is_expired(gc_clock::time_point(std::chrono::duration_cast<gc_clock::duration>(value_cast<db_clock::time_point>(v).time_since_epoch())), now);

				                        break;

				                    default:

				                        // Should never happen - we verified the column's type

				                        // before starting the scan.

				                        [[unlikely]]

				                        on_internal_error(tlogger, format("expiration scanner value of unsupported type {} in column {}", meta[*expiration_column]->type->cql3_type_name(), scan_ctx.column_name) );

				                }

				            }

				            if (expired) {

				                expiration_stats.items_deleted++;

				@@ -702,16 +723,12 @@ static future<bool> scan_table(

				        co_return false;

				    }

				    // attribute_name may be one of the schema's columns (in Alternator, this

				    // means it's a key column), or an element in Alternator's attrs map

				    // encoded in Alternator's JSON encoding.

				    // FIXME: To make this less Alternators-specific, we should encode in the

				    // single key's value three things:

				    // 1. The name of a column

				    // 2. Optionally if column is a map, a member in the map

				    // 3. The deserializer for the value: CQL or Alternator (JSON).

				    // The deserializer can be guessed: If the given column or map item is

				    // numeric, it can be used directly. If it is a "bytes" type, it needs to

				    // be deserialized using Alternator's deserializer.

				    // means a key column, in CQL it's a regular column), or an element in

				    // Alternator's attrs map encoded in Alternator's JSON encoding (which we

				    // decode). If attribute_name is a real column, in Alternator it will have

				    // the type decimal, counting seconds since the UNIX epoch, while in CQL

				    // it will be one of the types bigint or int (counting seconds) or timestamp

				    // (counting milliseconds).

				    bytes column_name = to_bytes(*attribute_name);

				    const column_definition *cd = s->get_column_definition(column_name);

				    std::optional<std::string> member;

				@@ -730,11 +747,14 @@ static future<bool> scan_table(

				    data_type column_type = cd->type;

				    // Verify that the column has the right type: If "member" exists

				    // the column must be a map, and if it doesn't, the column must

				    // (currently) be a decimal_type. If the column has the wrong type

				    // nothing can get expired in this table, and it's pointless to

				    // scan it.

				    // be decimal_type (Alternator), bigint, int or timestamp (CQL).

				    // If the column has the wrong type nothing can get expired in

				    // this table, and it's pointless to scan it.

				    if ((member && column_type->get_kind() != abstract_type::kind::map) ||

				        (!member && column_type->get_kind() != abstract_type::kind::decimal)) {

				        (!member && column_type->get_kind() != abstract_type::kind::decimal &&

				         column_type->get_kind() != abstract_type::kind::long_kind &&

				         column_type->get_kind() != abstract_type::kind::int32 &&

				         column_type->get_kind() != abstract_type::kind::timestamp)) {

				        tlogger.info("table {} TTL column has unsupported type, not scanning", s->cf_name());

				        co_return false;

				    }

				@@ -747,7 +767,7 @@ static future<bool> scan_table(

				        auto my_host_id = erm->get_topology().my_host_id();

				        const auto &tablet_map = erm->get_token_metadata().tablets().get_tablet_map(s->id());

				        for (std::optional tablet = tablet_map.first_tablet(); tablet; tablet = tablet_map.next_tablet(*tablet)) {

				            auto tablet_primary_replica = tablet_map.get_primary_replica(*tablet);

				            auto tablet_primary_replica = tablet_map.get_primary_replica(*tablet, erm->get_topology());

				            // check if this is the primary replica for the current tablet

				            if (tablet_primary_replica.host == my_host_id && tablet_primary_replica.shard == this_shard_id()) {

				                co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);

				@@ -761,7 +781,7 @@ static future<bool> scan_table(

				                // by tasking another node to take over scanning of the dead node's primary

				                // ranges. What we do here is that this node will also check expiration

				                // on its *secondary* ranges - but only those whose primary owner is down.

				                auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet); // throws if no secondary replica

				                auto tablet_secondary_replica = tablet_map.get_secondary_replica(*tablet, erm->get_topology()); // throws if no secondary replica

				                if (tablet_secondary_replica.host == my_host_id && tablet_secondary_replica.shard == this_shard_id()) {

				                    if (!gossiper.is_alive(tablet_primary_replica.host)) {

				                        co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);

				@@ -872,12 +892,10 @@ future<> expiration_service::run() {

				future<> expiration_service::start() {

				    // Called by main() on each shard to start the expiration-service

				    // thread. Just runs run() in the background and allows stop().

				    if (_db.features().alternator_ttl) {

				        if (!shutting_down()) {

				            _end = run().handle_exception([] (std::exception_ptr ep) {

				                tlogger.error("expiration_service failed: {}", ep);

				            });

				        }

				    if (!shutting_down()) {

				        _end = run().handle_exception([] (std::exception_ptr ep) {

				            tlogger.error("expiration_service failed: {}", ep);

				        });

				    }

				    return make_ready_future<>();

				}
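The new `switch` in `scan_table_ranges` normalizes every supported column type to a whole-second time point before calling `is_expired`. The following is a minimal standalone sketch of that normalization, assuming plain `std::chrono` types in place of `gc_clock` and `db_clock`. The names `seconds_point`, `from_epoch_seconds`, `from_epoch_millis` and `is_expired_sketch` are hypothetical, and the comparison shown is a simplification; the real `is_expired` may apply additional rules.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-ins for gc_clock (whole seconds) and db_clock (milliseconds).
using seconds_point = std::chrono::time_point<std::chrono::system_clock, std::chrono::seconds>;
using millis_point = std::chrono::time_point<std::chrono::system_clock, std::chrono::milliseconds>;

// bigint and int columns: the value counts seconds since the UNIX epoch.
// An int32_t value is widened implicitly, which mirrors the int32 case.
inline seconds_point from_epoch_seconds(std::int64_t secs) {
    return seconds_point(std::chrono::seconds(secs));
}

// timestamp columns: the value counts milliseconds, truncated to whole seconds.
inline seconds_point from_epoch_millis(std::int64_t ms) {
    return std::chrono::time_point_cast<std::chrono::seconds>(
        millis_point(std::chrono::milliseconds(ms)));
}

// Assumed rule for this sketch only: an item is expired once its
// expiration time is not in the future.
inline bool is_expired_sketch(seconds_point expiration, seconds_point now) {
    return expiration <= now;
}
```

Because all three CQL types end up as the same time-point type, the expiry comparison itself does not need to know which column type produced the value.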

									
										4

alternator/ttl.hh
									
												
				@@ -30,7 +30,7 @@ namespace alternator {

				// expiration_service is a sharded service responsible for cleaning up expired

				// items in all tables with per-item expiration enabled. Currently, this means

				// Alternator tables with TTL configured via a UpdateTimeToLeave request.

				// Alternator tables with TTL configured via an UpdateTimeToLive request.

				class expiration_service final : public seastar::peering_sharded_service<expiration_service> {

				public:

				    // Object holding per-shard statistics related to the expiration service.

				@@ -52,7 +52,7 @@ private:

				    data_dictionary::database _db;

				    service::storage_proxy& _proxy;

				    gms::gossiper& _gossiper;

				    // _end is set by start(), and resolves when the the background service

				    // _end is set by start(), and resolves when the background service

				    // started by it ends. To ask the background service to end, _abort_source

				    // should be triggered. stop() below uses both _abort_source and _end.

				    std::optional<future<>> _end;

									
										26

alternator/ttl_tag.hh
									
										Normal file
									
												
				@@ -0,0 +1,26 @@

				/*

				 * Copyright 2026-present ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#pragma once

				#include "seastarx.hh"

				#include <seastar/core/sstring.hh>

				namespace alternator {

				// We use the table tag TTL_TAG_KEY ("system:ttl_attribute") to remember

				// which attribute was chosen as the expiration-time attribute for

				// Alternator's TTL and CQL's per-row TTL features.

				// Currently, the *value* of this tag is simply the name of the attribute:

				// It can refer to a real column or if that doesn't exist, to a member of

				// the ":attrs" map column (which Alternator uses).

				extern const sstring TTL_TAG_KEY;

				} // namespace alternator

				// let users use TTL_TAG_KEY without the "alternator::" prefix,

				// to make it easier to move it to a different namespace later.

				using alternator::TTL_TAG_KEY;

									
										5

api/CMakeLists.txt
									
												
				@@ -31,6 +31,7 @@ set(swagger_files

				  api-doc/column_family.json

				  api-doc/commitlog.json

				  api-doc/compaction_manager.json

				  api-doc/client_routes.json

				  api-doc/config.json

				  api-doc/cql_server_test.json

				  api-doc/endpoint_snitch_info.json

				@@ -68,6 +69,7 @@ target_sources(api

				  PRIVATE

				    api.cc

				    cache_service.cc

				    client_routes.cc

				    collectd.cc

				    column_family.cc

				    commitlog.cc

				@@ -106,5 +108,8 @@ target_link_libraries(api

				    wasmtime_bindings

				    absl::headers)

				if (Scylla_USE_PRECOMPILED_HEADER_USE)

				  target_precompile_headers(api REUSE_FROM scylla-precompiled-header)

				endif()

				check_headers(check-headers api

				  GLOB_RECURSE ${CMAKE_CURRENT_SOURCE_DIR}/*.hh)

									
										2

api/api-doc/authorization_cache.json
									
												
				@@ -12,7 +12,7 @@

				      "operations":[

				        {

				          "method":"POST",

				          "summary":"Reset cache",

				          "summary":"Resets authorized prepared statements cache",

				          "type":"void",

				          "nickname":"authorization_cache_reset",

				          "produces":[

									
										23

api/api-doc/client_routes.def.json
									
										Normal file
									
												
				@@ -0,0 +1,23 @@

				    , "client_routes_entry": {

				        "id": "client_routes_entry",

				        "summary": "An entry storing client routes",

				        "properties": {

				            "connection_id": {"type": "string"},

				            "host_id": {"type": "string", "format": "uuid"},

				            "address": {"type": "string"},

				            "port": {"type": "integer"},

				            "tls_port": {"type": "integer"},

				            "alternator_port": {"type": "integer"},

				            "alternator_https_port": {"type": "integer"}

				        },

				        "required": ["connection_id", "host_id", "address"]

				    }

				    , "client_routes_key": {

				        "id": "client_routes_key",

				        "summary": "A key of client_routes_entry",

				        "properties": {

				            "connection_id": {"type": "string"},

				            "host_id": {"type": "string", "format": "uuid"}

				        }

				    }
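As an illustration of the two definitions above, here is a hypothetical C++ mirror of `client_routes_entry`, with the fields that are not listed under `"required"` modeled as `std::optional`. The struct name, the helper `has_required_fields`, and the choice of `int` for the port fields are assumptions made for this sketch; they are not taken from the ScyllaDB sources.

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical mirror of the "client_routes_entry" JSON definition.
struct client_routes_entry_sketch {
    std::string connection_id;                 // required
    std::string host_id;                       // required, UUID in string form
    std::string address;                       // required
    std::optional<int> port;                   // optional
    std::optional<int> tls_port;               // optional
    std::optional<int> alternator_port;        // optional
    std::optional<int> alternator_https_port;  // optional
};

// Returns true when every field listed under "required" is non-empty.
// This checks presence only; it does not validate the UUID format.
inline bool has_required_fields(const client_routes_entry_sketch& e) {
    return !e.connection_id.empty() && !e.host_id.empty() && !e.address.empty();
}
```

The `client_routes_key` definition would correspond to just the first two fields, `connection_id` and `host_id`.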

									
										74

api/api-doc/client_routes.json
									
										Normal file
									
												
				@@ -0,0 +1,74 @@

				    , "/v2/client-routes":{

				        "get": {

				            "description":"List all client route entries",

				            "operationId":"get_client_routes",

				            "tags":["client_routes"],

				            "produces":[

				                "application/json"

				            ],

				            "parameters":[],

				            "responses":{

				                "200":{

				                    "schema":{

				                        "type":"array",

				                        "items":{ "$ref":"#/definitions/client_routes_entry" }

				                    }

				                },

				                "default":{

				                    "description":"unexpected error",

				                    "schema":{"$ref":"#/definitions/ErrorModel"}

				                }

				            }

				        },

				        "post": {

				            "description":"Upsert one or more client route entries",

				            "operationId":"set_client_routes",

				            "tags":["client_routes"],

				            "parameters":[

				                {

				                    "name":"body",

				                    "in":"body",

				                    "required":true,

				                    "schema":{

				                        "type":"array",

				                        "items":{ "$ref":"#/definitions/client_routes_entry" }

				                    }

				                }

				            ],

				            "responses":{

				                "200":{ "description": "OK" },

				                "default":{

				                    "description":"unexpected error",

				                    "schema":{ "$ref":"#/definitions/ErrorModel" }

				                }

				            }

				        },

				        "delete": {

				            "description":"Delete one or more client route entries",

				            "operationId":"delete_client_routes",

				            "tags":["client_routes"],

				            "parameters":[

				                {

				                    "name":"body",

				                    "in":"body",

				                    "required":true,

				                    "schema":{

				                        "type":"array",

				                        "items":{ "$ref":"#/definitions/client_routes_key" }

				                    }

				                }

				            ],

				            "responses":{

				                "200":{

				                    "description": "OK"

				                },

				                "default":{

				                    "description":"unexpected error",

				                    "schema":{

				                        "$ref":"#/definitions/ErrorModel"

				                    }

				                }

				            }

				        }

				    }

									
										2

api/api-doc/messaging_service.json
									
												
				@@ -243,7 +243,7 @@

				                 "GOSSIP_DIGEST_SYN",

				                 "GOSSIP_DIGEST_ACK2",

				                 "GOSSIP_SHUTDOWN",

				                 "DEFINITIONS_UPDATE",

				                 "UNUSED__DEFINITIONS_UPDATE",

				                 "TRUNCATE",

				                 "UNUSED__REPLICATION_FINISHED",

				                 "MIGRATION_REQUEST",

									
										297

api/api-doc/storage_service.json
									
												
				@@ -220,6 +220,25 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/nodes/excluded",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Retrieve host ids of nodes which are marked as excluded",

				               "type":"array",

				               "items":{

				                  "type":"string"

				               },

				               "nickname":"get_excluded_nodes",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/nodes/joining",

				         "operations":[

				@@ -594,6 +613,50 @@

				            }

				         ]

				      },

				      {

				         "path": "/storage_service/natural_endpoints/v2/{keyspace}",

				         "operations": [

				            {

				               "method": "GET",

				               "summary":"Returns the N endpoints responsible for storing the specified key, i.e. its replicas.",

				               "type": "array",

				               "items": {

				                  "type": "string"

				               },

				               "nickname": "get_natural_endpoints_v2",

				               "produces": [

				                  "application/json"

				               ],

				               "parameters": [

				                  {

				                     "name": "keyspace",

				                     "description": "The keyspace to query about.",

				                     "required": true,

				                     "allowMultiple": false,

				                     "type": "string",

				                     "paramType": "path"

				                  },

				                  {

				                     "name": "cf",

				                     "description": "Column family name.",

				                     "required": true,

				                     "allowMultiple": false,

				                     "type": "string",

				                     "paramType": "query"

				                  },

				                  {

				                     "name": "key_component",

				                     "description": "Each component of the key for which we need to find the endpoint (e.g. ?key_component=part1&key_component=part2).",

				                     "required": true,

				                     "allowMultiple": true,

				                     "type": "string",

				                     "paramType": "query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/cdc_streams_check_and_repair",

				         "operations":[

				@@ -898,6 +961,14 @@

				                          "type":"string",

				                          "paramType":"query",

				                          "enum": ["all", "dc", "rack", "node"]

				                      },

				                      {

				                         "name":"primary_replica_only",

				                         "description":"Load the sstables and stream to the primary replica node within the scope, if one is specified. If not, stream to the global primary replica.",

				                         "required":false,

				                         "allowMultiple":false,

				                         "type":"boolean",

				                         "paramType":"query"

				                      }

				                  ]

				              }

				@@ -984,7 +1055,7 @@

				         ]

				      },

				      {

				         "path":"/storage_service/cleanup_all",

				         "path":"/storage_service/cleanup_all/",

				         "operations":[

				            {

				               "method":"POST",

				@@ -994,6 +1065,30 @@

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                    {

				                     "name":"global",

				                     "description":"true if cleanup of entire cluster is requested",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/mark_node_as_clean",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Mark the node as clean. After that the node will not be considered as needing cleanup during automatic cleanup which is triggered by some topology operations",

				               "type":"void",

				               "nickname":"reset_cleanup_needed",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[]

				            }

				         ]

				@@ -1100,6 +1195,14 @@

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name": "drop_unfixable_sstables",

				                     "description": "When set to true, drop unfixable sstables. Applies only to scrub mode SEGREGATE.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

				@@ -1192,6 +1295,45 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/logstor_compaction",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Trigger compaction of the key-value storage",

				               "type":"void",

				               "nickname":"logstor_compaction",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"major",

				                     "description":"When true, perform a major compaction",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/logstor_flush",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Trigger flush of logstor storage",

				               "type":"void",

				               "nickname":"logstor_flush",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/active_repair/",

				         "operations":[

				@@ -1519,6 +1661,30 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/exclude_node",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Marks the node as permanently down (excluded).",

				               "type":"void",

				               "nickname":"exclude_node",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"hosts",

				                     "description":"Comma-separated list of host ids to exclude",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/removal_status",

				         "operations":[

				@@ -2921,6 +3087,14 @@

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"incremental_mode",

				                     "description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				@@ -2950,6 +3124,48 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/tablets/snapshots",

				         "operations":[

				            {

				               "method":"POST",

				               "summary":"Takes the snapshot for the given keyspaces/tables. A snapshot name must be specified.",

				               "type":"void",

				               "nickname":"take_cluster_snapshot",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"tag",

				                     "description":"the tag given to the snapshot",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"keyspace",

				                     "description":"Keyspace(s) to snapshot. Multiple keyspaces can be provided using a comma-separated list. If omitted, snapshot all keyspaces.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"table",

				                     "description":"Table(s) to snapshot. Multiple tables (in a single keyspace) can be provided using a comma-separated list. If omitted, snapshot all tables in the given keyspace(s).",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/quiesce_topology",

				         "operations":[

				@@ -3052,6 +3268,38 @@

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/logstor_info",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Logstor segment information for one table",

				               "type":"table_logstor_info",

				               "nickname":"logstor_info",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[

				                  {

				                     "name":"keyspace",

				                     "description":"The keyspace",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"table",

				                     "description":"table name",

				                     "required":true,

				                     "allowMultiple":false,

				                     "type":"string",

				                     "paramType":"query"

				                  }

				               ]

				            }

				         ]

				      },

				      {

				         "path":"/storage_service/retrain_dict",

				         "operations":[

				@@ -3345,11 +3593,11 @@

				         "properties":{

				            "start_token":{

				               "type":"string",

				               "description":"The range start token"

				               "description":"The range start token (exclusive)"

				            },

				            "end_token":{

				               "type":"string",

				               "description":"The range start token"

				               "description":"The range end token (inclusive)"

				            },

				            "endpoints":{

				               "type":"array",

				@@ -3422,7 +3670,7 @@

				            "version":{

				               "type":"string",

				               "enum":[

				                  "ka", "la", "mc", "md", "me"

				                  "ka", "la", "mc", "md", "me", "ms"

				               ],

				               "description":"SSTable version"

				            },

				@@ -3460,6 +3708,47 @@

				            }

				        }

				      },

				        "logstor_hist_bucket":{

				         "id":"logstor_hist_bucket",

				         "properties":{

				            "bucket":{

				               "type":"long"

				            },

				            "count":{

				               "type":"long"

				            },

				            "min_data_size":{

				               "type":"long"

				            },

				            "max_data_size":{

				               "type":"long"

				            }

				         }

				        },

				        "table_logstor_info":{

				         "id":"table_logstor_info",

				         "description":"Per-table logstor segment distribution",

				         "properties":{

				            "keyspace":{

				               "type":"string"

				            },

				            "table":{

				               "type":"string"

				            },

				            "compaction_groups":{

				               "type":"long"

				            },

				            "segments":{

				               "type":"long"

				            },

				            "data_size_histogram":{

				               "type":"array",

				               "items":{

				                  "$ref":"logstor_hist_bucket"

				               }

				            }

				         }

				        },

				      "tablet_repair_result":{

				        "id":"tablet_repair_result",

				        "description":"Tablet repair result",

									
										15

api/api-doc/system.json
									
												
				@@ -209,6 +209,21 @@

				               "parameters":[]

				            }

				         ]

				      },

				      {

				         "path":"/system/chosen_sstable_version",

				         "operations":[

				            {

				               "method":"GET",

				               "summary":"Get sstable version currently chosen for use in new sstables",

				               "type":"string",

				               "nickname":"get_chosen_sstable_version",

				               "produces":[

				                  "application/json"

				               ],

				               "parameters":[]

				            }

				         ]

				      }

				   ]

				}

									
										2

api/api-doc/task_manager.json
									
												
				@@ -447,7 +447,7 @@

				               "description":"The number of units completed so far"

				            },

				            "children_ids":{

				               "type":"array",

				               "type":"chunked_array",

				               "items":{

				                  "type":"task_identity"

				               },

									
										8

api/api-doc/tasks.json
									
												
				@@ -42,6 +42,14 @@

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  },

				                  {

				                     "name":"consider_only_existing_data",

				                     "description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",

				                     "required":false,

				                     "allowMultiple":false,

				                     "type":"boolean",

				                     "paramType":"query"

				                  }

				               ]

				            }

									
										55

api/api.cc
									
												
				@@ -37,6 +37,7 @@

				#include "raft.hh"

				#include "gms/gossip_address_map.hh"

				#include "service_levels.hh"

				#include "client_routes.hh"

				logging::logger apilog("api");

				@@ -67,9 +68,11 @@ future<> set_server_init(http_context& ctx) {

				        rb02->set_api_doc(r);

				        rb02->register_api_file(r, "swagger20_header");

				        rb02->register_api_file(r, "metrics");

				        rb02->register_api_file(r, "client_routes");

				        rb->register_function(r, "system",

				                "The system related API");

				        rb02->add_definitions_file(r, "metrics");

				        rb02->add_definitions_file(r, "client_routes");

				        set_system(ctx, r);

				        rb->register_function(r, "error_injection",

				            "The error injection API");

				@@ -119,9 +122,9 @@ future<> unset_thrift_controller(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_thrift_controller(ctx, r); });

				}

				future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {

				    return ctx.http_server.set_routes([&ctx, &ss, &group0_client] (routes& r) {

				            set_storage_service(ctx, r, ss, group0_client);

				future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, service::raft_group0_client& group0_client) {

				    return ctx.http_server.set_routes([&ctx, &ss, &ssc, &group0_client] (routes& r) {

				            set_storage_service(ctx, r, ss, ssc, group0_client);

				        });

				}

				@@ -129,6 +132,16 @@ future<> unset_server_storage_service(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_storage_service(ctx, r); });

				}

				future<> set_server_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr) {

				    return ctx.http_server.set_routes([&ctx, &cr] (routes& r) {

				        set_client_routes(ctx, r, cr);

				    });

				}

				future<> unset_server_client_routes(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_client_routes(ctx, r); });

				}

				future<> set_load_meter(http_context& ctx, service::load_meter& lm) {

				    return ctx.http_server.set_routes([&ctx, &lm] (routes& r) { set_load_meter(ctx, r, lm); });

				}

				@@ -137,14 +150,6 @@ future<> unset_load_meter(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_load_meter(ctx, r); });

				}

				future<> set_format_selector(http_context& ctx, db::sstables_format_selector& sel) {

				    return ctx.http_server.set_routes([&ctx, &sel] (routes& r) { set_format_selector(ctx, r, sel); });

				}

				future<> unset_format_selector(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_format_selector(ctx, r); });

				}

				future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader) {

				    return ctx.http_server.set_routes([&ctx, &sst_loader] (routes& r) { set_sstables_loader(ctx, r, sst_loader); });

				}

				@@ -224,15 +229,22 @@ future<> unset_server_gossip(http_context& ctx) {

				    });

				}

				future<> set_server_column_family(http_context& ctx, sharded<db::system_keyspace>& sys_ks) {

				    return register_api(ctx, "column_family",

				                "The column family API", [&sys_ks] (http_context& ctx, routes& r) {

				                    set_column_family(ctx, r, sys_ks);

				future<> set_server_column_family(http_context& ctx, sharded<replica::database>& db) {

				    co_await register_api(ctx, "column_family",

				                "The column family API", [&db] (http_context& ctx, routes& r) {

				                    set_column_family(ctx, r, db);

				                });

				    co_await register_api(ctx, "cache_service",

				            "The cache service API", [&db] (http_context& ctx, routes& r) {

				                    set_cache_service(ctx, db, r);

				                });

				}

				future<> unset_server_column_family(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_column_family(ctx, r); });

				    return ctx.http_server.set_routes([&ctx] (routes& r) {

				        unset_column_family(ctx, r);

				        unset_cache_service(ctx, r);

				    });

				}

				future<> set_server_messaging_service(http_context& ctx, sharded<netw::messaging_service>& ms) {

				@@ -264,15 +276,6 @@ future<> unset_server_stream_manager(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_stream_manager(ctx, r); });

				}

				future<> set_server_cache(http_context& ctx) {

				    return register_api(ctx, "cache_service",

				            "The cache service API", set_cache_service);

				}

				future<> unset_server_cache(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_cache_service(ctx, r); });

				}

				future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& proxy, sharded<gms::gossiper>& g) {

				    return register_api(ctx, "hinted_handoff",

				                "The hinted handoff API", [&proxy, &g] (http_context& ctx, routes& r) {

				@@ -284,7 +287,7 @@ future<> unset_hinted_handoff(http_context& ctx) {

				    return ctx.http_server.set_routes([&ctx] (routes& r) { unset_hinted_handoff(ctx, r); });

				}

				future<> set_server_compaction_manager(http_context& ctx, sharded<compaction_manager>& cm) {

				future<> set_server_compaction_manager(http_context& ctx, sharded<compaction::compaction_manager>& cm) {

				    return register_api(ctx, "compaction_manager", "The Compaction manager API", [&cm] (http_context& ctx, routes& r) {

				        set_compaction_manager(ctx, r, cm);

				    });

									
										33

api/api.hh
									
												
				@@ -23,31 +23,6 @@

				namespace api {

				template<class T>

				std::vector<T> map_to_key_value(const std::map<sstring, sstring>& map) {

				    std::vector<T> res;

				    res.reserve(map.size());

				    for (const auto& [key, value] : map) {

				        res.push_back(T());

				        res.back().key = key;

				        res.back().value = value;

				    }

				    return res;

				}

				template<class T, class MAP>

				std::vector<T>& map_to_key_value(const MAP& map, std::vector<T>& res) {

				    res.reserve(res.size() + std::size(map));

				    for (const auto& [key, value] : map) {

				        T val;

				        val.key = fmt::to_string(key);

				        val.value = fmt::to_string(value);

				        res.push_back(val);

				    }

				    return res;

				}

				template <typename T, typename S = T>

				T map_sum(T&& dest, const S& src) {

				    for (const auto& i : src) {

				@@ -73,7 +48,7 @@ inline std::vector<sstring> split(const sstring& text, const char* separator) {

				 *

				 */

				template<class T, class F, class V>

				future<json::json_return_type>  sum_stats(distributed<T>& d, V F::*f) {

				future<json::json_return_type>  sum_stats(sharded<T>& d, V F::*f) {

				    return d.map_reduce0([f](const T& p) {return p.get_stats().*f;}, 0,

				            std::plus<V>()).then([](V val) {

				        return make_ready_future<json::json_return_type>(val);

				@@ -106,7 +81,7 @@ httpd::utils_json::rate_moving_average_and_histogram timer_to_json(const utils::

				}

				template<class T, class F>

				future<json::json_return_type>  sum_histogram_stats(distributed<T>& d, utils::timed_rate_moving_average_and_histogram F::*f) {

				future<json::json_return_type>  sum_histogram_stats(sharded<T>& d, utils::timed_rate_moving_average_and_histogram F::*f) {

				    return d.map_reduce0([f](const T& p) {return (p.get_stats().*f).hist;}, utils::ihistogram(),

				            std::plus<utils::ihistogram>()).then([](const utils::ihistogram& val) {

				@@ -115,7 +90,7 @@ future<json::json_return_type>  sum_histogram_stats(distributed<T>& d, utils::ti

				}

				template<class T, class F>

				future<json::json_return_type>  sum_timer_stats(distributed<T>& d, utils::timed_rate_moving_average_and_histogram F::*f) {

				future<json::json_return_type>  sum_timer_stats(sharded<T>& d, utils::timed_rate_moving_average_and_histogram F::*f) {

				    return d.map_reduce0([f](const T& p) {return (p.get_stats().*f).rate();}, utils::rate_moving_average_and_histogram(),

				            std::plus<utils::rate_moving_average_and_histogram>()).then([](const utils::rate_moving_average_and_histogram& val) {

				@@ -124,7 +99,7 @@ future<json::json_return_type>  sum_timer_stats(distributed<T>& d, utils::timed_

				}

				template<class T, class F>

				future<json::json_return_type>  sum_timer_stats(distributed<T>& d, utils::timed_rate_moving_average_summary_and_histogram F::*f) {

				future<json::json_return_type>  sum_timer_stats(sharded<T>& d, utils::timed_rate_moving_average_summary_and_histogram F::*f) {

				    return d.map_reduce0([f](const T& p) {return (p.get_stats().*f).rate();}, utils::rate_moving_average_and_histogram(),

				            std::plus<utils::rate_moving_average_and_histogram>()).then([](const utils::rate_moving_average_and_histogram& val) {

				        return make_ready_future<json::json_return_type>(timer_to_json(val));
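
The sum_stats helpers above select one statistics field through a pointer-to-member (V F::*f) and reduce it across shards. A minimal standalone sketch of that idea follows, assuming plain std::vector in place of seastar::sharded and using hypothetical names (stats, shard_service, sum_field) that do not exist in the ScyllaDB tree:

```cpp
#include <numeric>
#include <vector>

// Hypothetical stand-ins for a per-shard service and its stats struct.
struct stats {
    long reads = 0;
    long writes = 0;
};

struct shard_service {
    stats s;
    const stats& get_stats() const { return s; }
};

// Sum one stats field, chosen by pointer-to-member, over all "shards".
// This mirrors the shape of sum_stats but runs synchronously on a vector.
template <class T, class F, class V>
V sum_field(const std::vector<T>& shards, V F::*f) {
    return std::accumulate(shards.begin(), shards.end(), V{},
        [f](V acc, const T& p) { return acc + p.get_stats().*f; });
}
```

In the real code the reduction is asynchronous and returns a future; the sketch only shows how the member pointer picks the field to add up.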

									
										20

api/api_init.hh
									
												
				@@ -18,7 +18,9 @@

				using request = http::request;

				using reply = http::reply;

				namespace compaction {

				class compaction_manager;

				}

				namespace service {

				@@ -27,6 +29,7 @@ class storage_proxy;

				class storage_service;

				class raft_group0_client;

				class raft_group_registry;

				class client_routes_service;

				} // namespace service

				@@ -56,7 +59,6 @@ class sstables_format_selector;

				namespace view {

				class view_builder;

				}

				class system_keyspace;

				}

				namespace netw { class messaging_service; }

				class repair_service;

				@@ -83,9 +85,9 @@ struct http_context {

				    sstring api_dir;

				    sstring api_doc;

				    httpd::http_server_control http_server;

				    distributed<replica::database>& db;

				    sharded<replica::database>& db;

				    http_context(distributed<replica::database>& _db)

				    http_context(sharded<replica::database>& _db)

				            : db(_db)

				    {

				    }

				@@ -96,8 +98,10 @@ future<> set_server_config(http_context& ctx, db::config& cfg);

				future<> unset_server_config(http_context& ctx);

				future<> set_server_snitch(http_context& ctx, sharded<locator::snitch_ptr>& snitch);

				future<> unset_server_snitch(http_context& ctx);

				future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, service::raft_group0_client&);

				future<> set_server_storage_service(http_context& ctx, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>&, service::raft_group0_client&);

				future<> unset_server_storage_service(http_context& ctx);

				future<> set_server_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr);

				future<> unset_server_client_routes(http_context& ctx);

				future<> set_server_sstables_loader(http_context& ctx, sharded<sstables_loader>& sst_loader);

				future<> unset_server_sstables_loader(http_context& ctx);

				future<> set_server_view_builder(http_context& ctx, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g);

				@@ -116,7 +120,7 @@ future<> set_server_token_metadata(http_context& ctx, sharded<locator::shared_to

				future<> unset_server_token_metadata(http_context& ctx);

				future<> set_server_gossip(http_context& ctx, sharded<gms::gossiper>& g);

				future<> unset_server_gossip(http_context& ctx);

				future<> set_server_column_family(http_context& ctx, sharded<db::system_keyspace>& sys_ks);

				future<> set_server_column_family(http_context& ctx, sharded<replica::database>& db);

				future<> unset_server_column_family(http_context& ctx);

				future<> set_server_messaging_service(http_context& ctx, sharded<netw::messaging_service>& ms);

				future<> unset_server_messaging_service(http_context& ctx);

				@@ -126,9 +130,7 @@ future<> set_server_stream_manager(http_context& ctx, sharded<streaming::stream_

				future<> unset_server_stream_manager(http_context& ctx);

				future<> set_hinted_handoff(http_context& ctx, sharded<service::storage_proxy>& p, sharded<gms::gossiper>& g);

				future<> unset_hinted_handoff(http_context& ctx);

				future<> set_server_cache(http_context& ctx);

				future<> unset_server_cache(http_context& ctx);

				future<> set_server_compaction_manager(http_context& ctx, sharded<compaction_manager>& cm);

				future<> set_server_compaction_manager(http_context& ctx, sharded<compaction::compaction_manager>& cm);

				future<> unset_server_compaction_manager(http_context& ctx);

				future<> set_server_done(http_context& ctx);

				future<> set_server_task_manager(http_context& ctx, sharded<tasks::task_manager>& tm, lw_shared_ptr<db::config> cfg, sharded<gms::gossiper>& gossiper);

				@@ -141,8 +143,6 @@ future<> set_server_raft(http_context&, sharded<service::raft_group_registry>&);

				future<> unset_server_raft(http_context&);

				future<> set_load_meter(http_context& ctx, service::load_meter& lm);

				future<> unset_load_meter(http_context& ctx);

				future<> set_format_selector(http_context& ctx, db::sstables_format_selector& sel);

				future<> unset_format_selector(http_context& ctx);

				future<> set_server_cql_server_test(http_context& ctx, cql_transport::controller& ctl);

				future<> unset_server_cql_server_test(http_context& ctx);

				future<> set_server_service_levels(http_context& ctx, cql_transport::controller& ctl, sharded<cql3::query_processor>& qp);

									
										30

api/cache_service.cc
									
												
				@@ -16,7 +16,7 @@ using namespace json;

				using namespace seastar::httpd;

				namespace cs = httpd::cache_service_json;

				void set_cache_service(http_context& ctx, routes& r) {

				void set_cache_service(http_context& ctx, sharded<replica::database>& db, routes& r) {

				    cs::get_row_cache_save_period_in_seconds.set(r, [](std::unique_ptr<http::request> req) {

				        // We never save the cache

				        // Origin uses 0 for never

				@@ -204,53 +204,53 @@ void set_cache_service(http_context& ctx, routes& r) {

				        });

				    });

				    cs::get_row_hits.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [](const replica::column_family& cf) {

				    cs::get_row_hits.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(db, uint64_t(0), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.count();

				        }, std::plus<uint64_t>());

				    });

				    cs::get_row_requests.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(ctx, uint64_t(0), [](const replica::column_family& cf) {

				    cs::get_row_requests.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(db, uint64_t(0), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.count() + cf.get_row_cache().stats().misses.count();

				        }, std::plus<uint64_t>());

				    });

				    cs::get_row_hit_rate.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(ctx, ratio_holder(), [](const replica::column_family& cf) {

				    cs::get_row_hit_rate.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(db, ratio_holder(), [](const replica::column_family& cf) {

				            return ratio_holder(cf.get_row_cache().stats().hits.count() + cf.get_row_cache().stats().misses.count(),

				                    cf.get_row_cache().stats().hits.count());

				        }, std::plus<ratio_holder>());

				    });

				    cs::get_row_hits_moving_avrage.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {

				    cs::get_row_hits_moving_avrage.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf_raw(db, utils::rate_moving_average(), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				        });

				    });

				    cs::get_row_requests_moving_avrage.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf_raw(ctx, utils::rate_moving_average(), [](const replica::column_family& cf) {

				    cs::get_row_requests_moving_avrage.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf_raw(db, utils::rate_moving_average(), [](const replica::column_family& cf) {

				            return cf.get_row_cache().stats().hits.rate() + cf.get_row_cache().stats().misses.rate();

				        }, std::plus<utils::rate_moving_average>()).then([](const utils::rate_moving_average& m) {

				            return make_ready_future<json::json_return_type>(meter_to_json(m));

				        });

				    });

				    cs::get_row_size.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				    cs::get_row_size.set(r, [&db] (std::unique_ptr<http::request> req) {

				        // In origin row size is the weighted size.

				        // We currently do not support weights, so we use raw size in bytes instead

				        return ctx.db.map_reduce0([](replica::database& db) -> uint64_t {

				        return db.map_reduce0([](replica::database& db) -> uint64_t {

				            return db.row_cache_tracker().region().occupancy().used_space();

				        }, uint64_t(0), std::plus<uint64_t>()).then([](const int64_t& res) {

				            return make_ready_future<json::json_return_type>(res);

				        });

				    });

				    cs::get_row_entries.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return ctx.db.map_reduce0([](replica::database& db) -> uint64_t {

				    cs::get_row_entries.set(r, [&db] (std::unique_ptr<http::request> req) {

				        return db.map_reduce0([](replica::database& db) -> uint64_t {

				            return db.row_cache_tracker().partitions();

				        }, uint64_t(0), std::plus<uint64_t>()).then([](const int64_t& res) {

				            return make_ready_future<json::json_return_type>(res);

									
										7

api/cache_service.hh
									
												
				@@ -7,15 +7,20 @@

				 */

				#pragma once

				#include <seastar/core/sharded.hh>

				namespace seastar::httpd {

				class routes;

				}

				namespace replica {

				class database;

				}

				namespace api {

				struct http_context;

				void set_cache_service(http_context& ctx, seastar::httpd::routes& r);

				void set_cache_service(http_context& ctx, seastar::sharded<replica::database>& db, seastar::httpd::routes& r);

				void unset_cache_service(http_context& ctx, seastar::httpd::routes& r);

				}

									
										176

api/client_routes.cc
									
										Normal file
									
												
				@@ -0,0 +1,176 @@

				/*

				 * Copyright (C) 2025-present ScyllaDB

				 *

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				 #include <seastar/http/short_streams.hh>

				#include "client_routes.hh"

				#include "api/api.hh"

				#include "service/storage_service.hh"

				#include "service/client_routes.hh"

				#include "utils/rjson.hh"

				#include "api/api-doc/client_routes.json.hh"

				using namespace seastar::httpd;

				using namespace std::chrono_literals;

				using namespace json;

				extern logging::logger apilog;

				namespace api {

				static void validate_client_routes_endpoint(sharded<service::client_routes_service>& cr, sstring endpoint_name) {

				    if (!cr.local().get_feature_service().client_routes) {

				        apilog.warn("{}: called before the cluster feature was enabled", endpoint_name);

				        throw std::runtime_error(fmt::format("{} requires all nodes to support the CLIENT_ROUTES cluster feature", endpoint_name));

				    }

				}

				static sstring parse_string(const char* name, rapidjson::Value const& v) {

				    const auto it = v.FindMember(name);

				    if (it == v.MemberEnd()) {

				        throw bad_param_exception(fmt::format("Missing '{}'", name));

				    }

				    if (!it->value.IsString()) {

				        throw bad_param_exception(fmt::format("'{}' must be a string", name));

				    }

				    return {it->value.GetString(), it->value.GetStringLength()};

				}

				static std::optional<uint32_t> parse_port(const char* name, rapidjson::Value const& v) {

				    const auto it = v.FindMember(name);

				    if (it == v.MemberEnd()) {

				        return std::nullopt;

				    }

				    if (!it->value.IsInt()) {

				        throw bad_param_exception(fmt::format("'{}' must be an integer", name));

				    }

				    auto port = it->value.GetInt();

				    if (port < 1 || port > 65535) {

				        throw bad_param_exception(fmt::format("'{}' value={} is outside the allowed port range", name, port));

				    }

				    return port;

				}

				static std::vector<service::client_routes_service::client_route_entry> parse_set_client_array(const rapidjson::Document& root) {

				    if (!root.IsArray()) {

				        throw bad_param_exception("Body must be a JSON array");

				    }

				    std::vector<service::client_routes_service::client_route_entry> v;

				    v.reserve(root.GetArray().Size());

				    for (const auto& element : root.GetArray()) {

				        if (!element.IsObject()) { throw bad_param_exception("Each element must be object"); }

				        const auto port = parse_port("port", element);

				        const auto tls_port = parse_port("tls_port", element);

				        const auto alternator_port = parse_port("alternator_port", element);

				        const auto alternator_https_port = parse_port("alternator_https_port", element);

				        if (!port.has_value() && !tls_port.has_value() && !alternator_port.has_value() && !alternator_https_port.has_value()) {

				            throw bad_param_exception("At least one port field ('port', 'tls_port', 'alternator_port', 'alternator_https_port') must be specified");

				        }

				        v.emplace_back(

				            parse_string("connection_id", element),

				            utils::UUID{parse_string("host_id", element)},

				            parse_string("address", element),

				            port,

				            tls_port,

				            alternator_port,

				            alternator_https_port

				        );

				    }

				    return v;

				}

				static

				future<json::json_return_type>

				rest_set_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {

				    validate_client_routes_endpoint(cr, "rest_set_client_routes");

				    rapidjson::Document root;

				    auto content = co_await util::read_entire_stream_contiguous(*req->content_stream);

				    root.Parse(content.c_str());

				    co_await cr.local().set_client_routes(parse_set_client_array(root));

				    co_return seastar::json::json_void();

				}

				static std::vector<service::client_routes_service::client_route_key> parse_delete_client_array(const rapidjson::Document& root) {

				    if (!root.IsArray()) {

				        throw bad_param_exception("Body must be a JSON array");

				    }

				    std::vector<service::client_routes_service::client_route_key> v;

				    v.reserve(root.GetArray().Size());

				    for (const auto& element : root.GetArray()) {

				        v.emplace_back(

				            parse_string("connection_id", element),

				            utils::UUID{parse_string("host_id", element)}

				        );

				    }

				    return v;

				}

				static

				future<json::json_return_type>

				rest_delete_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {

				    validate_client_routes_endpoint(cr, "delete_client_routes");

				    rapidjson::Document root;

				    auto content = co_await util::read_entire_stream_contiguous(*req->content_stream);

				    root.Parse(content.c_str());

				    co_await cr.local().delete_client_routes(parse_delete_client_array(root));

				    co_return seastar::json::json_void();

				}

				static

				future<json::json_return_type>

				rest_get_client_routes(http_context& ctx, sharded<service::client_routes_service>& cr, std::unique_ptr<http::request> req) {

				    validate_client_routes_endpoint(cr, "get_client_routes");

				    co_return co_await cr.invoke_on(0, [] (service::client_routes_service& cr) -> future<json::json_return_type> {

				        co_return json::json_return_type(stream_range_as_array(co_await cr.get_client_routes(), [](const service::client_routes_service::client_route_entry & entry) {

				            seastar::httpd::client_routes_json::client_routes_entry obj;

				            obj.connection_id = entry.connection_id;

				            obj.host_id = fmt::to_string(entry.host_id);

				            obj.address = entry.address;

				            if (entry.port.has_value()) { obj.port = entry.port.value(); }

				            if (entry.tls_port.has_value()) { obj.tls_port = entry.tls_port.value(); }

				            if (entry.alternator_port.has_value()) { obj.alternator_port = entry.alternator_port.value(); }

				            if (entry.alternator_https_port.has_value()) { obj.alternator_https_port = entry.alternator_https_port.value(); }

				            return obj;

				        }));

				    });

				}

				void set_client_routes(http_context& ctx, routes& r, sharded<service::client_routes_service>& cr) {

				    seastar::httpd::client_routes_json::set_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {

				        return rest_set_client_routes(ctx, cr, std::move(req));

				    });

				    seastar::httpd::client_routes_json::delete_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {

				        return rest_delete_client_routes(ctx, cr, std::move(req));

				    });

				    seastar::httpd::client_routes_json::get_client_routes.set(r, [&ctx, &cr] (std::unique_ptr<seastar::http::request> req) {

				        return rest_get_client_routes(ctx, cr, std::move(req));

				    });

				}

				void unset_client_routes(http_context& ctx, routes& r) {

				    seastar::httpd::client_routes_json::set_client_routes.unset(r);

				    seastar::httpd::client_routes_json::delete_client_routes.unset(r);

				    seastar::httpd::client_routes_json::get_client_routes.unset(r);

				}

				}

									
										20

api/client_routes.hh
									
										Normal file
									
												
				@@ -0,0 +1,20 @@

				/*

				 * Copyright (C) 2025-present ScyllaDB

				 *

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#pragma once

				#include <seastar/core/sharded.hh>

				#include <seastar/json/json_elements.hh>

				#include "api/api_init.hh"

				namespace api {

				void set_client_routes(http_context& ctx, httpd::routes& r, sharded<service::client_routes_service>& cr);

				void unset_client_routes(http_context& ctx, httpd::routes& r);

				}

606

api/column_family.cc


File diff suppressed because it is too large

									
										37

api/column_family.hh
									
												
				@@ -13,24 +13,20 @@

				#include <any>

				#include "api/api_init.hh"

				namespace db {

				class system_keyspace;

				}

				namespace api {

				void set_column_family(http_context& ctx, httpd::routes& r, sharded<db::system_keyspace>& sys_ks);

				void set_column_family(http_context& ctx, httpd::routes& r, sharded<replica::database>& db);

				void unset_column_family(http_context& ctx, httpd::routes& r);

				table_info parse_table_info(const sstring& name, const replica::database& db);

				template<class Mapper, class I, class Reducer>

				future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,

				future<I> map_reduce_cf_raw(sharded<replica::database>& db, const sstring& name, I init,

				        Mapper mapper, Reducer reducer) {

				    auto uuid = parse_table_info(name, ctx.db.local()).id;

				    auto uuid = parse_table_info(name, db.local()).id;

				    using mapper_type = std::function<future<std::unique_ptr<std::any>>(replica::database&)>;

				    using reducer_type = std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)>;

				    return ctx.db.map_reduce0(mapper_type([mapper, uuid](replica::database& db) {

				    return db.map_reduce0(mapper_type([mapper, uuid](replica::database& db) {

				        return futurize_invoke([mapper, &db, uuid] {

				            return mapper(db.find_column_family(uuid));

				        }).then([] (auto result) {

				@@ -45,24 +41,22 @@ future<I> map_reduce_cf_raw(http_context& ctx, const sstring& name, I init,

				template<class Mapper, class I, class Reducer>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& name, I init,

				future<json::json_return_type> map_reduce_cf(sharded<replica::database>& db, const sstring& name, I init,

				        Mapper mapper, Reducer reducer) {

				    return map_reduce_cf_raw(ctx, name, init, mapper, reducer).then([](const I& res) {

				    return map_reduce_cf_raw(db, name, init, mapper, reducer).then([](const I& res) {

				        return make_ready_future<json::json_return_type>(res);

				    });

				}

				template<class Mapper, class I, class Reducer, class Result>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, const sstring& name, I init,

				future<json::json_return_type> map_reduce_cf(sharded<replica::database>& db, const sstring& name, I init,

				        Mapper mapper, Reducer reducer, Result result) {

				    return map_reduce_cf_raw(ctx, name, init, mapper, reducer).then([result](const I& res) mutable {

				    return map_reduce_cf_raw(db, name, init, mapper, reducer).then([result](const I& res) mutable {

				        result = res;

				        return make_ready_future<json::json_return_type>(result);

				    });

				}

				future<json::json_return_type> map_reduce_cf_time_histogram(http_context& ctx, const sstring& name, std::function<utils::time_estimated_histogram(const replica::column_family&)> f);

				struct map_reduce_column_families_locally {

				    std::any init;

				    std::function<future<std::unique_ptr<std::any>>(replica::column_family&)> mapper;

				@@ -78,7 +72,7 @@ struct map_reduce_column_families_locally {

				};

				template<class Mapper, class I, class Reducer>

				future<I> map_reduce_cf_raw(http_context& ctx, I init,

				future<I> map_reduce_cf_raw(sharded<replica::database>& db, I init,

				        Mapper mapper, Reducer reducer) {

				    using mapper_type = std::function<future<std::unique_ptr<std::any>>(replica::column_family&)>;

				    using reducer_type = std::function<std::unique_ptr<std::any>(std::unique_ptr<std::any>, std::unique_ptr<std::any>)>;

				@@ -92,7 +86,7 @@ future<I> map_reduce_cf_raw(http_context& ctx, I init,

				    auto wrapped_reducer = reducer_type([reducer = std::move(reducer)] (std::unique_ptr<std::any> a, std::unique_ptr<std::any> b) mutable {

				        return std::make_unique<std::any>(I(reducer(std::any_cast<I>(std::move(*a)), std::any_cast<I>(std::move(*b)))));

				    });

				    return ctx.db.map_reduce0(map_reduce_column_families_locally{init,

				    return db.map_reduce0(map_reduce_column_families_locally{init,

				            std::move(wrapped_mapper), wrapped_reducer}, std::make_unique<std::any>(init), wrapped_reducer).then([] (std::unique_ptr<std::any> res) {

				        return std::any_cast<I>(std::move(*res));

				    });

				@@ -100,20 +94,13 @@ future<I> map_reduce_cf_raw(http_context& ctx, I init,

				template<class Mapper, class I, class Reducer>

				future<json::json_return_type> map_reduce_cf(http_context& ctx, I init,

				future<json::json_return_type> map_reduce_cf(sharded<replica::database>& db, I init,

				        Mapper mapper, Reducer reducer) {

				    return map_reduce_cf_raw(ctx, init, mapper, reducer).then([](const I& res) {

				    return map_reduce_cf_raw(db, init, mapper, reducer).then([](const I& res) {

				        return make_ready_future<json::json_return_type>(res);

				    });

				}

				future<json::json_return_type>  get_cf_stats(http_context& ctx, const sstring& name,

				        int64_t replica::column_family_stats::*f);

				future<json::json_return_type>  get_cf_stats(http_context& ctx,

				        int64_t replica::column_family_stats::*f);

				std::tuple<sstring, sstring> parse_fully_qualified_cf_name(sstring name);

				}

									
										30

api/compaction_manager.cc
									
												
				@@ -29,9 +29,9 @@ namespace ss = httpd::storage_service_json;

				using namespace json;

				using namespace seastar::httpd;

				static future<json::json_return_type> get_cm_stats(sharded<compaction_manager>& cm,

				        int64_t compaction_manager::stats::*f) {

				    return cm.map_reduce0([f](compaction_manager& cm) {

				static future<json::json_return_type> get_cm_stats(sharded<compaction::compaction_manager>& cm,

				        int64_t compaction::compaction_manager::stats::*f) {

				    return cm.map_reduce0([f](compaction::compaction_manager& cm) {

				        return cm.get_stats().*f;

				    }, int64_t(0), std::plus<int64_t>()).then([](const int64_t& res) {

				        return make_ready_future<json::json_return_type>(res);

				@@ -47,9 +47,9 @@ static std::unordered_map<std::pair<sstring, sstring>, uint64_t, utils::tuple_ha

				    return std::move(a);

				}

				void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_manager>& cm) {

				void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction::compaction_manager>& cm) {

				    cm::get_compactions.set(r, [&cm] (std::unique_ptr<http::request> req) {

				        return cm.map_reduce0([](compaction_manager& cm) {

				        return cm.map_reduce0([](compaction::compaction_manager& cm) {

				            std::vector<cm::summary> summaries;

				            for (const auto& c : cm.get_compactions()) {

				@@ -58,7 +58,7 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				                s.ks = c.ks_name;

				                s.cf = c.cf_name;

				                s.unit = "keys";

				                s.task_type = sstables::compaction_name(c.type);

				                s.task_type = compaction::compaction_name(c.type);

				                s.completed = c.total_keys_written;

				                s.total = c.total_partitions;

				                summaries.push_back(std::move(s));

				@@ -103,22 +103,20 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				    cm::stop_compaction.set(r, [&cm] (std::unique_ptr<http::request> req) {

				        auto type = req->get_query_param("type");

				        return cm.invoke_on_all([type] (compaction_manager& cm) {

				        return cm.invoke_on_all([type] (compaction::compaction_manager& cm) {

				            return cm.stop_compaction(type);

				        }).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				    });

				    cm::stop_keyspace_compaction.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				    cm::stop_keyspace_compaction.set(r, [&ctx, &cm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto [ks_name, tables] = parse_table_infos(ctx, *req, "tables");

				        auto type = req->get_query_param("type");

				        co_await ctx.db.invoke_on_all([&] (replica::database& db) {

				            auto& cm = db.get_compaction_manager();

				        co_await cm.invoke_on_all([&] (compaction::compaction_manager& cm) {

				            return parallel_for_each(tables, [&] (const table_info& ti) {

				                auto& t = db.find_column_family(ti.id);

				                return t.parallel_foreach_compaction_group_view([&] (compaction::compaction_group_view& ts) {

				                    return cm.stop_compaction(type, &ts);

				                return cm.stop_compaction(type, [id = ti.id] (const compaction::compaction_group_view* x) {

				                    return x->schema()->id() == id;

				                });

				            });

				        });

				@@ -126,13 +124,13 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				    });

				    cm::get_pending_tasks.set(r, [&ctx] (std::unique_ptr<http::request> req) {

				        return map_reduce_cf(ctx, int64_t(0), [](replica::column_family& cf) {

				        return map_reduce_cf(ctx.db, int64_t(0), [](replica::column_family& cf) {

				            return cf.estimate_pending_compactions();

				        }, std::plus<int64_t>());

				    });

				    cm::get_completed_tasks.set(r, [&cm] (std::unique_ptr<http::request> req) {

				        return get_cm_stats(cm, &compaction_manager::stats::completed_tasks);

				        return get_cm_stats(cm, &compaction::compaction_manager::stats::completed_tasks);

				    });

				    cm::get_total_compactions_completed.set(r, [] (std::unique_ptr<http::request> req) {

				@@ -150,7 +148,7 @@ void set_compaction_manager(http_context& ctx, routes& r, sharded<compaction_man

				    });

				    cm::get_compaction_history.set(r, [&cm] (std::unique_ptr<http::request> req) {

				        std::function<future<>(output_stream<char>&&)> f = [&cm] (output_stream<char>&& out) -> future<> {

				        noncopyable_function<future<>(output_stream<char>&&)> f = [&cm] (output_stream<char>&& out) -> future<> {

				            auto s = std::move(out);

				            bool first = true;

				            std::exception_ptr ex;

									
										4

api/compaction_manager.hh
									
												
				@@ -13,11 +13,13 @@ namespace seastar::httpd {

				class routes;

				}

				namespace compaction {

				class compaction_manager;

				}

				namespace api {

				struct http_context;

				void set_compaction_manager(http_context& ctx, seastar::httpd::routes& r, seastar::sharded<compaction_manager>& cm);

				void set_compaction_manager(http_context& ctx, seastar::httpd::routes& r, seastar::sharded<compaction::compaction_manager>& cm);

				void unset_compaction_manager(http_context& ctx, seastar::httpd::routes& r);

				}

									
										9

api/error_injection.cc
									
												
				@@ -21,10 +21,10 @@ namespace hf = httpd::error_injection_json;

				void set_error_injection(http_context& ctx, routes& r) {

				    hf::enable_injection.set(r, [](std::unique_ptr<request> req) {

				    hf::enable_injection.set(r, [](std::unique_ptr<request> req) -> future<json::json_return_type> {

				        sstring injection = req->get_path_param("injection");

				        bool one_shot = req->get_query_param("one_shot") == "True";

				        auto params = req->content;

				        auto params = co_await util::read_entire_stream_contiguous(*req->content_stream);

				        const size_t max_params_size = 1024 * 1024;

				        if (params.size() > max_params_size) {

				@@ -39,12 +39,11 @@ void set_error_injection(http_context& ctx, routes& r) {

				                : rjson::parse_to_map<utils::error_injection_parameters>(params);

				            auto& errinj = utils::get_local_injector();

				            return errinj.enable_on_all(injection, one_shot, std::move(parameters)).then([] {

				                return make_ready_future<json::json_return_type>(json::json_void());

				            });

				            co_await errinj.enable_on_all(injection, one_shot, std::move(parameters));

				        } catch (const rjson::error& e) {

				            throw httpd::bad_param_exception(format("Failed to parse injections parameters: {}", e.what()));

				        }

				        co_return json::json_void();

				    });

				    hf::get_enabled_injections_on_all.set(r, [](std::unique_ptr<request> req) {

									
										6

api/raft.cc
									
												
				@@ -71,7 +71,7 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis

				        co_return json_void{};

				    });

				    r::get_leader_host.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {

				        if (!req->query_parameters.contains("group_id")) {

				        if (req->get_query_param("group_id").empty()) {

				            const auto leader_id = co_await raft_gr.invoke_on(0, [] (service::raft_group_registry& raft_gr) {

				                auto& srv = raft_gr.group0();

				                return srv.current_leader();

				@@ -100,7 +100,7 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis

				    r::read_barrier.set(r, [&raft_gr] (std::unique_ptr<http::request> req) -> future<json_return_type> {

				        auto timeout = get_request_timeout(*req);

				        if (!req->query_parameters.contains("group_id")) {

				        if (req->get_query_param("group_id").empty()) {

				            // Read barrier on group 0 by default

				            co_await raft_gr.invoke_on(0, [timeout] (service::raft_group_registry& raft_gr) -> future<> {

				                co_await raft_gr.group0_with_timeouts().read_barrier(nullptr, timeout);

				@@ -131,7 +131,7 @@ void set_raft(http_context&, httpd::routes& r, sharded<service::raft_group_regis

				        const auto stepdown_timeout_ticks = dur / service::raft_tick_interval;

				        auto timeout_dur = raft::logical_clock::duration(stepdown_timeout_ticks);

				        if (!req->query_parameters.contains("group_id")) {

				        if (req->get_query_param("group_id").empty()) {

				            // Stepdown on group 0 by default

				            co_await raft_gr.invoke_on(0, [timeout_dur] (service::raft_group_registry& raft_gr) {

				                apilog.info("Triggering stepdown for group0");

									
										16

api/storage_proxy.cc
									
												
				@@ -39,7 +39,7 @@ utils::time_estimated_histogram timed_rate_moving_average_summary_merge(utils::t

				 * @return A future that resolves to the result of the aggregation.

				 */

				template<typename V, typename Reducer, typename InnerMapper>

				future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,

				future<V> two_dimensional_map_reduce(sharded<service::storage_proxy>& d,

				        InnerMapper mapper, Reducer reducer, V initial_value) {

				    return d.map_reduce0( [mapper, reducer, initial_value] (const service::storage_proxy& sp) {

				        return map_reduce_scheduling_group_specific<service::storage_proxy_stats::stats>(

				@@ -59,7 +59,7 @@ future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,

				 * @return A future that resolves to the result of the aggregation.

				 */

				template<typename V, typename Reducer, typename F, typename C>

				future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,

				future<V> two_dimensional_map_reduce(sharded<service::storage_proxy>& d,

				        C F::*f, Reducer reducer, V initial_value) {

				    return two_dimensional_map_reduce(d, [f] (F& stats) -> V {

				        return stats.*f;

				@@ -75,20 +75,20 @@ future<V> two_dimensional_map_reduce(distributed<service::storage_proxy>& d,

				 *

				 */

				template<typename V, typename F>

				future<json::json_return_type>  sum_stats_storage_proxy(distributed<proxy>& d, V F::*f) {

				future<json::json_return_type>  sum_stats_storage_proxy(sharded<proxy>& d, V F::*f) {

				    return two_dimensional_map_reduce(d, [f] (F& stats) { return stats.*f; }, std::plus<V>(), V(0)).then([] (V val) {

				        return make_ready_future<json::json_return_type>(val);

				    });

				}

				static future<utils::rate_moving_average>  sum_timed_rate(distributed<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				static future<utils::rate_moving_average>  sum_timed_rate(sharded<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				    return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {

				        return (stats.*f).rate();

				    }, std::plus<utils::rate_moving_average>(), utils::rate_moving_average());

				}

				static future<json::json_return_type>  sum_timed_rate_as_obj(distributed<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				static future<json::json_return_type>  sum_timed_rate_as_obj(sharded<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				    return sum_timed_rate(d, f).then([](const utils::rate_moving_average& val) {

				        httpd::utils_json::rate_moving_average m;

				        m = val;

				@@ -100,7 +100,7 @@ httpd::utils_json::rate_moving_average_and_histogram get_empty_moving_average()

				    return timer_to_json(utils::rate_moving_average_and_histogram());

				}

				static future<json::json_return_type>  sum_timed_rate_as_long(distributed<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				static future<json::json_return_type>  sum_timed_rate_as_long(sharded<proxy>& d, utils::timed_rate_moving_average service::storage_proxy_stats::stats::*f) {

				    return sum_timed_rate(d, f).then([](const utils::rate_moving_average& val) {

				        return make_ready_future<json::json_return_type>(val.count);

				    });

				@@ -152,7 +152,7 @@ static future<json::json_return_type>  total_latency(sharded<service::storage_pr

				 */

				template<typename F>

				future<json::json_return_type>

				sum_histogram_stats_storage_proxy(distributed<proxy>& d,

				sum_histogram_stats_storage_proxy(sharded<proxy>& d,

				        utils::timed_rate_moving_average_summary_and_histogram F::*f) {

				    return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {

				        return (stats.*f).hist;

				@@ -172,7 +172,7 @@ sum_histogram_stats_storage_proxy(distributed<proxy>& d,

				 */

				template<typename F>

				future<json::json_return_type>

				sum_timer_stats_storage_proxy(distributed<proxy>& d,

				sum_timer_stats_storage_proxy(sharded<proxy>& d,

				        utils::timed_rate_moving_average_summary_and_histogram F::*f) {

				    return two_dimensional_map_reduce(d, [f] (service::storage_proxy_stats::stats& stats) {

									
										582

api/storage_service.cc
									
												
				@@ -17,9 +17,8 @@

				#include "gms/feature_service.hh"

				#include "schema/schema_builder.hh"

				#include "sstables/sstables_manager.hh"

				#include "utils/hash.hh"

				#include <optional>

				#include <sstream>

				#include <stdexcept>

				#include <time.h>

				#include <algorithm>

				#include <functional>

				@@ -36,6 +35,7 @@

				#include "gms/gossiper.hh"

				#include "db/system_keyspace.hh"

				#include <seastar/http/exception.hh>

				#include <seastar/http/short_streams.hh>

				#include <seastar/core/coroutine.hh>

				#include <seastar/coroutine/parallel_for_each.hh>

				#include <seastar/coroutine/exception.hh>

				@@ -174,9 +174,7 @@ std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_con

				std::pair<sstring, std::vector<table_info>> parse_table_infos(const http_context& ctx, const http::request& req, sstring cf_param_name) {

				    auto keyspace = validate_keyspace(ctx, req);

				    const auto& query_params = req.query_parameters;

				    auto it = query_params.find(cf_param_name);

				    auto tis = parse_table_infos(keyspace, ctx, it != query_params.end() ? it->second : "");

				    auto tis = parse_table_infos(keyspace, ctx, req.get_query_param(cf_param_name));

				    return std::make_pair(std::move(keyspace), std::move(tis));

				}

				@@ -198,7 +196,7 @@ static ss::token_range token_range_endpoints_to_json(const dht::token_range_endp

				    return r;

				}

				seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request) {

				seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, bool legacy_request) {

				    return q.scatter().then([&q, legacy_request] {

				        return sleep(q.duration()).then([&q, legacy_request] {

				            return q.gather(q.capacity()).then([&q, legacy_request] (auto topk_results) {

				@@ -228,37 +226,36 @@ seastar::future<json::json_return_type> run_toppartitions_query(db::toppartition

				    });

				}

				future<scrub_info> parse_scrub_options(const http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl, std::unique_ptr<http::request> req) {

				scrub_info parse_scrub_options(const http_context& ctx, std::unique_ptr<http::request> req) {

				    scrub_info info;

				    auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");

				    info.keyspace = std::move(keyspace);

				    info.column_families = table_infos | std::views::transform(&table_info::name) | std::ranges::to<std::vector>();

				    auto scrub_mode_str = req->get_query_param("scrub_mode");

				    auto scrub_mode = sstables::compaction_type_options::scrub::mode::abort;

				    auto scrub_mode = compaction::compaction_type_options::scrub::mode::abort;

				    if (scrub_mode_str.empty()) {

				        const auto skip_corrupted = validate_bool_x(req->get_query_param("skip_corrupted"), false);

				        if (skip_corrupted) {

				            scrub_mode = sstables::compaction_type_options::scrub::mode::skip;

				            scrub_mode = compaction::compaction_type_options::scrub::mode::skip;

				        }

				    } else {

				        if (scrub_mode_str == "ABORT") {

				            scrub_mode = sstables::compaction_type_options::scrub::mode::abort;

				            scrub_mode = compaction::compaction_type_options::scrub::mode::abort;

				        } else if (scrub_mode_str == "SKIP") {

				            scrub_mode = sstables::compaction_type_options::scrub::mode::skip;

				            scrub_mode = compaction::compaction_type_options::scrub::mode::skip;

				        } else if (scrub_mode_str == "SEGREGATE") {

				            scrub_mode = sstables::compaction_type_options::scrub::mode::segregate;

				            scrub_mode = compaction::compaction_type_options::scrub::mode::segregate;

				        } else if (scrub_mode_str == "VALIDATE") {

				            scrub_mode = sstables::compaction_type_options::scrub::mode::validate;

				            scrub_mode = compaction::compaction_type_options::scrub::mode::validate;

				        } else {

				            throw httpd::bad_param_exception(fmt::format("Unknown argument for 'scrub_mode' parameter: {}", scrub_mode_str));

				        }

				    }

				    if (!req_param<bool>(*req, "disable_snapshot", false) && !info.column_families.empty()) {

				        auto tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());

				        co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, tag, db::snapshot_ctl::skip_flush::no);

				        info.snapshot_tag = format("pre-scrub-{:d}", db_clock::now().time_since_epoch().count());

				    }

				    info.opts = {

				@@ -266,16 +263,23 @@ future<scrub_info> parse_scrub_options(const http_context& ctx, sharded<db::snap

				    };

				    const sstring quarantine_mode_str = req_param<sstring>(*req, "quarantine_mode", "INCLUDE");

				    if (quarantine_mode_str == "INCLUDE") {

				        info.opts.quarantine_operation_mode = sstables::compaction_type_options::scrub::quarantine_mode::include;

				        info.opts.quarantine_operation_mode = compaction::compaction_type_options::scrub::quarantine_mode::include;

				    } else if (quarantine_mode_str == "EXCLUDE") {

				        info.opts.quarantine_operation_mode = sstables::compaction_type_options::scrub::quarantine_mode::exclude;

				        info.opts.quarantine_operation_mode = compaction::compaction_type_options::scrub::quarantine_mode::exclude;

				    } else if (quarantine_mode_str == "ONLY") {

				        info.opts.quarantine_operation_mode = sstables::compaction_type_options::scrub::quarantine_mode::only;

				        info.opts.quarantine_operation_mode = compaction::compaction_type_options::scrub::quarantine_mode::only;

				    } else {

				        throw httpd::bad_param_exception(fmt::format("Unknown argument for 'quarantine_mode' parameter: {}", quarantine_mode_str));

				    }

				    co_return info;

				    if(req_param<bool>(*req, "drop_unfixable_sstables", false)) {

				        if(scrub_mode != compaction::compaction_type_options::scrub::mode::segregate) {

				            throw httpd::bad_param_exception("The 'drop_unfixable_sstables' parameter is only valid when 'scrub_mode' is 'SEGREGATE'");

				        }

				        info.opts.drop_unfixable = compaction::compaction_type_options::scrub::drop_unfixable_sstables::yes;

				    }

				    return info;

				}

				void set_transport_controller(http_context& ctx, routes& r, cql_transport::controller& ctl) {

				@@ -332,7 +336,7 @@ void set_repair(http_context& ctx, routes& r, sharded<repair_service>& repair, s

				        // Nodetool still sends those unsupported options. Ignore them to avoid failing nodetool repair.

				        static std::unordered_set<sstring> legacy_options_to_ignore = {"pullRepair", "ignoreUnreplicatedKeyspaces"};

				        for (auto& x : req->query_parameters) {

				        for (auto& x : req->get_query_params()) {

				            if (legacy_options_to_ignore.contains(x.first)) {

				                continue;

				            }

				@@ -499,17 +503,26 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&

				        auto bucket = req->get_query_param("bucket");

				        auto prefix = req->get_query_param("prefix");

				        auto scope = parse_stream_scope(req->get_query_param("scope"));

				        auto primary_replica_only = validate_bool_x(req->get_query_param("primary_replica_only"), false);

				        // TODO: the http_server backing the API does not use content streaming

				        // should use it for better performance

				        rjson::value parsed = rjson::parse(req->content);

				        rjson::chunked_content content = co_await util::read_entire_stream(*req->content_stream);

				        rjson::value parsed = rjson::parse(std::move(content));

				        if (!parsed.IsArray()) {

				            throw httpd::bad_param_exception("malformatted sstables in body");

				        }

				        auto sstables = parsed.GetArray() |

				            std::views::transform([] (const auto& s) { return sstring(rjson::to_string_view(s)); }) |

				            std::ranges::to<std::vector>();

				        auto task_id = co_await sst_loader.local().download_new_sstables(keyspace, table, prefix, std::move(sstables), endpoint, bucket, scope);

				        apilog.info("Restore invoked with following parameters: keyspace={}, table={}, endpoint={}, bucket={}, prefix={}, sstables_count={}, scope={}, primary_replica_only={}",

				                    keyspace,

				                    table,

				                    endpoint,

				                    bucket,

				                    prefix,

				                    sstables.size(),

				                    scope,

				                    primary_replica_only);

				        auto task_id = co_await sst_loader.local().download_new_sstables(keyspace, table, prefix, std::move(sstables), endpoint, bucket, scope, primary_replica_only);

				        co_return json::json_return_type(fmt::to_string(task_id));

				    });

				@@ -521,19 +534,42 @@ void unset_sstables_loader(http_context& ctx, routes& r) {

				}

				void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g) {

				    ss::view_build_statuses.set(r, [&ctx, &vb, &g] (std::unique_ptr<http::request> req) {

				    ss::view_build_statuses.set(r, [&ctx, &vb, &g] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto keyspace = validate_keyspace(ctx, req);

				        auto view = req->get_path_param("view");

				        return vb.local().view_build_statuses(std::move(keyspace), std::move(view), g.local()).then([] (std::unordered_map<sstring, sstring> status) {

				            std::vector<storage_service_json::mapper> res;

				            return make_ready_future<json::json_return_type>(map_to_key_value(std::move(status), res));

				        });

				        co_return json::json_return_type(stream_range_as_array(co_await vb.local().view_build_statuses(std::move(keyspace), std::move(view), g.local()), [] (const auto& i) {

				            storage_service_json::mapper res;

				            res.key = i.first;

				            res.value = i.second;

				            return res;

				        }));

				    });

				    cf::get_built_indexes.set(r, [&vb](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto [ks, cf_name] = parse_fully_qualified_cf_name(req->get_path_param("name"));

				        // Use of load_built_views() as filtering table should be in sync with

				        // built_indexes_virtual_reader filtering with BUILT_VIEWS table

				        std::vector<db::system_keyspace::view_name> vn = co_await vb.local().get_sys_ks().load_built_views();

				        std::set<sstring> vp;

				        for (auto b : vn) {

				            if (b.first == ks) {

				                vp.insert(b.second);

				            }

				        }

				        replica::database& db = vb.local().get_db();

				        auto uuid = validate_table(db, ks, cf_name);

				        replica::column_family& cf = db.find_column_family(uuid);

				        co_return cf.get_index_manager().list_indexes()

				                | std::views::transform([] (const auto& i) { return i.metadata().name(); })

				                | std::views::filter([&vp] (const auto& n) { return vp.contains(secondary_index::index_table_name(n)); })

				                | std::ranges::to<std::vector>();

				    });

				}

				void unset_view_builder(http_context& ctx, routes& r) {

				    ss::view_build_statuses.unset(r);

				    cf::get_built_indexes.unset(r);

				}

				static future<json::json_return_type> describe_ring_as_json(sharded<service::storage_service>& ss, sstring keyspace) {

				@@ -544,6 +580,16 @@ static future<json::json_return_type> describe_ring_as_json_for_table(const shar

				    co_return json::json_return_type(stream_range_as_array(co_await ss.local().describe_ring_for_table(keyspace, table), token_range_endpoints_to_json));

				}

				namespace {

				template <typename Key, typename Value>

				storage_service_json::mapper map_to_json(const std::pair<Key, Value>& i) {

				    storage_service_json::mapper val;

				    val.key = fmt::to_string(i.first);

				    val.value = fmt::to_string(i.second);

				    return val;

				}

				}

				static

				future<json::json_return_type>

				rest_get_token_endpoint(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				@@ -555,72 +601,13 @@ rest_get_token_endpoint(http_context& ctx, sharded<service::storage_service>& ss

				            token_endpoints = ss.local().get_token_to_endpoint_map();

				        } else if (!keyspace_name.empty() && !table_name.empty()) {

				            auto& db = ctx.db.local();

				            if (!db.has_schema(keyspace_name, table_name)) {

				                throw bad_param_exception(fmt::format("Failed to find table {}.{}", keyspace_name, table_name));

				            }

				            token_endpoints = co_await ss.local().get_tablet_to_endpoint_map(db.find_schema(keyspace_name, table_name)->id());

				            auto tid = validate_table(db, keyspace_name, table_name);

				            token_endpoints = co_await ss.local().get_tablet_to_endpoint_map(tid);

				        } else {

				            throw bad_param_exception("Either provide both keyspace and table (for tablet table) or neither (for vnodes)");

				        }

				        co_return json::json_return_type(stream_range_as_array(token_endpoints, [](const auto& i) {

				            storage_service_json::mapper val;

				            val.key = fmt::to_string(i.first);

				            val.value = fmt::to_string(i.second);

				            return val;

				        }));

				}

				static

				future<json::json_return_type>

				rest_toppartitions_generic(http_context& ctx, std::unique_ptr<http::request> req) {

				        bool filters_provided = false;

				        std::unordered_set<std::tuple<sstring, sstring>, utils::tuple_hash> table_filters {};

				        if (req->query_parameters.contains("table_filters")) {

				            filters_provided = true;

				            auto filters = req->get_query_param("table_filters");

				            std::stringstream ss { filters };

				            std::string filter;

				            while (!filters.empty() && ss.good()) {

				                std::getline(ss, filter, ',');

				                table_filters.emplace(parse_fully_qualified_cf_name(filter));

				            }

				        }

				        std::unordered_set<sstring> keyspace_filters {};

				        if (req->query_parameters.contains("keyspace_filters")) {

				            filters_provided = true;

				            auto filters = req->get_query_param("keyspace_filters");

				            std::stringstream ss { filters };

				            std::string filter;

				            while (!filters.empty() && ss.good()) {

				                std::getline(ss, filter, ',');

				                keyspace_filters.emplace(std::move(filter));

				            }

				        }

				        // when the query is empty return immediately

				        if (filters_provided && table_filters.empty() && keyspace_filters.empty()) {

				            apilog.debug("toppartitions query: processing results");

				            httpd::column_family_json::toppartitions_query_results results;

				            results.read_cardinality = 0;

				            results.write_cardinality = 0;

				            return make_ready_future<json::json_return_type>(results);

				        }

				        api::req_param<std::chrono::milliseconds, unsigned> duration{*req, "duration", 1000ms};

				        api::req_param<unsigned> capacity(*req, "capacity", 256);

				        api::req_param<unsigned> list_size(*req, "list_size", 10);

				        apilog.info("toppartitions query: #table_filters={} #keyspace_filters={} duration={} list_size={} capacity={}",

				            !table_filters.empty() ? std::to_string(table_filters.size()) : "all", !keyspace_filters.empty() ? std::to_string(keyspace_filters.size()) : "all", duration.value, list_size.value, capacity.value);

				        return seastar::do_with(db::toppartitions_query(ctx.db, std::move(table_filters), std::move(keyspace_filters), duration.value, list_size, capacity), [&ctx] (db::toppartitions_query& q) {

				            return run_toppartitions_query(q, ctx);

				        });

				        co_return json::json_return_type(stream_range_as_array(token_endpoints, &map_to_json<dht::token, gms::inet_address>));

				}

				static

				@@ -646,21 +633,15 @@ future<json::json_return_type>

				rest_get_range_to_endpoint_map(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        auto keyspace = validate_keyspace(ctx, req);

				        auto table = req->get_query_param("cf");

				        std::optional<table_id> table_id;

				        auto erm = std::invoke([&]() -> locator::effective_replication_map_ptr {

				            auto& ks = ctx.db.local().find_keyspace(keyspace);

				            if (table.empty()) {

				                ensure_tablets_disabled(ctx, keyspace, "storage_service/range_to_endpoint_map");

				                return ks.get_static_effective_replication_map();

				            } else {

				                auto table_id = validate_table(ctx.db.local(), keyspace, table);

				                auto& cf = ctx.db.local().find_column_family(table_id);

				                return cf.get_effective_replication_map();

				            }

				        });

				        if (table.empty()) {

				            ensure_tablets_disabled(ctx, keyspace, "storage_service/range_to_endpoint_map");

				        } else {

				            table_id = validate_table(ctx.db.local(), keyspace, table);

				        }

				        std::vector<ss::maplist_mapper> res;

				        co_return stream_range_as_array(co_await ss.local().get_range_to_address_map(erm),

				        co_return stream_range_as_array(co_await ss.local().get_range_to_address_map(keyspace, table_id),

				                [](const std::pair<dht::token_range, inet_address_vector_replica_set>& entry){

				            ss::maplist_mapper m;

				            if (entry.first.start()) {

				@@ -705,12 +686,6 @@ rest_describe_ring(http_context& ctx, sharded<service::storage_service>& ss, std

				        return describe_ring_as_json(ss, validate_keyspace(ctx, req));

				}

				static

				future<json::json_return_type>

				rest_get_load(http_context& ctx, std::unique_ptr<http::request> req) {

				        return get_cf_stats(ctx, &replica::column_family_stats::live_disk_space_used);

				}

				static

				future<json::json_return_type>

				rest_get_current_generation_number(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				@@ -728,6 +703,14 @@ rest_get_natural_endpoints(http_context& ctx, sharded<service::storage_service>&

				        return res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>();

				}

				static

				json::json_return_type

				rest_get_natural_endpoints_v2(http_context& ctx, sharded<service::storage_service>& ss, const_req req) {

				        auto keyspace = validate_keyspace(ctx, req);

				        auto res = ss.local().get_natural_endpoints(keyspace, req.get_query_param("cf"), req.get_query_param_array("key_component"));

				        return res | std::views::transform([] (auto& ep) { return fmt::to_string(ep); }) | std::ranges::to<std::vector>();

				}

				static

				future<json::json_return_type>

				rest_cdc_streams_check_and_repair(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				@@ -738,115 +721,44 @@ rest_cdc_streams_check_and_repair(sharded<service::storage_service>& ss, std::un

				        });

				}

				static

				future<json::json_return_type>

				rest_force_compaction(http_context& ctx, std::unique_ptr<http::request> req) {

				        auto& db = ctx.db;

				        auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);

				        auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);

				        apilog.info("force_compaction: flush={} consider_only_existing_data={}", flush, consider_only_existing_data);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        std::optional<flush_mode> fmopt;

				        if (!flush && !consider_only_existing_data) {

				            fmopt = flush_mode::skip;

				        }

				        auto task = co_await compaction_module.make_and_start_task<global_major_compaction_task_impl>({}, db, fmopt, consider_only_existing_data);

				        co_await task->done();

				        co_return json_void();

				}

				static

				future<json::json_return_type>

				rest_force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request> req) {

				        auto& db = ctx.db;

				        auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");

				        auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);

				        auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);

				        apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        std::optional<flush_mode> fmopt;

				        if (!flush && !consider_only_existing_data) {

				            fmopt = flush_mode::skip;

				        }

				        auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);

				        co_await task->done();

				        co_return json_void();

				}

				static

				future<json::json_return_type>

				rest_force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        auto& db = ctx.db;

				        auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				        const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();

				        if (rs.is_local() || !rs.is_vnode_based()) {

				            auto reason = rs.is_local() ? "require" : "support";

				            apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);

				            co_return json::json_return_type(0);

				        }

				        apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);

				        if (!co_await ss.local().is_cleanup_allowed(keyspace)) {

				            auto msg = "Can not perform cleanup operation when topology changes";

				            apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);

				            co_await coroutine::return_exception(std::runtime_error(msg));

				        }

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<cleanup_keyspace_compaction_task_impl>(

				            {}, std::move(keyspace), db, table_infos, flush_mode::all_tables, tasks::is_user_task::yes);

				        co_await task->done();

				        co_return json::json_return_type(0);

				}

				static

				future<json::json_return_type>

				rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        apilog.info("cleanup_all");

				        auto done = co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<bool> {

				            if (!ss.is_topology_coordinator_enabled()) {

				                co_return false;

				            }

				            co_await ss.do_cluster_cleanup();

				            co_return true;

				        });

				        if (done) {

				        bool global = true;

				        if (auto global_param = req->get_query_param("global"); !global_param.empty()) {

				            global = validate_bool(global_param);

				        }

				        apilog.info("cleanup_all global={}", global);

				        if (global) {

				            co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<> {

				                co_return co_await ss.do_clusterwide_vnodes_cleanup();

				            });

				            co_return json::json_return_type(0);

				        }

				        // fall back to the local global cleanup if topology coordinator is not enabled

				        // fall back to the local cleanup if local cleanup is requested

				        auto& db = ctx.db;

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<global_cleanup_compaction_task_impl>({}, db);

				        auto task = co_await compaction_module.make_and_start_task<compaction::global_cleanup_compaction_task_impl>({}, db);

				        co_await task->done();

				        // Mark this node as clean

				        co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<> {

				            co_await ss.reset_cleanup_needed();

				        });

				        co_return json::json_return_type(0);

				}

				static

				future<json::json_return_type>

				rest_perform_keyspace_offstrategy_compaction(http_context& ctx, std::unique_ptr<http::request> req) {

				        auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				        apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, table_infos);

				        bool res = false;

				        auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, &res);

				        co_await task->done();

				        co_return json::json_return_type(res);

				}

				static

				future<json::json_return_type>

				rest_upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req) {

				        auto& db = ctx.db;

				        auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				        bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);

				        apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);

				        co_await task->done();

				        co_return json::json_return_type(0);

				rest_reset_cleanup_needed(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        apilog.info("reset_cleanup_needed");

				        co_await ss.invoke_on(0, [] (service::storage_service& ss) {

				            return ss.reset_cleanup_needed();

				        });

				        co_return json_void();

				}

				static

				@@ -871,9 +783,31 @@ rest_force_keyspace_flush(http_context& ctx, std::unique_ptr<http::request> req)

				static

				future<json::json_return_type>

				rest_decommission(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				rest_logstor_compaction(http_context& ctx, std::unique_ptr<http::request> req) {

				        bool major = false;

				        if (auto major_param = req->get_query_param("major"); !major_param.empty()) {

				            major = validate_bool(major_param);

				        }

				        apilog.info("logstor_compaction: major={}", major);

				        auto& db = ctx.db;

				        co_await replica::database::trigger_logstor_compaction_on_all_shards(db, major);

				        co_return json_void();

				}

				static

				future<json::json_return_type>

				rest_logstor_flush(http_context& ctx, std::unique_ptr<http::request> req) {

				        apilog.info("logstor_flush");

				        auto& db = ctx.db;

				        co_await replica::database::flush_logstor_separator_on_all_shards(db);

				        co_return json_void();

				}

				static

				future<json::json_return_type>

				rest_decommission(sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, std::unique_ptr<http::request> req) {

				        apilog.info("decommission");

				        return ss.local().decommission().then([] {

				        return ss.local().decommission(ssc).then([] {

				            return make_ready_future<json::json_return_type>(json_void());

				        });

				}

				@@ -911,6 +845,25 @@ rest_remove_node(sharded<service::storage_service>& ss, std::unique_ptr<http::re

				        });

				}

				static

				future<json::json_return_type>

				rest_exclude_node(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				    auto hosts = utils::split_comma_separated_list(req->get_query_param("hosts"))

				        | std::views::transform([] (const sstring& s) { return locator::host_id(utils::UUID(s)); })

				        | std::ranges::to<std::vector<locator::host_id>>();

				    auto& topo = ss.local().get_token_metadata().get_topology();

				    for (auto host : hosts) {

				        if (!topo.has_node(host)) {

				            throw bad_param_exception(fmt::format("Host ID {} does not belong to this cluster", host));

				        }

				    }

				    apilog.info("exclude_node: hosts={}", hosts);

				    co_await ss.local().mark_excluded(hosts);

				    co_return json_void();

				}

				static

				future<json::json_return_type>

				rest_get_removal_status(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				@@ -968,9 +921,9 @@ rest_is_starting(sharded<service::storage_service>& ss, std::unique_ptr<http::re

				static

				future<json::json_return_type>

				rest_get_drain_progress(http_context& ctx, std::unique_ptr<http::request> req) {

				        return ctx.db.map_reduce(adder<replica::database::drain_progress>(), [] (auto& db) {

				            return db.get_drain_progress();

				rest_get_drain_progress(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        return ss.map_reduce(adder<replica::database::drain_progress>(), [] (auto& ss) {

				            return ss.get_database().get_drain_progress();

				        }).then([] (auto&& progress) {

				            auto progress_str = format("Drained {}/{} ColumnFamilies", progress.remaining_cfs, progress.total_cfs);

				            return make_ready_future<json::json_return_type>(std::move(progress_str));

				@@ -986,28 +939,6 @@ rest_drain(sharded<service::storage_service>& ss, std::unique_ptr<http::request>

				        });

				}

				static

				json::json_return_type

				rest_get_keyspaces(http_context& ctx, const_req req) {

				        auto type = req.get_query_param("type");

				        auto replication = req.get_query_param("replication");

				        std::vector<sstring> keyspaces;

				        if (type == "user") {

				            keyspaces = ctx.db.local().get_user_keyspaces();

				        } else if (type == "non_local_strategy") {

				            keyspaces = ctx.db.local().get_non_local_strategy_keyspaces();

				        } else {

				            keyspaces = ctx.db.local().get_all_keyspaces();

				        }

				        if (replication.empty() || replication == "all") {

				            return keyspaces;

				        }

				        const auto want_tablets = replication == "tablets";

				        return keyspaces | std::views::filter([&ctx, want_tablets] (const sstring& ks) {

				            return ctx.db.local().find_keyspace(ks).get_replication_strategy().uses_tablets() == want_tablets;

				        }) | std::ranges::to<std::vector>();

				}

				static

				future<json::json_return_type>

				rest_stop_gossiping(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				@@ -1324,12 +1255,6 @@ rest_set_hinted_handoff_throttle_in_kb(std::unique_ptr<http::request> req) {

				        return make_ready_future<json::json_return_type>(json_void());

				}

				static

				future<json::json_return_type>

				rest_get_metrics_load(http_context& ctx, std::unique_ptr<http::request> req) {

				        return get_cf_stats(ctx, &replica::column_family_stats::live_disk_space_used);

				}

				static

				json::json_return_type

				rest_get_exceptions(sharded<service::storage_service>& ss, const_req req) {

				@@ -1359,10 +1284,7 @@ rest_get_ownership(http_context& ctx, sharded<service::storage_service>& ss, std

				            throw httpd::bad_param_exception("storage_service/ownership cannot be used when a keyspace uses tablets");

				        }

				        return ss.local().get_ownership().then([] (auto&& ownership) {

				            std::vector<storage_service_json::mapper> res;

				            return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));

				        });

				        co_return json::json_return_type(stream_range_as_array(co_await ss.local().get_ownership(), &map_to_json<gms::inet_address, float>));

				}

				static

				@@ -1379,10 +1301,7 @@ rest_get_effective_ownership(http_context& ctx, sharded<service::storage_service

				            }

				        }

				        return ss.local().effective_ownership(keyspace_name, table_name).then([] (auto&& ownership) {

				            std::vector<storage_service_json::mapper> res;

				            return make_ready_future<json::json_return_type>(map_to_key_value(ownership, res));

				        });

				        co_return json::json_return_type(stream_range_as_array(co_await ss.local().effective_ownership(keyspace_name, table_name), &map_to_json<gms::inet_address, float>));

				}

				static

				@@ -1392,7 +1311,7 @@ rest_estimate_compression_ratios(http_context& ctx, sharded<service::storage_ser

				        apilog.warn("estimate_compression_ratios: called before the cluster feature was enabled");

				        throw std::runtime_error("estimate_compression_ratios requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");

				    }

				    auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);

				    auto ticket = co_await get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);

				    auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;

				    auto cf = api::req_param<sstring>(*req, "cf", {}).value;

				    apilog.debug("estimate_compression_ratios: called with ks={} cf={}", ks, cf);

				@@ -1458,7 +1377,7 @@ rest_retrain_dict(http_context& ctx, sharded<service::storage_service>& ss, serv

				        apilog.warn("retrain_dict: called before the cluster feature was enabled");

				        throw std::runtime_error("retrain_dict requires all nodes to support the SSTABLE_COMPRESSION_DICTS cluster feature");

				    }

				    auto ticket = get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);

				    auto ticket = co_await get_units(ss.local().get_do_sample_sstables_concurrency_limiter(), 1);

				    auto ks = api::req_param<sstring>(*req, "keyspace", {}).value;

				    auto cf = api::req_param<sstring>(*req, "cf", {}).value;

				    apilog.debug("retrain_dict: called with ks={} cf={}", ks, cf);

				@@ -1604,6 +1523,54 @@ rest_sstable_info(http_context& ctx, std::unique_ptr<http::request> req) {

				        });

				}

				static

				future<json::json_return_type>

				rest_logstor_info(http_context& ctx, std::unique_ptr<http::request> req) {

				        auto keyspace = api::req_param<sstring>(*req, "keyspace", {}).value;

				        auto table = api::req_param<sstring>(*req, "table", {}).value;

				        if (table.empty()) {

				            table = api::req_param<sstring>(*req, "cf", {}).value;

				        }

				        if (keyspace.empty()) {

				            throw bad_param_exception("The query parameter 'keyspace' is required");

				        }

				        if (table.empty()) {

				            throw bad_param_exception("The query parameter 'table' is required");

				        }

				        keyspace = validate_keyspace(ctx, keyspace);

				        auto tid = validate_table(ctx.db.local(), keyspace, table);

				        auto& cf = ctx.db.local().find_column_family(tid);

				        if (!cf.uses_logstor()) {

				            throw bad_param_exception(fmt::format("Table {}.{} does not use logstor", keyspace, table));

				        }

				        return do_with(replica::logstor::table_segment_stats{}, [keyspace = std::move(keyspace), table = std::move(table), tid, &ctx] (replica::logstor::table_segment_stats& merged_stats) {

				            return ctx.db.map_reduce([&merged_stats](replica::logstor::table_segment_stats&& shard_stats) {

				                merged_stats += shard_stats;

				            }, [tid](const replica::database& db) {

				                return db.get_logstor_table_segment_stats(tid);

				            }).then([&merged_stats, keyspace = std::move(keyspace), table = std::move(table)] {

				                ss::table_logstor_info result;

				                result.keyspace = keyspace;

				                result.table = table;

				                result.compaction_groups = merged_stats.compaction_group_count;

				                result.segments = merged_stats.segment_count;

				                for (const auto& bucket : merged_stats.histogram) {

				                    ss::logstor_hist_bucket hist;

				                    hist.count = bucket.count;

				                    hist.max_data_size = bucket.max_data_size;

				                    result.data_size_histogram.push(std::move(hist));

				                }

				                return make_ready_future<json::json_return_type>(stream_object(result));

				            });

				        });

				}

				static

				future<json::json_return_type>

				rest_reload_raft_topology_state(sharded<service::storage_service>& ss, service::raft_group0_client& group0_client, std::unique_ptr<http::request> req) {

				@@ -1616,26 +1583,14 @@ rest_reload_raft_topology_state(sharded<service::storage_service>& ss, service::

				static

				future<json::json_return_type>

				rest_upgrade_to_raft_topology(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        apilog.info("Requested to schedule upgrade to raft topology");

				        try {

				            co_await ss.invoke_on(0, [] (auto& ss) {

				                return ss.start_upgrade_to_raft_topology();

				            });

				        } catch (...) {

				            auto ex = std::current_exception();

				            apilog.error("Failed to schedule upgrade to raft topology: {}", ex);

				            std::rethrow_exception(std::move(ex));

				        }

				        apilog.info("Requested to schedule upgrade to raft topology, but this version does not need it since it uses raft topology by default.");

				        co_return json_void();

				}

				static

				future<json::json_return_type>

				rest_raft_topology_upgrade_status(sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				        const auto ustate = co_await ss.invoke_on(0, [] (auto& ss) {

				            return ss.get_topology_upgrade_state();

				        });

				        co_return sstring(format("{}", ustate));

				        co_return sstring("done");

				}

				static

				@@ -1730,6 +1685,10 @@ rest_repair_tablet(http_context& ctx, sharded<service::storage_service>& ss, std

				        if (!await.empty()) {

				            await_completion = validate_bool(await);

				        }

				        // Use regular mode if the incremental_mode option is not provided by user.

				        auto incremental = req->get_query_param("incremental_mode");

				        auto incremental_mode = incremental.empty() ? locator::default_tablet_repair_incremental_mode : locator::tablet_repair_incremental_mode_from_string(incremental);

				        auto table_id = validate_table(ctx.db.local(), ks, table);

				        std::variant<utils::chunked_vector<dht::token>, service::storage_service::all_tokens_tag> tokens_variant;

				        if (all_tokens) {

				@@ -1752,8 +1711,12 @@ rest_repair_tablet(http_context& ctx, sharded<service::storage_service>& ss, std

				            }) | std::ranges::to<std::unordered_set>();

				        }

				        auto dcs_filter = locator::tablet_task_info::deserialize_repair_dcs_filter(dcs);

				        auto res = co_await ss.local().add_repair_tablet_request(table_id, tokens_variant, hosts_filter, dcs_filter, await_completion);

				        co_return json::json_return_type(res);

				        try {

				            auto res = co_await ss.local().add_repair_tablet_request(table_id, tokens_variant, hosts_filter, dcs_filter, await_completion, incremental_mode);

				            co_return json::json_return_type(res);

				        } catch (std::invalid_argument& e) {

				            throw httpd::bad_param_exception(e.what());

				        }

				}

				static

				@@ -1794,8 +1757,7 @@ rest_drop_quarantined_sstables(http_context& ctx, sharded<service::storage_servi

				    try {

				        if (!keyspace.empty()) {

				            keyspace = validate_keyspace(ctx, keyspace);

				            auto it = req->query_parameters.find("tables");

				            auto table_infos = parse_table_infos(keyspace, ctx, it != req->query_parameters.end() ? it->second : "");

				            auto table_infos = parse_table_infos(keyspace, ctx, req->get_query_param("tables"));

				            co_await ctx.db.invoke_on_all([&table_infos](replica::database& db) -> future<> {

				                return parallel_for_each(table_infos, [&db](const auto& table) -> future<> {

				@@ -1838,39 +1800,36 @@ rest_bind(FuncType func, BindArgs&... args) {

				    return std::bind_front(func, std::ref(args)...);

				}

				void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, service::raft_group0_client& group0_client) {

				void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, service::raft_group0_client& group0_client) {

				    ss::get_token_endpoint.set(r, rest_bind(rest_get_token_endpoint, ctx, ss));

				    ss::toppartitions_generic.set(r, rest_bind(rest_toppartitions_generic, ctx));

				    ss::get_release_version.set(r, rest_bind(rest_get_release_version, ss));

				    ss::get_scylla_release_version.set(r, rest_bind(rest_get_scylla_release_version, ss));

				    ss::get_schema_version.set(r, rest_bind(rest_get_schema_version, ss));

				    ss::get_range_to_endpoint_map.set(r, rest_bind(rest_get_range_to_endpoint_map, ctx, ss));

				    ss::get_pending_range_to_endpoint_map.set(r, rest_bind(rest_get_pending_range_to_endpoint_map, ctx));

				    ss::describe_ring.set(r, rest_bind(rest_describe_ring, ctx, ss));

				    ss::get_load.set(r, rest_bind(rest_get_load, ctx));

				    ss::get_current_generation_number.set(r, rest_bind(rest_get_current_generation_number, ss));

				    ss::get_natural_endpoints.set(r, rest_bind(rest_get_natural_endpoints, ctx, ss));

				    ss::get_natural_endpoints_v2.set(r, rest_bind(rest_get_natural_endpoints_v2, ctx, ss));

				    ss::cdc_streams_check_and_repair.set(r, rest_bind(rest_cdc_streams_check_and_repair, ss));

				    ss::force_compaction.set(r, rest_bind(rest_force_compaction, ctx));

				    ss::force_keyspace_compaction.set(r, rest_bind(rest_force_keyspace_compaction, ctx));

				    ss::force_keyspace_cleanup.set(r, rest_bind(rest_force_keyspace_cleanup, ctx, ss));

				    ss::cleanup_all.set(r, rest_bind(rest_cleanup_all, ctx, ss));

				    ss::perform_keyspace_offstrategy_compaction.set(r, rest_bind(rest_perform_keyspace_offstrategy_compaction, ctx));

				    ss::upgrade_sstables.set(r, rest_bind(rest_upgrade_sstables, ctx));

				    ss::reset_cleanup_needed.set(r, rest_bind(rest_reset_cleanup_needed, ctx, ss));

				    ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));

				    ss::force_keyspace_flush.set(r, rest_bind(rest_force_keyspace_flush, ctx));

				    ss::decommission.set(r, rest_bind(rest_decommission, ss));

				    ss::decommission.set(r, rest_bind(rest_decommission, ss, ssc));

				    ss::logstor_compaction.set(r, rest_bind(rest_logstor_compaction, ctx));

				    ss::logstor_flush.set(r, rest_bind(rest_logstor_flush, ctx));

				    ss::move.set(r, rest_bind(rest_move, ss));

				    ss::remove_node.set(r, rest_bind(rest_remove_node, ss));

				    ss::exclude_node.set(r, rest_bind(rest_exclude_node, ss));

				    ss::get_removal_status.set(r, rest_bind(rest_get_removal_status, ss));

				    ss::force_remove_completion.set(r, rest_bind(rest_force_remove_completion, ss));

				    ss::set_logging_level.set(r, rest_bind(rest_set_logging_level));

				    ss::get_logging_levels.set(r, rest_bind(rest_get_logging_levels));

				    ss::get_operation_mode.set(r, rest_bind(rest_get_operation_mode, ss));

				    ss::is_starting.set(r, rest_bind(rest_is_starting, ss));

				    ss::get_drain_progress.set(r, rest_bind(rest_get_drain_progress, ctx));

				    ss::get_drain_progress.set(r, rest_bind(rest_get_drain_progress, ss));

				    ss::drain.set(r, rest_bind(rest_drain, ss));

				    ss::get_keyspaces.set(r, rest_bind(rest_get_keyspaces, ctx));

				    ss::stop_gossiping.set(r, rest_bind(rest_stop_gossiping, ss));

				    ss::start_gossiping.set(r, rest_bind(rest_start_gossiping, ss));

				    ss::is_gossip_running.set(r, rest_bind(rest_is_gossip_running, ss));

				@@ -1900,7 +1859,6 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_

				    ss::get_batch_size_failure_threshold.set(r, rest_bind(rest_get_batch_size_failure_threshold));

				    ss::set_batch_size_failure_threshold.set(r, rest_bind(rest_set_batch_size_failure_threshold));

				    ss::set_hinted_handoff_throttle_in_kb.set(r, rest_bind(rest_set_hinted_handoff_throttle_in_kb));

				    ss::get_metrics_load.set(r, rest_bind(rest_get_metrics_load, ctx));

				    ss::get_exceptions.set(r, rest_bind(rest_get_exceptions, ss));

				    ss::get_total_hints_in_progress.set(r, rest_bind(rest_get_total_hints_in_progress));

				    ss::get_total_hints.set(r, rest_bind(rest_get_total_hints));

				@@ -1909,6 +1867,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_

				    ss::retrain_dict.set(r, rest_bind(rest_retrain_dict, ctx, ss, group0_client));

				    ss::estimate_compression_ratios.set(r, rest_bind(rest_estimate_compression_ratios, ctx, ss));

				    ss::sstable_info.set(r, rest_bind(rest_sstable_info, ctx));

				    ss::logstor_info.set(r, rest_bind(rest_logstor_info, ctx));

				    ss::reload_raft_topology_state.set(r, rest_bind(rest_reload_raft_topology_state, ss, group0_client));

				    ss::upgrade_to_raft_topology.set(r, rest_bind(rest_upgrade_to_raft_topology, ss));

				    ss::raft_topology_upgrade_status.set(r, rest_bind(rest_raft_topology_upgrade_status, ss));

				@@ -1925,28 +1884,25 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_

				void unset_storage_service(http_context& ctx, routes& r) {

				    ss::get_token_endpoint.unset(r);

				    ss::toppartitions_generic.unset(r);

				    ss::get_release_version.unset(r);

				    ss::get_scylla_release_version.unset(r);

				    ss::get_schema_version.unset(r);

				    ss::get_range_to_endpoint_map.unset(r);

				    ss::get_pending_range_to_endpoint_map.unset(r);

				    ss::describe_ring.unset(r);

				    ss::get_load.unset(r);

				    ss::get_current_generation_number.unset(r);

				    ss::get_natural_endpoints.unset(r);

				    ss::cdc_streams_check_and_repair.unset(r);

				    ss::force_compaction.unset(r);

				    ss::force_keyspace_compaction.unset(r);

				    ss::force_keyspace_cleanup.unset(r);

				    ss::cleanup_all.unset(r);

				    ss::perform_keyspace_offstrategy_compaction.unset(r);

				    ss::upgrade_sstables.unset(r);

				    ss::reset_cleanup_needed.unset(r);

				    ss::force_flush.unset(r);

				    ss::force_keyspace_flush.unset(r);

				    ss::logstor_compaction.unset(r);

				    ss::logstor_flush.unset(r);

				    ss::decommission.unset(r);

				    ss::move.unset(r);

				    ss::remove_node.unset(r);

				    ss::exclude_node.unset(r);

				    ss::get_removal_status.unset(r);

				    ss::force_remove_completion.unset(r);

				    ss::set_logging_level.unset(r);

				@@ -1955,7 +1911,6 @@ void unset_storage_service(http_context& ctx, routes& r) {

				    ss::is_starting.unset(r);

				    ss::get_drain_progress.unset(r);

				    ss::drain.unset(r);

				    ss::get_keyspaces.unset(r);

				    ss::stop_gossiping.unset(r);

				    ss::start_gossiping.unset(r);

				    ss::is_gossip_running.unset(r);

				@@ -1985,13 +1940,13 @@ void unset_storage_service(http_context& ctx, routes& r) {

				    ss::get_batch_size_failure_threshold.unset(r);

				    ss::set_batch_size_failure_threshold.unset(r);

				    ss::set_hinted_handoff_throttle_in_kb.unset(r);

				    ss::get_metrics_load.unset(r);

				    ss::get_exceptions.unset(r);

				    ss::get_total_hints_in_progress.unset(r);

				    ss::get_total_hints.unset(r);

				    ss::get_ownership.unset(r);

				    ss::get_effective_ownership.unset(r);

				    ss::sstable_info.unset(r);

				    ss::logstor_info.unset(r);

				    ss::reload_raft_topology_state.unset(r);

				    ss::upgrade_to_raft_topology.unset(r);

				    ss::raft_topology_upgrade_status.unset(r);

				@@ -2027,7 +1982,7 @@ void unset_load_meter(http_context& ctx, routes& r) {

				void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_ctl) {

				    ss::get_snapshot_details.set(r, [&snap_ctl](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto result = co_await snap_ctl.local().get_snapshot_details();

				        co_return std::function([res = std::move(result)] (output_stream<char>&& o) -> future<> {

				        co_return noncopyable_function<future<> (output_stream<char>&&)>([res = std::move(result)] (output_stream<char>&& o) -> future<> {

				            std::exception_ptr ex;

				            output_stream<char> out = std::move(o);

				            try {

				@@ -2067,16 +2022,20 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_

				    });

				    ss::take_snapshot.set(r, [&snap_ctl](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        apilog.info("take_snapshot: {}", req->query_parameters);

				        apilog.info("take_snapshot: {}", req->get_query_params());

				        auto tag = req->get_query_param("tag");

				        auto column_families = split(req->get_query_param("cf"), ",");

				        auto sfopt = req->get_query_param("sf");

				        auto sf = db::snapshot_ctl::skip_flush(strcasecmp(sfopt.c_str(), "true") == 0);

				        auto tcopt = req->get_query_param("tc");

				        db::snapshot_options opts = {

				            .skip_flush = strcasecmp(sfopt.c_str(), "true") == 0,

				        };

				        std::vector<sstring> keynames = split(req->get_query_param("kn"), ",");

				        try {

				            if (column_families.empty()) {

				                co_await snap_ctl.local().take_snapshot(tag, keynames, sf);

				                co_await snap_ctl.local().take_snapshot(tag, keynames, opts);

				            } else {

				                if (keynames.empty()) {

				                    throw httpd::bad_param_exception("The keyspace of column families must be specified");

				@@ -2084,7 +2043,7 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_

				                if (keynames.size() > 1) {

				                    throw httpd::bad_param_exception("Only one keyspace allowed when specifying a column family");

				                }

				                co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, sf);

				                co_await snap_ctl.local().take_column_family_snapshot(keynames[0], column_families, tag, opts);

				            }

				            co_return json_void();

				        } catch (...) {

				@@ -2093,8 +2052,29 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_

				        }

				    });

				    ss::take_cluster_snapshot.set(r, [&snap_ctl](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        apilog.info("take_cluster_snapshot: {}", req->get_query_params());

				        auto tag = req->get_query_param("tag");

				        auto column_families = split(req->get_query_param("table"), ",");

				        // Note: not published/active. Retain as internal option, but...

				        auto sfopt = req->get_query_param("skip_flush");

				        db::snapshot_options opts = {

				            .skip_flush = strcasecmp(sfopt.c_str(), "true") == 0,

				        };

				        std::vector<sstring> keynames = split(req->get_query_param("keyspace"), ",");

				        try {

				            co_await snap_ctl.local().take_cluster_column_family_snapshot(keynames, column_families, tag, opts);

				            co_return json_void();

				        } catch (...) {

				            apilog.error("take_cluster_snapshot failed: {}", std::current_exception());

				            throw;

				        }

				    });

				    ss::del_snapshot.set(r, [&snap_ctl](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        apilog.info("del_snapshot: {}", req->query_parameters);

				        apilog.info("del_snapshot: {}", req->get_query_params());

				        auto tag = req->get_query_param("tag");

				        auto column_family = req->get_query_param("cf");

				@@ -2116,17 +2096,22 @@ void set_snapshot(http_context& ctx, routes& r, sharded<db::snapshot_ctl>& snap_

				    ss::scrub.set(r, [&ctx, &snap_ctl] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        auto info = co_await parse_scrub_options(ctx, snap_ctl, std::move(req));

				        auto info = parse_scrub_options(ctx, std::move(req));

				        sstables::compaction_stats stats;

				        if (!info.snapshot_tag.empty()) {

				            db::snapshot_options opts = {.skip_flush = false};

				            co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, opts);

				        }

				        compaction::compaction_stats stats;

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<scrub_sstables_compaction_task_impl>({}, info.keyspace, db, info.column_families, info.opts, &stats);

				        auto task = co_await compaction_module.make_and_start_task<compaction::scrub_sstables_compaction_task_impl>({}, info.keyspace, db, info.column_families, info.opts, &stats);

				        try {

				            co_await task->done();

				            if (stats.validation_errors) {

				                co_return json::json_return_type(static_cast<int>(scrub_status::validation_errors));

				            }

				        } catch (const sstables::compaction_aborted_exception&) {

				        } catch (const compaction::compaction_aborted_exception&) {

				            co_return json::json_return_type(static_cast<int>(scrub_status::aborted));

				        } catch (...) {

				            apilog.error("scrub keyspace={} tables={} failed: {}", info.keyspace, info.column_families, std::current_exception());

				@@ -2178,6 +2163,7 @@ void unset_snapshot(http_context& ctx, routes& r) {

				    ss::start_backup.unset(r);

				    cf::get_true_snapshots_size.unset(r);

				    cf::get_all_true_snapshots_size.unset(r);

				    ss::decommission.unset(r);

				}

				}

									
										9

api/storage_service.hh
									
												
				@@ -58,14 +58,15 @@ std::vector<table_info> parse_table_infos(const sstring& ks_name, const http_con

				std::pair<sstring, std::vector<table_info>> parse_table_infos(const http_context& ctx, const http::request& req, sstring cf_param_name = "cf");

				struct scrub_info {

				    sstables::compaction_type_options::scrub opts;

				    compaction::compaction_type_options::scrub opts;

				    sstring keyspace;

				    std::vector<sstring> column_families;

				    sstring snapshot_tag;

				};

				future<scrub_info> parse_scrub_options(const http_context& ctx, sharded<db::snapshot_ctl>& snap_ctl, std::unique_ptr<http::request> req);

				scrub_info parse_scrub_options(const http_context& ctx, std::unique_ptr<http::request> req);

				void set_storage_service(http_context& ctx, httpd::routes& r, sharded<service::storage_service>& ss, service::raft_group0_client&);

				void set_storage_service(http_context& ctx, httpd::routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>&, service::raft_group0_client&);

				void unset_storage_service(http_context& ctx, httpd::routes& r);

				void set_sstables_loader(http_context& ctx, httpd::routes& r, sharded<sstables_loader>& sst_loader);

				void unset_sstables_loader(http_context& ctx, httpd::routes& r);

				@@ -81,7 +82,7 @@ void set_snapshot(http_context& ctx, httpd::routes& r, sharded<db::snapshot_ctl>

				void unset_snapshot(http_context& ctx, httpd::routes& r);

				void set_load_meter(http_context& ctx, httpd::routes& r, service::load_meter& lm);

				void unset_load_meter(http_context& ctx, httpd::routes& r);

				seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, http_context &ctx, bool legacy_request = false);

				seastar::future<json::json_return_type> run_toppartitions_query(db::toppartitions_query& q, bool legacy_request = false);

				// converts string value of boolean parameter into bool

				// maps (case insensitively)

									
										47

api/system.cc
									
												
				@@ -10,7 +10,7 @@

				#include "api/api-doc/system.json.hh"

				#include "api/api-doc/metrics.json.hh"

				#include "replica/database.hh"

				#include "db/sstables-format-selector.hh"

				#include "sstables/sstables_manager.hh"

				#include <rapidjson/document.h>

				#include <boost/lexical_cast.hpp>

				@@ -54,7 +54,8 @@ void set_system(http_context& ctx, routes& r) {

				    hm::set_metrics_config.set(r, [](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        rapidjson::Document doc;

				        doc.Parse(req->content.c_str());

				        auto content = co_await util::read_entire_stream_contiguous(*req->content_stream);

				        doc.Parse(content.c_str());

				        if (!doc.IsArray()) {

				            throw bad_param_exception("Expected a json array");

				        }

				@@ -87,21 +88,19 @@ void set_system(http_context& ctx, routes& r) {

				                relabels[i].expr = element["regex"].GetString();

				            }

				        }

				        return do_with(std::move(relabels), false, [](const std::vector<seastar::metrics::relabel_config>& relabels, bool& failed) {

				            return smp::invoke_on_all([&relabels, &failed] {

				                return metrics::set_relabel_configs(relabels).then([&failed](const metrics::metric_relabeling_result& result) {

				                    if (result.metrics_relabeled_due_to_collision > 0) {

				                        failed = true;

				                    }

				                    return;

				                });

				            }).then([&failed](){

				                if (failed) {

				                    throw bad_param_exception("conflicts found during relabeling");

				        bool failed = false;

				        co_await smp::invoke_on_all([&relabels, &failed] {

				            return metrics::set_relabel_configs(relabels).then([&failed](const metrics::metric_relabeling_result& result) {

				                if (result.metrics_relabeled_due_to_collision > 0) {

				                    failed = true;

				                }

				                return make_ready_future<json::json_return_type>(seastar::json::json_void());

				                return;

				            });

				        });

				        if (failed) {

				            throw bad_param_exception("conflicts found during relabeling");

				        }

				        co_return seastar::json::json_void();

				    });

				    hs::get_system_uptime.set(r, [](const_req req) {

				@@ -184,18 +183,20 @@ void set_system(http_context& ctx, routes& r) {

				        apilog.info("Profile dumped to {}", profile_dest);

				        return make_ready_future<json::json_return_type>(json::json_return_type(json::json_void()));

				    }) ;

				}

				void set_format_selector(http_context& ctx, routes& r, db::sstables_format_selector& sel) {

				    hs::get_highest_supported_sstable_version.set(r, [&sel] (std::unique_ptr<request> req) {

				        return smp::submit_to(0, [&sel] {

				            return make_ready_future<json::json_return_type>(seastar::to_sstring(sel.selected_format()));

				    hs::get_highest_supported_sstable_version.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return smp::submit_to(0, [&ctx] {

				            auto format = ctx.db.local().get_user_sstables_manager().get_highest_supported_format();

				            return make_ready_future<json::json_return_type>(seastar::to_sstring(format));

				        });

				    });

				    hs::get_chosen_sstable_version.set(r, [&ctx] (std::unique_ptr<request> req) {

				        return smp::submit_to(0, [&ctx] {

				            auto format = ctx.db.local().get_user_sstables_manager().get_preferred_sstable_version();

				            return make_ready_future<json::json_return_type>(seastar::to_sstring(format));

				        });

				    });

				}

				void unset_format_selector(http_context& ctx, routes& r) {

				    hs::get_highest_supported_sstable_version.unset(r);

				}

				}

									
										5

api/system.hh
									
												
				@@ -12,14 +12,9 @@ namespace seastar::httpd {

				class routes;

				}

				namespace db { class sstables_format_selector; }

				namespace api {

				struct http_context;

				void set_system(http_context& ctx, seastar::httpd::routes& r);

				void set_format_selector(http_context& ctx, seastar::httpd::routes& r, db::sstables_format_selector& sel);

				void unset_format_selector(http_context& ctx, seastar::httpd::routes& r);

				}

									
										33

api/task_manager.cc
									
												
				@@ -6,8 +6,10 @@

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#include <seastar/core/chunked_fifo.hh>

				#include <seastar/core/coroutine.hh>

				#include <seastar/coroutine/exception.hh>

				#include <seastar/coroutine/maybe_yield.hh>

				#include <seastar/http/exception.hh>

				#include "task_manager.hh"

				@@ -34,8 +36,9 @@ static ::tm get_time(db_clock::time_point tp) {

				}

				tm::task_status make_status(tasks::task_status status, sharded<gms::gossiper>& gossiper) {

				    std::vector<tm::task_identity> tis{status.children.size()};

				    std::ranges::transform(status.children, tis.begin(), [&gossiper] (const auto& child) {

				    chunked_fifo<tm::task_identity> tis;

				    tis.reserve(status.children.size());

				    for (const auto& child : status.children) {

				        tm::task_identity ident;

				        gms::inet_address addr{};

				        if (gossiper.local_is_initialized()) {

				@@ -43,8 +46,8 @@ tm::task_status make_status(tasks::task_status status, sharded<gms::gossiper>& g

				        }

				        ident.task_id = child.task_id.to_sstring();

				        ident.node = fmt::format("{}", addr);

				        return ident;

				    });

				        tis.push_back(std::move(ident));

				    }

				    tm::task_status res{};

				    res.id = status.task_id.to_sstring();

				@@ -105,11 +108,11 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				                throw bad_param_exception(fmt::format("{}", std::current_exception()));

				            }

				            if (auto it = req->query_parameters.find("keyspace"); it != req->query_parameters.end()) {

				                keyspace = it->second;

				            if (auto param = req->get_query_param("keyspace"); !param.empty()) {

				                keyspace = param;

				            }

				            if (auto it = req->query_parameters.find("table"); it != req->query_parameters.end()) {

				                table = it->second;

				            if (auto param = req->get_query_param("table"); !param.empty()) {

				                table = param;

				            }

				            return module->get_stats(internal, [keyspace = std::move(keyspace), table = std::move(table)] (std::string& ks, std::string& t) {

				@@ -117,7 +120,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				            });

				        });

				        std::function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {

				        noncopyable_function<future<>(output_stream<char>&&)> f = [r = std::move(res)] (output_stream<char>&& os) -> future<> {

				            auto s = std::move(os);

				            std::exception_ptr ex;

				            try {

				@@ -173,8 +176,8 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				        auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};

				        tasks::task_status status;

				        std::optional<std::chrono::seconds> timeout = std::nullopt;

				        if (auto it = req->query_parameters.find("timeout"); it != req->query_parameters.end()) {

				            timeout = std::chrono::seconds(boost::lexical_cast<uint32_t>(it->second));

				        if (auto param = req->get_query_param("timeout"); !param.empty()) {

				            timeout = std::chrono::seconds(boost::lexical_cast<uint32_t>(param));

				        }

				        try {

				            auto task = tasks::task_handler{tm.local(), id};

				@@ -194,7 +197,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				            auto task = tasks::task_handler{tm.local(), id};

				            auto res = co_await task.get_status_recursively(true);

				            std::function<future<>(output_stream<char>&&)> f = [r = std::move(res), &gossiper] (output_stream<char>&& os) -> future<> {

				            noncopyable_function<future<>(output_stream<char>&&)> f = [r = std::move(res), &gossiper] (output_stream<char>&& os) -> future<> {

				                auto s = std::move(os);

				                auto res = std::move(r);

				                co_await s.write("[");

				@@ -215,7 +218,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				    tm::get_and_update_ttl.set(r, [&cfg] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        uint32_t ttl = cfg.task_ttl_seconds();

				        try {

				            co_await cfg.task_ttl_seconds.set_value_on_all_shards(req->query_parameters["ttl"], utils::config_file::config_source::API);

				            co_await cfg.task_ttl_seconds.set_value_on_all_shards(req->get_query_param("ttl"), utils::config_file::config_source::API);

				        } catch (...) {

				            throw bad_param_exception(fmt::format("{}", std::current_exception()));

				        }

				@@ -230,7 +233,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				    tm::get_and_update_user_ttl.set(r, [&cfg] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        uint32_t user_ttl = cfg.user_task_ttl_seconds();

				        try {

				            co_await cfg.user_task_ttl_seconds.set_value_on_all_shards(req->query_parameters["user_ttl"], utils::config_file::config_source::API);

				            co_await cfg.user_task_ttl_seconds.set_value_on_all_shards(req->get_query_param("user_ttl"), utils::config_file::config_source::API);

				        } catch (...) {

				            throw bad_param_exception(fmt::format("{}", std::current_exception()));

				        }

				@@ -262,7 +265,7 @@ void set_task_manager(http_context& ctx, routes& r, sharded<tasks::task_manager>

				                if (id) {

				                    module->unregister_task(id);

				                }

				                co_await maybe_yield();

				                co_await coroutine::maybe_yield();

				            }

				        });

				        co_return json_void();

									
										29

api/task_manager_test.cc
									
												
				@@ -57,20 +57,16 @@ void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_man

				    tmt::register_test_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        sharded<tasks::task_manager>& tms = tm;

				        auto it = req->query_parameters.find("task_id");

				        auto id = it != req->query_parameters.end() ? tasks::task_id{utils::UUID{it->second}} : tasks::task_id::create_null_id();

				        it = req->query_parameters.find("shard");

				        unsigned shard = it != req->query_parameters.end() ? boost::lexical_cast<unsigned>(it->second) : 0;

				        it = req->query_parameters.find("keyspace");

				        std::string keyspace = it != req->query_parameters.end() ? it->second : "";

				        it = req->query_parameters.find("table");

				        std::string table = it != req->query_parameters.end() ? it->second : "";

				        it = req->query_parameters.find("entity");

				        std::string entity = it != req->query_parameters.end() ? it->second : "";

				        it = req->query_parameters.find("parent_id");

				        const auto id_param = req->get_query_param("task_id");

				        auto id = !id_param.empty() ? tasks::task_id{utils::UUID{id_param}} : tasks::task_id::create_null_id();

				        const auto shard_param = req->get_query_param("shard");

				        unsigned shard = shard_param.empty() ? 0 : boost::lexical_cast<unsigned>(shard_param);

				        std::string keyspace = req->get_query_param("keyspace");

				        std::string table = req->get_query_param("table");

				        std::string entity = req->get_query_param("entity");

				        tasks::task_info data;

				        if (it != req->query_parameters.end()) {

				            data.id = tasks::task_id{utils::UUID{it->second}};

				        if (auto parent_id = req->get_query_param("parent_id"); !parent_id.empty()) {

				            data.id = tasks::task_id{utils::UUID{parent_id}};

				            auto parent_ptr = co_await tasks::task_manager::lookup_task_on_all_shards(tm, data.id);

				            data.shard = parent_ptr->get_status().shard;

				        }

				@@ -88,7 +84,7 @@ void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_man

				    });

				    tmt::unregister_test_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto id = tasks::task_id{utils::UUID{req->query_parameters["task_id"]}};

				        auto id = tasks::task_id{utils::UUID{req->get_query_param("task_id")}};

				        try {

				            co_await tasks::task_manager::invoke_on_task(tm, id, [] (tasks::task_manager::task_variant task_v, tasks::virtual_task_hint) -> future<> {

				                return std::visit(overloaded_functor{

				@@ -109,9 +105,8 @@ void set_task_manager_test(http_context& ctx, routes& r, sharded<tasks::task_man

				    tmt::finish_test_task.set(r, [&tm] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto id = tasks::task_id{utils::UUID{req->get_path_param("task_id")}};

				        auto it = req->query_parameters.find("error");

				        bool fail = it != req->query_parameters.end();

				        std::string error = fail ? it->second : "";

				        std::string error = req->get_query_param("error");

				        bool fail = !error.empty();

				        try {

				            co_await tasks::task_manager::invoke_on_task(tm, id, [fail, error = std::move(error)] (tasks::task_manager::task_variant task_v, tasks::virtual_task_hint) -> future<> {

									
										146

api/tasks.cc
									
												
				@@ -12,6 +12,7 @@

				#include "api/api.hh"

				#include "api/storage_service.hh"

				#include "api/api-doc/tasks.json.hh"

				#include "api/api-doc/storage_service.json.hh"

				#include "compaction/compaction_manager.hh"

				#include "compaction/task_manager_module.hh"

				#include "service/storage_service.hh"

				@@ -25,6 +26,7 @@ extern logging::logger apilog;

				namespace api {

				namespace t = httpd::tasks_json;

				namespace ss = httpd::storage_service_json;

				using namespace json;

				using ks_cf_func = std::function<future<json::json_return_type>(http_context&, std::unique_ptr<http::request>, sstring, std::vector<table_info>)>;

				@@ -36,76 +38,152 @@ static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {

				    };

				}

				static future<shared_ptr<compaction::major_keyspace_compaction_task_impl>> force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request> req) {

				    auto& db = ctx.db;

				    auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");

				    auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);

				    auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);

				    apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);

				    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				    std::optional<compaction::flush_mode> fmopt;

				    if (!flush && !consider_only_existing_data) {

				        fmopt = compaction::flush_mode::skip;

				    }

				    return compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);

				}

				static future<shared_ptr<compaction::upgrade_sstables_compaction_task_impl>> upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {

				    auto& db = ctx.db;

				    bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);

				    apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);

				    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				    return compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);

				}

				static future<shared_ptr<compaction::cleanup_keyspace_compaction_task_impl>> force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {

				    auto& db = ctx.db;

				    auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				    const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();

				    if (rs.is_local() || !rs.is_vnode_based()) {

				        auto reason = rs.is_local() ? "require" : "support";

				        apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);

				        co_return nullptr;

				    }

				    apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);

				    if (!co_await ss.local().is_vnodes_cleanup_allowed(keyspace)) {

				        auto msg = "Can not perform cleanup operation when topology changes";

				        apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);

				        co_await coroutine::return_exception(std::runtime_error(msg));

				    }

				    auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				    co_return co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>(

				        {}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);

				}

				void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& snap_ctl) {

				    t::force_keyspace_compaction_async.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");

				        auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);

				        apilog.debug("force_keyspace_compaction_async: keyspace={} tables={}, flush={}", keyspace, table_infos, flush);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        std::optional<flush_mode> fmopt;

				        if (!flush) {

				            fmopt = flush_mode::skip;

				        }

				        auto task = co_await compaction_module.make_and_start_task<major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt);

				        auto task = co_await force_keyspace_compaction(ctx, std::move(req));

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    });

				    ss::force_keyspace_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto task = co_await force_keyspace_compaction(ctx, std::move(req));

				        co_await task->done();

				        co_return json_void();

				    });

				    t::force_keyspace_cleanup_async.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        auto [keyspace, table_infos] = parse_table_infos(ctx, *req);

				        apilog.info("force_keyspace_cleanup_async: keyspace={} tables={}", keyspace, table_infos);

				        if (!co_await ss.local().is_cleanup_allowed(keyspace)) {

				            auto msg = "Can not perform cleanup operation when topology changes";

				            apilog.warn("force_keyspace_cleanup_async: keyspace={} tables={}: {}", keyspace, table_infos, msg);

				            co_await coroutine::return_exception(std::runtime_error(msg));

				        tasks::task_id id = tasks::task_id::create_null_id();

				        auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));

				        if (task) {

				            id = task->get_status().id;

				        }

				        co_return json::json_return_type(id.to_sstring());

				    });

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<cleanup_keyspace_compaction_task_impl>({}, std::move(keyspace), db, table_infos, flush_mode::all_tables, tasks::is_user_task::yes);

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));

				        if (task) {

				            co_await task->done();

				        }

				        co_return json::json_return_type(0);

				    });

				    t::perform_keyspace_offstrategy_compaction_async.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {

				        apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, table_infos);

				        auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, nullptr);

				        auto task = co_await compaction_module.make_and_start_task<compaction::offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, nullptr);

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    }));

				    ss::perform_keyspace_offstrategy_compaction.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {

				        apilog.info("perform_keyspace_offstrategy_compaction: keyspace={} tables={}", keyspace, table_infos);

				        bool res = false;

				        auto& compaction_module = ctx.db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<compaction::offstrategy_keyspace_compaction_task_impl>({}, std::move(keyspace), ctx.db, table_infos, &res);

				        co_await task->done();

				        co_return json::json_return_type(res);

				    }));

				    t::upgrade_sstables_async.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);

				        apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);

				        auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    }));

				    ss::upgrade_sstables.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {

				        auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));

				        co_await task->done();

				        co_return json::json_return_type(0);

				    }));

				    t::scrub_async.set(r, [&ctx, &snap_ctl] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        auto info = co_await parse_scrub_options(ctx, snap_ctl, std::move(req));

				        auto info = parse_scrub_options(ctx, std::move(req));

				        if (!info.snapshot_tag.empty()) {

				            db::snapshot_options opts = {.skip_flush = false};

				            co_await snap_ctl.local().take_column_family_snapshot(info.keyspace, info.column_families, info.snapshot_tag, opts);

				        }

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        auto task = co_await compaction_module.make_and_start_task<scrub_sstables_compaction_task_impl>({}, std::move(info.keyspace), db, std::move(info.column_families), info.opts, nullptr);

				        auto task = co_await compaction_module.make_and_start_task<compaction::scrub_sstables_compaction_task_impl>({}, std::move(info.keyspace), db, std::move(info.column_families), info.opts, nullptr);

				        co_return json::json_return_type(task->get_status().id.to_sstring());

				    });

				    ss::force_compaction.set(r, [&ctx] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {

				        auto& db = ctx.db;

				        auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);

				        auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);

				        apilog.info("force_compaction: flush={} consider_only_existing_data={}", flush, consider_only_existing_data);

				        auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();

				        std::optional<compaction::flush_mode> fmopt;

				        if (!flush && !consider_only_existing_data) {

				            fmopt = compaction::flush_mode::skip;

				        }

				        auto task = co_await compaction_module.make_and_start_task<compaction::global_major_compaction_task_impl>({}, db, fmopt, consider_only_existing_data);

				        co_await task->done();

				        co_return json_void();

				    });

				}

				void unset_tasks_compaction_module(http_context& ctx, httpd::routes& r) {

				    t::force_keyspace_compaction_async.unset(r);

				    ss::force_keyspace_compaction.unset(r);

				    t::force_keyspace_cleanup_async.unset(r);

				    ss::force_keyspace_cleanup.unset(r);

				    t::perform_keyspace_offstrategy_compaction_async.unset(r);

				    ss::perform_keyspace_offstrategy_compaction.unset(r);

				    t::upgrade_sstables_async.unset(r);

				    ss::upgrade_sstables.unset(r);

				    t::scrub_async.unset(r);

				    ss::force_compaction.unset(r);

				}

				}
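
Both `force_keyspace_compaction` and `ss::force_compaction` in the api/tasks.cc diff above derive an optional flush mode from the same two query parameters: `flush_mode::skip` is chosen only when `flush_memtables` is false and `consider_only_existing_data` is also false. Otherwise the optional stays empty, presumably leaving the choice to the task. A minimal standalone sketch of that decision follows. The `choose_flush_mode` helper and the two-value enum are hypothetical stand-ins, not Scylla's `compaction::flush_mode`.

```cpp
#include <optional>

// Hypothetical enum for illustration; the diff only shows that
// compaction::flush_mode has `skip` and `all_tables` members.
enum class flush_mode { skip, all_tables };

// Mirrors the condition used by both handlers in the diff:
// skip flushing only if the caller disabled flush_memtables and did not
// ask to consider only existing data. An empty optional means
// "no explicit mode requested".
inline std::optional<flush_mode> choose_flush_mode(bool flush_memtables,
                                                   bool consider_only_existing_data) {
    std::optional<flush_mode> fmopt;
    if (!flush_memtables && !consider_only_existing_data) {
        fmopt = flush_mode::skip;
    }
    return fmopt;
}
```

Keeping the rule in one function is the same motivation as the shared `force_keyspace_compaction()` helper in the diff: the blocking and `_async` endpoints cannot drift apart.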

									
api/token_metadata.cc (12 lines changed)
				@@ -62,6 +62,17 @@ void set_token_metadata(http_context& ctx, routes& r, sharded<locator::shared_to

				        return addr | std::ranges::to<std::vector>();

				    });

				    ss::get_excluded_nodes.set(r, [&tm](const_req req) {

				        const auto& local_tm = *tm.local().get();

				        std::vector<sstring> eps;

				        local_tm.get_topology().for_each_node([&] (auto& node) {

				            if (node.is_excluded()) {

				                eps.push_back(node.host_id().to_sstring());

				            }

				        });

				        return eps;

				    });

				    ss::get_joining_nodes.set(r, [&tm, &g](const_req req) {

				        const auto& local_tm = *tm.local().get();

				        const auto& points = local_tm.get_bootstrap_tokens();

				@@ -130,6 +141,7 @@ void unset_token_metadata(http_context& ctx, routes& r) {

				    ss::get_leaving_nodes.unset(r);

				    ss::get_moving_nodes.unset(r);

				    ss::get_joining_nodes.unset(r);

				    ss::get_excluded_nodes.unset(r);

				    ss::get_host_id_map.unset(r);

				    httpd::endpoint_snitch_info_json::get_datacenter.unset(r);

				    httpd::endpoint_snitch_info_json::get_rack.unset(r);

									
audit/CMakeLists.txt (4 lines changed)
				@@ -5,6 +5,7 @@ target_sources(scylla_audit

				  PRIVATE

				    audit.cc

				    audit_cf_storage_helper.cc

				    audit_composite_storage_helper.cc

				    audit_syslog_storage_helper.cc)

				target_include_directories(scylla_audit

				  PUBLIC

				@@ -16,4 +17,7 @@ target_link_libraries(scylla_audit

				  PRIVATE

				    cql3)

				if (Scylla_USE_PRECOMPILED_HEADER_USE)

				  target_precompile_headers(scylla_audit REUSE_FROM scylla-precompiled-header)

				endif()

				add_whole_archive(audit scylla_audit)

									
audit/audit.cc (120 lines changed)
				@@ -13,9 +13,11 @@

				#include "cql3/statements/batch_statement.hh"

				#include "cql3/statements/modification_statement.hh"

				#include "storage_helper.hh"

				#include "audit_cf_storage_helper.hh"

				#include "audit_syslog_storage_helper.hh"

				#include "audit_composite_storage_helper.hh"

				#include "audit.hh"

				#include "../db/config.hh"

				#include "utils/class_registrator.hh"

				#include <boost/algorithm/string/split.hpp>

				#include <boost/algorithm/string/trim.hpp>

				@@ -26,6 +28,47 @@ namespace audit {

				logging::logger logger("audit");

				static std::set<sstring> parse_audit_modes(const sstring& data) {

				    std::set<sstring> result;

				    if (!data.empty()) {

				        std::vector<sstring> audit_modes;

				        boost::split(audit_modes, data, boost::is_any_of(","));

				        if (audit_modes.empty()) {

				            return {};

				        }

				        for (sstring& audit_mode : audit_modes) {

				            boost::trim(audit_mode);

				            if (audit_mode == "none") {

				                return {};

				            }

				            if (audit_mode != "table" && audit_mode != "syslog") {

				                throw audit_exception(fmt::format("Bad configuration: invalid 'audit': {}", audit_mode));

				            }

				            result.insert(std::move(audit_mode));

				        }

				    }

				    return result;

				}

				static std::unique_ptr<storage_helper> create_storage_helper(const std::set<sstring>& audit_modes, cql3::query_processor& qp, service::migration_manager& mm) {

				    SCYLLA_ASSERT(!audit_modes.empty() && !audit_modes.contains("none"));

				    std::vector<std::unique_ptr<storage_helper>> helpers;

				    for (const sstring& audit_mode : audit_modes) {

				        if (audit_mode == "table") {

				            helpers.emplace_back(std::make_unique<audit_cf_storage_helper>(qp, mm));

				        } else if (audit_mode == "syslog") {

				            helpers.emplace_back(std::make_unique<audit_syslog_storage_helper>(qp, mm));

				        }

				    }

				    SCYLLA_ASSERT(!helpers.empty());

				    if (helpers.size() == 1) {

				        return std::move(helpers.front());

				    }

				    return std::make_unique<audit_composite_storage_helper>(std::move(helpers));

				}

				static sstring category_to_string(statement_category category)

				{

				    switch (category) {

				@@ -103,7 +146,9 @@ static std::set<sstring> parse_audit_keyspaces(const sstring& data) {

				}

				audit::audit(locator::shared_token_metadata& token_metadata,

				             sstring&& storage_helper_name,

				             cql3::query_processor& qp,

				             service::migration_manager& mm,

				             std::set<sstring>&& audit_modes,

				             std::set<sstring>&& audited_keyspaces,

				             std::map<sstring, std::set<sstring>>&& audited_tables,

				             category_set&& audited_categories,

				@@ -112,28 +157,21 @@ audit::audit(locator::shared_token_metadata& token_metadata,

				    , _audited_keyspaces(std::move(audited_keyspaces))

				    , _audited_tables(std::move(audited_tables))

				    , _audited_categories(std::move(audited_categories))

				    , _storage_helper_class_name(std::move(storage_helper_name))

				    , _cfg(cfg)

				    , _cfg_keyspaces_observer(cfg.audit_keyspaces.observe([this] (sstring const& new_value){ update_config<std::set<sstring>>(new_value, parse_audit_keyspaces, _audited_keyspaces); }))

				    , _cfg_tables_observer(cfg.audit_tables.observe([this] (sstring const& new_value){ update_config<std::map<sstring, std::set<sstring>>>(new_value, parse_audit_tables, _audited_tables); }))

				    , _cfg_categories_observer(cfg.audit_categories.observe([this] (sstring const& new_value){ update_config<category_set>(new_value, parse_audit_categories, _audited_categories); }))

				{ }

				{

				    _storage_helper_ptr = create_storage_helper(std::move(audit_modes), qp, mm);

				}

				audit::~audit() = default;

				future<> audit::create_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm) {

				    sstring storage_helper_name;

				    if (cfg.audit() == "table") {

				        storage_helper_name = "audit_cf_storage_helper";

				    } else if (cfg.audit() == "syslog") {

				        storage_helper_name = "audit_syslog_storage_helper";

				    } else if (cfg.audit() == "none") {

				        // Audit is off

				future<> audit::start_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm) {

				    std::set<sstring> audit_modes = parse_audit_modes(cfg.audit());

				    if (audit_modes.empty()) {

				        logger.info("Audit is disabled");

				        return make_ready_future<>();

				    } else {

				        throw audit_exception(fmt::format("Bad configuration: invalid 'audit': {}", cfg.audit()));

				    }

				    category_set audited_categories = parse_audit_categories(cfg.audit_categories());

				    std::map<sstring, std::set<sstring>> audited_tables = parse_audit_tables(cfg.audit_tables());

				@@ -143,19 +181,20 @@ future<> audit::create_audit(const db::config& cfg, sharded<locator::shared_toke

				                cfg.audit(), cfg.audit_categories(), cfg.audit_keyspaces(), cfg.audit_tables());

				    return audit_instance().start(std::ref(stm),

				                                  std::move(storage_helper_name),

				                                  std::ref(qp),

				                                  std::ref(mm),

				                                  std::move(audit_modes),

				                                  std::move(audited_keyspaces),

				                                  std::move(audited_tables),

				                                  std::move(audited_categories),

				                                  std::cref(cfg));

				}

				future<> audit::start_audit(const db::config& cfg, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm) {

				    if (!audit_instance().local_is_initialized()) {

				        return make_ready_future<>();

				    }

				    return audit_instance().invoke_on_all([&cfg, &qp, &mm] (audit& local_audit) {

				        return local_audit.start(cfg, qp.local(), mm.local());

				                                  std::cref(cfg))

				    .then([&cfg] {

				        if (!audit_instance().local_is_initialized()) {

				            return make_ready_future<>();

				        }

				        return audit_instance().invoke_on_all([&cfg] (audit& local_audit) {

				            return local_audit.start(cfg);

				        });

				    });

				}

				@@ -170,26 +209,14 @@ future<> audit::stop_audit() {

				    });

				}

				audit_info_ptr audit::create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table) {

				audit_info_ptr audit::create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table, bool batch) {

				    if (!audit_instance().local_is_initialized()) {

				        return nullptr;

				    }

				    return std::make_unique<audit_info>(cat, keyspace, table);

				    return std::make_unique<audit_info>(cat, keyspace, table, batch);

				}

				audit_info_ptr audit::create_no_audit_info() {

				    return audit_info_ptr();

				}

				future<> audit::start(const db::config& cfg, cql3::query_processor& qp, service::migration_manager& mm) {

				    try {

				        _storage_helper_ptr = create_object<storage_helper>(_storage_helper_class_name, qp, mm);

				    } catch (no_such_class& e) {

				        logger.error("Can't create audit storage helper {}: not supported", _storage_helper_class_name);

				        throw;

				    } catch (...) {

				        throw;

				    }

				future<> audit::start(const db::config& cfg) {

				    return _storage_helper_ptr->start(cfg);

				}

				@@ -236,18 +263,21 @@ future<> audit::log_login(const sstring& username, socket_address client_ip, boo

				}

				future<> inspect(shared_ptr<cql3::cql_statement> statement, service::query_state& query_state, const cql3::query_options& options, bool error) {

				    cql3::statements::batch_statement* batch = dynamic_cast<cql3::statements::batch_statement*>(statement.get());

				    if (batch != nullptr) {

				    auto audit_info = statement->get_audit_info();

				    if (!audit_info) {

				        return make_ready_future<>();

				    }

				    if (audit_info->batch()) {

				        cql3::statements::batch_statement* batch = static_cast<cql3::statements::batch_statement*>(statement.get());

				        return do_for_each(batch->statements().begin(), batch->statements().end(), [&query_state, &options, error] (auto&& m) {

				            return inspect(m.statement, query_state, options, error);

				        });

				    } else {

				        auto audit_info = statement->get_audit_info();

				        if (bool(audit_info) && audit::local_audit_instance().should_log(audit_info)) {

				        if (audit::local_audit_instance().should_log(audit_info)) {

				            return audit::local_audit_instance().log(audit_info, query_state, options, error);

				        }

				        return make_ready_future<>();

				    }

				    return make_ready_future<>();

				}

				future<> inspect_login(const sstring& username, socket_address client_ip, bool error) {
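
The `parse_audit_modes` function added in the audit/audit.cc diff above accepts a comma-separated list of modes, trims each entry, treats `none` anywhere in the list as "auditing disabled", and rejects anything other than `table` or `syslog`. The sketch below reproduces those rules with the standard library only. `parse_modes` and `trim` are hypothetical names, `std::invalid_argument` stands in for `audit_exception`, and `std::getline` replaces `boost::split`.

```cpp
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Trim ASCII whitespace from both ends (stand-in for boost::trim).
inline std::string trim(const std::string& s) {
    const char* ws = " \t\r\n";
    const auto first = s.find_first_not_of(ws);
    if (first == std::string::npos) {
        return "";
    }
    const auto last = s.find_last_not_of(ws);
    return s.substr(first, last - first + 1);
}

// Standalone analogue of parse_audit_modes().
// Note: unlike boost::split, std::getline does not yield a trailing empty
// token, so "table," is accepted here but would be rejected by the original.
inline std::set<std::string> parse_modes(const std::string& data) {
    std::set<std::string> result;
    if (data.empty()) {
        return result;
    }
    std::istringstream in(data);
    std::string item;
    while (std::getline(in, item, ',')) {
        item = trim(item);
        if (item == "none") {
            return {};  // "none" anywhere disables auditing
        }
        if (item != "table" && item != "syslog") {
            throw std::invalid_argument("Bad configuration: invalid 'audit': " + item);
        }
        result.insert(item);
    }
    return result;
}
```

An empty result corresponds to the "Audit is disabled" branch in `start_audit`, while a result with two entries is the case where `create_storage_helper` wraps both helpers in a composite.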

									
audit/audit.hh (19 lines changed)
				@@ -75,11 +75,13 @@ class audit_info final {

				    sstring _keyspace;

				    sstring _table;

				    sstring _query;

				    bool _batch;

				public:

				    audit_info(statement_category cat, sstring keyspace, sstring table)

				    audit_info(statement_category cat, sstring keyspace, sstring table, bool batch)

				        : _category(cat)

				        , _keyspace(std::move(keyspace))

				        , _table(std::move(table))

				        , _batch(batch)

				    { }

				    void set_query_string(const std::string_view& query_string) {

				        _query = sstring(query_string);

				@@ -89,6 +91,7 @@ public:

				    const sstring& query() const { return _query; }

				    sstring category_string() const;

				    statement_category category() const { return _category; }

				    bool batch() const { return _batch; }

				};

				using audit_info_ptr = std::unique_ptr<audit_info>;

				@@ -102,7 +105,6 @@ class audit final : public seastar::async_sharded_service<audit> {

				    std::map<sstring, std::set<sstring>> _audited_tables;

				    category_set _audited_categories;

				    sstring _storage_helper_class_name;

				    std::unique_ptr<storage_helper> _storage_helper_ptr;

				    const db::config& _cfg;

				@@ -125,18 +127,19 @@ public:

				    static audit& local_audit_instance() {

				        return audit_instance().local();

				    }

				    static future<> create_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm);

				    static future<> start_audit(const db::config& cfg, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm);

				    static future<> start_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm);

				    static future<> stop_audit();

				    static audit_info_ptr create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table);

				    static audit_info_ptr create_no_audit_info();

				    audit(locator::shared_token_metadata& stm, sstring&& storage_helper_name,

				    static audit_info_ptr create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table, bool batch = false);

				    audit(locator::shared_token_metadata& stm,

				          cql3::query_processor& qp,

				          service::migration_manager& mm,

				          std::set<sstring>&& audit_modes,

				          std::set<sstring>&& audited_keyspaces,

				          std::map<sstring, std::set<sstring>>&& audited_tables,

				          category_set&& audited_categories,

				          const db::config& cfg);

				    ~audit();

				    future<> start(const db::config& cfg, cql3::query_processor& qp, service::migration_manager& mm);

				    future<> start(const db::config& cfg);

				    future<> stop();

				    future<> shutdown();

				    bool should_log(const audit_info* audit_info) const;

									
audit/audit_cf_storage_helper.cc (10 lines changed)
				@@ -11,11 +11,11 @@

				#include "cql3/query_processor.hh"

				#include "data_dictionary/keyspace_metadata.hh"

				#include "utils/UUID_gen.hh"

				#include "utils/class_registrator.hh"

				#include "cql3/query_options.hh"

				#include "cql3/statements/ks_prop_defs.hh"

				#include "service/migration_manager.hh"

				#include "service/storage_proxy.hh"

				#include "locator/abstract_replication_strategy.hh"

				namespace audit {

				@@ -64,8 +64,8 @@ future<> audit_cf_storage_helper::migrate_audit_table(service::group0_guard grou

				            data_dictionary::database db = _qp.db();

				            cql3::statements::ks_prop_defs old_ks_prop_defs;

				            auto old_ks_metadata = old_ks_prop_defs.as_ks_metadata_update(

				                    ks->metadata(), *_qp.proxy().get_token_metadata_ptr(), db.features());

				            std::map<sstring, sstring> strategy_opts;

				                    ks->metadata(), *_qp.proxy().get_token_metadata_ptr(), db.features(), db.get_config());

				            locator::replication_strategy_config_options strategy_opts;

				            for (const auto &dc: _qp.proxy().get_token_metadata_ptr()->get_topology().get_datacenters())

				                strategy_opts[dc] = "3";

				@@ -73,6 +73,7 @@ future<> audit_cf_storage_helper::migrate_audit_table(service::group0_guard grou

				                                                                   "org.apache.cassandra.locator.NetworkTopologyStrategy",

				                                                                   strategy_opts,

				                                                                   std::nullopt, // initial_tablets

				                                                                   std::nullopt, // consistency_option

				                                                                   old_ks_metadata->durable_writes(),

				                                                                   old_ks_metadata->get_storage_options(),

				                                                                   old_ks_metadata->tables());

				@@ -196,7 +197,4 @@ cql3::query_options audit_cf_storage_helper::make_login_data(socket_address node

				    return cql3::query_options(cql3::default_cql_config, db::consistency_level::ONE, std::nullopt, std::move(values), false, cql3::query_options::specific_options::DEFAULT);

				}

				using registry = class_registrator<storage_helper, audit_cf_storage_helper, cql3::query_processor&, service::migration_manager&>;

				static registry registrator1("audit_cf_storage_helper");

				}

									
audit/audit_composite_storage_helper.cc (new file, 68 lines)
				@@ -0,0 +1,68 @@

				/*

				 * Copyright (C) 2025 ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#include <seastar/core/loop.hh>

				#include <seastar/core/future-util.hh>

				#include "audit/audit_composite_storage_helper.hh"

				#include "utils/class_registrator.hh"

				namespace audit {

				audit_composite_storage_helper::audit_composite_storage_helper(std::vector<std::unique_ptr<storage_helper>>&& storage_helpers)

				    : _storage_helpers(std::move(storage_helpers))

				{}

				future<> audit_composite_storage_helper::start(const db::config& cfg) {

				    auto res = seastar::parallel_for_each(

				        _storage_helpers,

				        [&cfg] (std::unique_ptr<storage_helper>& h) {

				            return h->start(cfg);

				        }

				    );

				    return res;

				}

				future<> audit_composite_storage_helper::stop() {

				    auto res = seastar::parallel_for_each(

				        _storage_helpers,

				        [] (std::unique_ptr<storage_helper>& h) {

				            return h->stop();

				        }

				    );

				    return res;

				}

				future<> audit_composite_storage_helper::write(const audit_info* audit_info,

				                                               socket_address node_ip,

				                                               socket_address client_ip,

				                                               db::consistency_level cl,

				                                               const sstring& username,

				                                               bool error) {

				    return seastar::parallel_for_each(

				        _storage_helpers,

				        [audit_info, node_ip, client_ip, cl, &username, error](std::unique_ptr<storage_helper>& h) {

				            return h->write(audit_info, node_ip, client_ip, cl, username, error);

				        }

				    );

				}

				future<> audit_composite_storage_helper::write_login(const sstring& username,

				                                                     socket_address node_ip,

				                                                     socket_address client_ip,

				                                                     bool error) {

				    return seastar::parallel_for_each(

				        _storage_helpers,

				        [&username, node_ip, client_ip, error](std::unique_ptr<storage_helper>& h) {

				            return h->write_login(username, node_ip, client_ip, error);

				        }

				    );

				}

				} // namespace audit
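
The new `audit_composite_storage_helper` forwards every call to each helper it owns, and `create_storage_helper` in audit/audit.cc only wraps helpers in it when more than one mode is configured. The sketch below shows the same composite shape in plain C++. It is synchronous and sequential, whereas the real class fans out with `seastar::parallel_for_each` and returns futures. The `sink`, `memory_sink`, `composite_sink`, and `make_sink` names are hypothetical and not part of Scylla.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, synchronous stand-in for the storage_helper interface.
struct sink {
    virtual ~sink() = default;
    virtual void write(const std::string& entry) = 0;
};

// Records entries in memory; stands in for the table or syslog backend.
struct memory_sink final : sink {
    std::vector<std::string>& log;
    explicit memory_sink(std::vector<std::string>& l) : log(l) {}
    void write(const std::string& entry) override { log.push_back(entry); }
};

// Composite: owns several sinks and forwards every call to each of them.
class composite_sink final : public sink {
    std::vector<std::unique_ptr<sink>> _sinks;
public:
    explicit composite_sink(std::vector<std::unique_ptr<sink>>&& sinks)
        : _sinks(std::move(sinks)) {}
    void write(const std::string& entry) override {
        for (auto& s : _sinks) {
            s->write(entry);
        }
    }
};

// Same selection rule as create_storage_helper(): a single sink is returned
// directly, several are wrapped in the composite. The caller is expected to
// pass at least one sink (the original asserts this).
inline std::unique_ptr<sink> make_sink(std::vector<std::unique_ptr<sink>>&& sinks) {
    if (sinks.size() == 1) {
        return std::move(sinks.front());
    }
    return std::make_unique<composite_sink>(std::move(sinks));
}
```

Because the composite implements the same interface as its children, the owning `audit` object can hold a single `std::unique_ptr<storage_helper>` regardless of how many modes are enabled.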

									
										37

audit/audit_composite_storage_helper.hh
									
										Normal file
									
												
				@@ -0,0 +1,37 @@

				/*

				 * Copyright (C) 2025 ScyllaDB

				 */

				/*

				 * SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0

				 */

				#pragma once

				#include "audit/audit.hh"

				#include <seastar/core/future.hh>

				#include "storage_helper.hh"

				namespace audit {

				class audit_composite_storage_helper : public storage_helper {

				    std::vector<std::unique_ptr<storage_helper>> _storage_helpers;

				public:

				    explicit audit_composite_storage_helper(std::vector<std::unique_ptr<storage_helper>>&&);

				    virtual ~audit_composite_storage_helper() = default;

				    virtual future<> start(const db::config& cfg) override;

				    virtual future<> stop() override;

				    virtual future<> write(const audit_info* audit_info,

				                           socket_address node_ip,

				                           socket_address client_ip,

				                           db::consistency_level cl,

				                           const sstring& username,

				                           bool error) override;

				    virtual future<> write_login(const sstring& username,

				                                 socket_address node_ip,

				                                 socket_address client_ip,

				                                 bool error) override;

				};

				} // namespace audit

									
										14

audit/audit_syslog_storage_helper.cc
									
												
				@@ -21,7 +21,6 @@

				#include <fmt/chrono.h>

				#include "cql3/query_processor.hh"

				-#include "utils/class_registrator.hh"

				namespace cql3 {

				@@ -54,10 +53,10 @@ static std::string json_escape(std::string_view str) {

				}

				-future<> audit_syslog_storage_helper::syslog_send_helper(const sstring& msg) {

				+future<> audit_syslog_storage_helper::syslog_send_helper(temporary_buffer<char> msg) {

				    try {

				        auto lock = co_await get_units(_semaphore, 1, std::chrono::hours(1));

				-        co_await _sender.send(_syslog_address, net::packet{msg.data(), msg.size()});

				+        co_await _sender.send(_syslog_address, std::span(&msg, 1));

				    }

				    catch (const std::exception& e) {

				        auto error_msg = seastar::format(

				@@ -91,7 +90,7 @@ future<> audit_syslog_storage_helper::start(const db::config& cfg) {

				        co_return;

				    }

				-    co_await syslog_send_helper("Initializing syslog audit backend.");

				+    co_await syslog_send_helper(temporary_buffer<char>::copy_of("Initializing syslog audit backend."));

				}

				future<> audit_syslog_storage_helper::stop() {

				@@ -121,7 +120,7 @@ future<> audit_syslog_storage_helper::write(const audit_info* audit_info,

				                                    audit_info->table(),

				                                    username);

				-    co_await syslog_send_helper(msg);

				+    co_await syslog_send_helper(std::move(msg).release());

				}

				future<> audit_syslog_storage_helper::write_login(const sstring& username,

				@@ -140,10 +139,7 @@ future<> audit_syslog_storage_helper::write_login(const sstring& username,

				                                    client_ip,

				                                    username);

				-    co_await syslog_send_helper(msg.c_str());

				+    co_await syslog_send_helper(std::move(msg).release());

				}

				-using registry = class_registrator<storage_helper, audit_syslog_storage_helper, cql3::query_processor&, service::migration_manager&>;

				-static registry registrator1("audit_syslog_storage_helper");

				}

Compare commits

3816 Commits

dani-tweig ... master

9

.github/CODEOWNERS vendored

View File

101

.github/copilot-instructions.md vendored Normal file

View File

2

.github/dependabot.yml vendored

View File

115

.github/instructions/cpp.instructions.md vendored Normal file

View File

51

.github/instructions/python.instructions.md vendored Normal file

View File

55

.github/scripts/auto-backport.py vendored

View File

20

.github/scripts/sync_labels.py vendored

View File

7

.github/workflows/add-label-when-promoted.yaml vendored

View File

5

.github/workflows/backport-pr-fixes-validation.yaml vendored

View File

53

.github/workflows/call_backport_with_jira.yaml vendored Normal file

View File

11

.github/workflows/call_jira_status_in_progress.yml vendored

View File

11

.github/workflows/call_jira_status_in_review.yml vendored

View File

13

.github/workflows/call_jira_status_ready_for_merge.yml vendored

View File

18

.github/workflows/call_jira_sync.yml vendored Normal file

View File

22

.github/workflows/call_jira_sync_pr_milestone.yml vendored Normal file

View File

14

.github/workflows/call_sync_milestone_to_jira.yml vendored Normal file

View File

13

.github/workflows/call_validate_pr_author_email.yml vendored Normal file

View File

62

.github/workflows/close_issue_for_scylla_associate.yml vendored Normal file

View File

2

.github/workflows/codespell.yaml vendored

View File

8

.github/workflows/docs-pages.yaml vendored

View File

7

.github/workflows/docs-pr.yaml vendored

View File

37

.github/workflows/docs-validate-metrics.yml vendored Normal file

View File

5

.github/workflows/iwyu.yaml vendored

View File

2

.github/workflows/read-toolchain.yaml vendored

View File

4

.github/workflows/sync-labels.yaml vendored

View File

66

.github/workflows/trigger-scylla-ci.yaml vendored Normal file

View File

242

.github/workflows/trigger_ci.yaml vendored Normal file

View File

3

.github/workflows/trigger_jenkins.yaml vendored

View File

66

CMakeLists.txt

View File

2

CONTRIBUTING.md

View File

8

HACKING.md

View File

4

README.md

View File

2

SCYLLA-VERSION-GEN

View File

5

alternator/CMakeLists.txt

View File

10

alternator/auth.cc

View File

2

alternator/auth.hh

View File

14

alternator/conditions.cc

View File

8

alternator/consumed_capacity.cc

View File

6

alternator/consumed_capacity.hh

View File

51

alternator/controller.cc

View File

7

alternator/controller.hh

View File

3

alternator/error.hh

View File

1678

alternator/executor.cc

View File

53

alternator/executor.hh

View File

29

alternator/expressions.g

View File

22

alternator/expressions.hh

View File

10

alternator/expressions_types.hh

View File

2

alternator/extract_from_attrs.hh

View File

301

alternator/http_compression.cc Normal file

View File

91

alternator/http_compression.hh Normal file

View File

109

alternator/parsed_expression_cache.cc Normal file

View File

12

alternator/rmw_operation.hh

View File

51

alternator/serialization.cc

View File

2

alternator/serialization.hh

View File

482

alternator/server.cc

View File

20

alternator/server.hh

View File

61

alternator/stats.cc

View File

63

alternator/stats.hh

View File

546

alternator/streams.cc

View File

118

alternator/ttl.cc

View File

4

alternator/ttl.hh

View File

26

alternator/ttl_tag.hh Normal file

View File

5

api/CMakeLists.txt

View File

2

api/api-doc/authorization_cache.json

View File

23

api/api-doc/client_routes.def.json Normal file

View File

74

api/api-doc/client_routes.json Normal file

View File

2

api/api-doc/messaging_service.json

View File

297

api/api-doc/storage_service.json

View File

15

api/api-doc/system.json

View File

2

api/api-doc/task_manager.json

View File

8

api/api-doc/tasks.json

View File

55

api/api.cc

View File

33

api/api.hh

View File

20

api/api_init.hh

View File

30

api/cache_service.cc

View File

7

api/cache_service.hh

View File

176

api/client_routes.cc Normal file

View File

20

api/client_routes.hh Normal file

View File

606

api/column_family.cc

View File